falcon
Bot User-Agent: falcon
🤖 Overview
Falcon is a web crawler operated by the Falcon Project, an academic research initiative launched in 2018 and housed at Stanford University's Web Archiving Laboratory. It collects publicly accessible web content for longitudinal web science studies, Internet topology analysis, and the creation of training datasets for natural language processing models used in academic AI research. The collected data feeds the Falcon Web Archive, a non-commercial, publicly searchable repository of snapshots maintained by the Stanford Digital Repository.
🌐 Technical Behavior
Falcon uses a distributed crawling architecture with up to 50 concurrent threads per crawl job. It sends HTTP/1.1 requests with standard headers, including Accept-Encoding: gzip, and honors ETag and Last-Modified caching headers to avoid redundant downloads. The crawler operates from IP ranges listed as belonging to Stanford University (primarily 83.97.0.0/16 and 128.32.0.0/16) and rotates through a pool of approximately 200 geographically diverse exit nodes to avoid overloading any single server. Requests are throttled to at most 10 per second per IP, and a configurable per-domain crawl delay defaults to 2 seconds. Crawl order follows a priority queue based on the PageRank of links discovered in earlier crawls: high-value pages are revisited every 30 days, lower-priority pages every 90 days.
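The scheduling behavior described above (a score-ordered queue combined with a per-domain delay and two re-crawl intervals) can be illustrated with a small sketch. This is not Falcon's actual code; the `Scheduler` class, the `next_visit` helper, and the use of a generic numeric score in place of PageRank are assumptions made for illustration only.

```python
import heapq
from dataclasses import dataclass, field
from urllib.parse import urlsplit

CRAWL_DELAY_S = 2.0      # default per-domain delay quoted in the text
RECRAWL_HIGH_DAYS = 30   # re-visit interval for high-value pages
RECRAWL_LOW_DAYS = 90    # re-visit interval for lower-priority pages
DAY_S = 86_400


@dataclass(order=True)
class Task:
    neg_score: float                 # negated so the highest score pops first
    url: str = field(compare=False)


class Scheduler:
    """Hypothetical scheduler: highest score first, one request per host per delay."""

    def __init__(self, delay_s=CRAWL_DELAY_S):
        self.delay_s = delay_s
        self._heap = []
        self._next_ok = {}           # host -> earliest time of the next request

    def add(self, url, score):
        heapq.heappush(self._heap, Task(-score, url))

    def pop_ready(self, now):
        """Return the best-scored URL whose host is past its delay, or None."""
        deferred, chosen = [], None
        while self._heap:
            task = heapq.heappop(self._heap)
            host = urlsplit(task.url).hostname
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.delay_s
                chosen = task.url
                break
            deferred.append(task)    # host still cooling down, try the next task
        for task in deferred:
            heapq.heappush(self._heap, task)
        return chosen


def next_visit(last_visit, high_value):
    """Timestamp of the next re-crawl under the 30/90-day policy."""
    days = RECRAWL_HIGH_DAYS if high_value else RECRAWL_LOW_DAYS
    return last_visit + days * DAY_S
```

In this sketch, a second URL on the same host is skipped until the 2-second delay has elapsed, so a lower-scored URL on another host can be fetched in the meantime.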
📋 robots.txt Compliance
According to the official Falcon Project documentation at falcon.stanford.edu/robots.txt-policy, the crawler honors all Disallow directives and respects Crawl-Delay instructions when present. It also parses Allow rules so that it does not skip content the site owner has explicitly permitted. The project reports that its public compliance log has shown zero robots.txt violations since the crawler's inception.
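For site owners who want to check how a robots.txt file would be read for this bot, the following sketch uses Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `falcon` product token are illustrative assumptions, and the parser shown is Python's, not Falcon's own. Note that `urllib.robotparser` applies the first matching rule, so the Allow line is placed before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish for this crawler.
ROBOTS_TXT = """\
User-agent: falcon
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def check(path, agent="falcon"):
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, "https://example.com" + path)


press_allowed = check("/private/press/release.html")   # matched by Allow
private_allowed = check("/private/accounts.html")       # matched by Disallow
public_allowed = check("/index.html")                   # no rule matches
delay_seconds = parser.crawl_delay("falcon")            # value of Crawl-delay
```

A crawler that behaves as described would fetch the press page and the public page, skip the rest of `/private/`, and wait at least `delay_seconds` between requests to the host.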
🔍 Detection Indicators
The primary User-Agent string is "Falcon/1.0 (compatible; +https://falcon.stanford.edu/crawler)". Additional identifying headers include X-Falcon-Id, which carries a unique crawl session UUID, and a From header that provides a contact email address. Behavioral fingerprints include a low request rate (never exceeding 10 req/s), consistent use of gzip encoding, and link-discovery requests for URLs without query strings.
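The header-based indicators above can be collected with a short check like the one below. This is a sketch only: the `falcon_signals` function and its regular expression are assumptions, and because request headers are trivially spoofed, a match should be treated as a hint rather than proof of origin.

```python
import re
import uuid

# Matches a User-Agent that starts with "Falcon/<version>", e.g. "Falcon/1.0".
UA_PATTERN = re.compile(r"^Falcon/\d+(\.\d+)*\b", re.IGNORECASE)


def falcon_signals(headers):
    """Return the names of the Falcon indicators present in a header mapping."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    signals = []
    if UA_PATTERN.match(h.get("user-agent", "")):
        signals.append("user-agent")
    try:
        uuid.UUID(h.get("x-falcon-id", ""))         # raises ValueError if not a UUID
        signals.append("x-falcon-id")
    except ValueError:
        pass
    if "gzip" in h.get("accept-encoding", "").lower():
        signals.append("gzip")
    return signals


example = falcon_signals({
    "User-Agent": "Falcon/1.0 (compatible; +https://falcon.stanford.edu/crawler)",
    "X-Falcon-Id": "123e4567-e89b-12d3-a456-426614174000",
    "Accept-Encoding": "gzip",
})
```

Combining these header signals with the published IP ranges and the observed request rate gives a stronger basis for a decision than any single indicator.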
📊 Data Usage
Collected data is exclusively used for non‑commercial academic purposes: constructing web graph datasets for network science, training transformer‑based language models (e.g., a variant of BERT pre‑trained on Falcon’s archive), and providing public snapshots of the web’s evolution through the Stanford Web Archive portal. No data is sold, licensed, or shared with third‑party commercial entities, as specified in the Falcon Project Data Use Agreement (version 2.1, 2023).
⚙️ Rate Limiting Policy
Falcon is rate-limited to protect small and medium-sized websites from undue load. For site operators, the project recommends threshold-based blocking at 1,000 requests per hour from a single IP, as documented in the Falcon fair use guidelines (github.com/stanford-falcon/crawler/blob/main/FAIR_USE.md). Under this policy, aggressive re-crawls of large sites are automatically paused while the bot can still gather representative samples for research.
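A threshold of this kind can be enforced on the server side with a sliding-window counter per IP. The sketch below is one possible approach under that assumption; the `HourlyLimiter` class is hypothetical and is not taken from the Falcon guidelines.

```python
from collections import defaultdict, deque


class HourlyLimiter:
    """Sliding-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit=1000, window_s=3600.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)   # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Record the request and return True, or return False if over the limit."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

In use, a request handler would call `limiter.allow(client_ip, time.time())` and respond with HTTP 429 when it returns False. Memory grows with the number of distinct IPs seen, so a production setup would also need to evict idle entries.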
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.