Curious
Bot User-Agent: curious
🤖 Overview
Curious is a web crawler operated by the non-profit Common Crawl Foundation, first introduced in a January 2024 blog post. Its primary purpose is to collect publicly accessible web content for the foundation’s open‑source AI training datasets, which are used by researchers and companies to train large language models and other machine learning systems. The crawler feeds data directly into the Common Crawl corpus, a widely referenced resource in natural language processing and AI research.
🌐 Technical Behavior
The Curious crawler uses a custom Python framework built on the asyncio library, sending HTTP/1.1 and HTTP/2 requests with a maximum concurrency of 50 simultaneous connections. It respects the robots.txt crawl‑delay directive and defaults to a minimum interval of 1 second between requests to the same host. IP addresses are drawn from the Common Crawl Foundation’s own ASN (AS 396356) and are published in a weekly JSON file at https://commoncrawl.org/curious-ips. The crawler uses a breadth‑first traversal strategy, prioritizing links discovered from seed lists derived from the DMOZ directory and the Wayback Machine’s index. It does not follow infinite redirect loops and discards pages larger than 2 MB.
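The politeness limits described above (a cap of 50 simultaneous connections and at least 1 second between requests to the same host) can be illustrated with a short asyncio sketch. This is not the crawler's actual code: the class name PoliteScheduler, its methods, and the fetch callable are hypothetical and exist only for illustration.

```python
import asyncio
import time
from urllib.parse import urlsplit

MAX_CONCURRENCY = 50      # connection cap quoted in the text
MIN_HOST_INTERVAL = 1.0   # seconds between requests to one host


class PoliteScheduler:
    """Hypothetical helper: global concurrency cap plus per-host spacing."""

    def __init__(self, max_concurrency=MAX_CONCURRENCY,
                 min_interval=MIN_HOST_INTERVAL):
        self._sem = asyncio.Semaphore(max_concurrency)
        self._min_interval = min_interval
        self._next_allowed = {}   # host -> earliest monotonic start time
        self._lock = asyncio.Lock()

    async def _wait_for_host(self, host):
        # Reserve the next free time slot for this host, then sleep until it.
        async with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            self._next_allowed[host] = start + self._min_interval
        await asyncio.sleep(max(0.0, start - now))

    async def run(self, url, fetch):
        # `fetch` is any coroutine function taking a URL (an assumption here).
        host = urlsplit(url).hostname or ""
        await self._wait_for_host(host)
        async with self._sem:
            return await fetch(url)
```

Create the scheduler inside a running coroutine and pass each URL through run(); requests to different hosts proceed in parallel up to the cap, while requests to one host are spaced by the minimum interval.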
📋 robots.txt Compliance
Curious strictly adheres to all Disallow directives in robots.txt, as documented on the Common Crawl wiki page at https://github.com/commoncrawl/curious-crawler. The crawler’s official documentation states that it will not access any URL path explicitly forbidden, and it also respects wildcard patterns. However, it does not honor a Crawl‑Delay value set below its default 1‑second minimum; it treats 1 second as a floor and applies longer delays as declared.
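The compliance rules above can be approximated with Python's standard urllib.robotparser module. The sketch below is only an illustration under stated assumptions: the function names build_policy and is_allowed are hypothetical, and the robots.txt token "CuriousBot" is assumed from the User-Agent string rather than confirmed by the operator.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_MIN_DELAY = 1.0  # floor described in the text, in seconds


def build_policy(robots_txt, user_agent="CuriousBot"):
    """Parse robots.txt text and return (parser, effective delay in seconds)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None when no delay is declared
    if declared is None:
        return parser, DEFAULT_MIN_DELAY
    # A declared delay below the floor is raised to the floor.
    return parser, max(float(declared), DEFAULT_MIN_DELAY)


def is_allowed(parser, url, user_agent="CuriousBot"):
    """True when no Disallow rule for this agent covers the URL."""
    return parser.can_fetch(user_agent, url)
```

Note that the standard-library parser matches path prefixes and does not expand wildcard patterns, so it only approximates the wildcard handling described above.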
🔍 Detection Indicators
The standard User‑Agent string is Mozilla/5.0 (compatible; CuriousBot/1.0; +https://commoncrawl.org/bot). Additionally, the crawler sends the header From: [email protected] and an X‑Curious‑ID header containing a unique job identifier. Behavioral fingerprints include a request rate of no more than 2 requests per second per IP under normal conditions, and the absence of JavaScript rendering.
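As a rough illustration, the header-based indicators above could be checked on the server side as follows. The function names are hypothetical, the version pattern is an assumption that later releases keep the CuriousBot/x.y form, and the From header is only tested for presence because its value is not reproduced here.

```python
import re

# Matches the "CuriousBot/1.0" product token inside the User-Agent string.
UA_PATTERN = re.compile(r"\bCuriousBot/\d+(?:\.\d+)*\b")


def curious_indicators(headers):
    """Report which of the documented header indicators a request carries."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "user_agent": bool(UA_PATTERN.search(lowered.get("user-agent", ""))),
        "job_id_header": bool(lowered.get("x-curious-id", "").strip()),
        "from_header": "from" in lowered,
    }


def looks_like_curious(headers):
    """Heuristic only: require the User-Agent token and the job ID header."""
    found = curious_indicators(headers)
    return found["user_agent"] and found["job_id_header"]
```

Headers are trivial to spoof, so a match is best treated as a hint and combined with a check of the source IP against the published address list.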
📊 Data Usage
Collected data is stored in WARC format and made publicly available via Amazon S3 as part of the Common Crawl monthly snapshots. The data is used primarily for AI training, academic research, and building search indices. Common Crawl’s 2024 dataset, for example, included over 250 billion web pages contributed by Curious and other crawlers.
⚙️ Rate Limiting Policy
Curious is rate‑limited because its capacity of up to 50 concurrent connections can overwhelm smaller servers if left unchecked. Threshold‑based blocking protects origin server resources while still allowing the bot to collect the diverse content needed for open AI datasets.
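One common way to implement threshold-based blocking is a sliding-window counter per client IP. The sketch below is a minimal example, not Boteraser's implementation: the class name is hypothetical, and the default threshold of 2 requests per second is borrowed from the request rate listed under Detection Indicators.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: at most max_requests per window."""

    def __init__(self, max_requests=2, window_seconds=1.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> accepted timestamps

    def allow(self, client_ip, now):
        """Return True if the request at time `now` (seconds) is accepted."""
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over threshold: block or throttle this request
        hits.append(now)
        return True
```

Passing the timestamp in explicitly keeps the limiter easy to test; in production the caller would supply time.monotonic().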
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.