www collector

Bot User-Agent: www-collector

🤖 Overview

www collector is a web crawler operated by the Common Crawl Foundation, a non-profit organization that maintains a free, open repository of web crawl data. Its primary purpose is to systematically archive publicly accessible web pages for inclusion in the Common Crawl dataset, which supports research, machine learning, and AI training applications. The bot is designed to collect large-scale, representative samples of the web rather than targeted content.

🌐 Technical Behavior

The bot employs a breadth-first crawl strategy, starting from a seed set of URLs and following links recursively. Requests are made over HTTP/1.1 and HTTP/2, and the crawl rate can reach 10 requests per second per IP address during peak operation. Its source addresses are reported under ASN 396982 and often fall within the 54.192.0.0/10 and 52.84.0.0/15 blocks; because these blocks are very large, an address match on its own does not identify the bot. It honors both the Disallow and Crawl-delay directives in robots.txt. The bot identifies itself via the User-Agent header as CCBot/2.0 (https://commoncrawl.org/faq/), though variations such as CCBot/3.0 have also been reported.
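As a rough illustration, a log-processing script could test whether a client address falls inside the CIDR blocks cited above. This is a minimal sketch using Python's standard ipaddress module; the block list is copied from this page and is an unverified assumption, and the name in_cited_blocks is hypothetical.

```python
import ipaddress

# CIDR blocks as cited in the text above. They are assumptions taken from
# this page, not values confirmed against the crawler operator's own list.
CITED_BLOCKS = [
    ipaddress.ip_network("54.192.0.0/10"),
    ipaddress.ip_network("52.84.0.0/15"),
]


def in_cited_blocks(addr: str) -> bool:
    """Return True if addr parses as an IP address inside any cited block."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Malformed input (e.g. a hostname or an empty log field).
        return False
    return any(ip in net for net in CITED_BLOCKS)
```

A match only narrows the candidates: blocks this large carry a great deal of unrelated traffic, so the result is best combined with the User-Agent and behavioral signals described elsewhere on this page.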

📋 robots.txt Compliance

According to the official Common Crawl FAQ, the www collector bot honors Disallow directives in robots.txt and respects Crawl-delay instructions, applying a default delay of 1 second when none is specified. Even so, its distributed operation and high concurrency mean that the aggregate request volume can still overwhelm small sites.
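To see how such rules would be interpreted, a site owner can check a draft robots.txt with Python's standard urllib.robotparser before deploying it. The sketch below is illustrative only: the rules, the 5-second delay, and the example.com URLs are hypothetical, and the standard-library parser is a stand-in that may not match the crawler's own parsing in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow CCBot down and keep it out of /private/,
# while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Evaluate the rules as they would apply to the CCBot/2.0 User-Agent token.
blocked = parser.can_fetch("CCBot/2.0", "https://example.com/private/report.html")
allowed = parser.can_fetch("CCBot/2.0", "https://example.com/blog/")
delay = parser.crawl_delay("CCBot/2.0")

print(blocked, allowed, delay)
```

Running the same checks with a different agent name confirms that the CCBot-specific group does not affect other crawlers.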

🔍 Detection Indicators

The primary User-Agent string is CCBot/2.0 (https://commoncrawl.org/faq/), with additional variations such as CCBot/3.0 and Mozilla/5.0 (compatible; CCBot/2.0). Behavioral fingerprints include a high frequency of requests from contiguous IPv4 addresses within the same /24 subnet, no JavaScript execution, and a tendency to request robots.txt before each crawl session. The bot does not send a custom From request header. X-Robots-Tag is a response header set by the server, so it is not expected in the bot's requests either.
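Two of these indicators, the User-Agent token and the clustering of requests within a /24 subnet, can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the regular expression is a guess at a reasonable pattern for the strings listed above, and the function names looks_like_ccbot and requests_per_slash24 are hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Matches the "CCBot/<version>" token anywhere in a User-Agent string.
CCBOT_UA = re.compile(r"\bCCBot/\d+(\.\d+)?", re.IGNORECASE)


def looks_like_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a CCBot version token."""
    return bool(CCBOT_UA.search(user_agent))


def requests_per_slash24(ips):
    """Count requests per IPv4 /24 subnet; skip IPv6 and malformed entries."""
    counts = Counter()
    for raw in ips:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            continue
        if ip.version != 4:
            continue
        # strict=False masks the host bits, e.g. 54.200.1.77 -> 54.200.1.0/24.
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[str(net)] += 1
    return counts
```

Because any client can send an arbitrary User-Agent string, a match from looks_like_ccbot is a hint rather than proof; the subnet counts help show whether the traffic pattern agrees with it.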

📊 Data Usage

The collected data is used to build and maintain the Common Crawl dataset, which is freely downloadable for use in natural language processing (NLP) tasks, large language model (LLM) training, academic research, and web analytics. For instance, filtered Common Crawl data formed part of the training corpus for GPT-3, and the C4 corpus used to train T5 is derived from Common Crawl. Common Crawl also provides derivatives such as the Common Crawl Index, a URL index for locating captured pages within the archive.

⚙️ Rate Limiting Policy

While www collector is a legitimate crawler, its high request rate and distributed IP pool can cause resource contention on smaller servers. Per-IP rate limiting (e.g., a threshold of 10 requests per second per IP) is recommended to prevent service degradation, since the bot is built for broad crawling and does not tune its load to any single site.
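One way to apply such a threshold is a sliding-window counter keyed by client IP. The sketch below is an in-memory illustration rather than a production limiter: the class name SlidingWindowLimiter is hypothetical, the default of 10 requests per 1-second window mirrors the example threshold above, and a real deployment would more likely configure this in the web server or a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests=10, window_seconds=1.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        q = self.hits[ip]
        # Discard timestamps that have aged out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # over the threshold: caller should reject, e.g. HTTP 429
        q.append(now)
        return True
```

A request handler would call limiter.allow(client_ip) once per request and reject the request when it returns False. Since the bot's addresses are spread across many IPs, a per-IP limit bounds each address individually but not the aggregate load.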

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.