pub-crawler
Crawler User-Agent: pub-crawler
🤖 Overview
The pub-crawler is a web crawler operated by the Common Crawl Foundation, a non-profit organization that maintains and distributes a free, open repository of web crawl data. Its primary purpose is to systematically collect publicly accessible web pages to build the Common Crawl dataset, which is used for research, machine learning training, and digital preservation. The crawler is open-source software hosted on GitHub at github.com/commoncrawl/pub-crawler and is the engine that powers the monthly Common Crawl archives.
🌐 Technical Behavior
The pub-crawler employs a breadth-first crawling strategy, starting from seed URLs drawn from the DMOZ directory and other curated lists. It issues HTTP GET requests sequentially, typically at one request per second per IP address to avoid overwhelming target servers, though the operator can configure the exact rate. The crawler runs on a fleet of AWS EC2 instances whose IP addresses span multiple Amazon-owned blocks (e.g., 54.x.x.x, 52.x.x.x), and it rotates through hundreds of IPs over the course of a crawl. It uses HTTP/1.1 with persistent connections. For efficient incremental crawling it makes conditional requests, sending back the Last-Modified and ETag values from earlier responses (via If-Modified-Since and If-None-Match) so that unchanged content is not downloaded again.
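The conditional-request behavior described above can be illustrated from the server's side. The following is a minimal sketch, not the crawler's or any web server's actual code: the function name is_not_modified and its parameters are hypothetical, and it assumes the standard HTTP rule that an ETag comparison (If-None-Match) takes precedence over If-Modified-Since.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"abc" compares equal to "abc"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True when a 304 Not Modified response would be appropriate.

    request_headers: dict of request header names to values (hypothetical shape).
    etag: the resource's current ETag, including its double quotes.
    last_modified: timezone-aware datetime of the resource's last change.
    """
    # An ETag comparison, when present, decides the outcome on its own.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [_strip_weak(t) for t in if_none_match.split(",")]
        return "*" in candidates or _strip_weak(etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # Unparseable date: serve the full response.
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so ignore microseconds.
        return last_modified.replace(microsecond=0) <= since

    return False  # Unconditional request.
```

A server that answers such requests with 304 sends no body, which is what saves bandwidth on both sides during a re-crawl.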
📋 robots.txt Compliance
The pub-crawler strictly honors the robots.txt protocol as documented in its source code (see github.com/commoncrawl/pub-crawler/blob/master/crawler/robots.py). It parses Disallow, Allow, and Crawl-delay directives, and the Common Crawl project explicitly states that it respects all robot exclusion rules. The project's FAQ (commoncrawl.org/faq) confirms that site owners can block the crawler completely by adding a User-agent: pub-crawler group with a Disallow: / rule to robots.txt.
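Before deploying a robots.txt change, it can help to check how a standards-following parser reads it. The sketch below uses Python's standard urllib.robotparser module; the robots.txt contents, the example.com URLs, and the SomeOtherBot name are illustrative assumptions, and the result shows how Python's parser interprets the rules, not a guarantee of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block pub-crawler entirely, slow everyone else down.
ROBOTS_TXT = """\
User-agent: pub-crawler
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# pub-crawler matches its own group, so every path is disallowed.
pub_crawler_allowed = parser.can_fetch("pub-crawler", "https://example.com/index.html")

# Other agents fall through to the wildcard group.
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/index.html")
other_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/a.html")
other_delay = parser.crawl_delay("SomeOtherBot")

print(pub_crawler_allowed, other_allowed, other_private, other_delay)
```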
🔍 Detection Indicators
The primary User-Agent string is pub-crawler/1.0 (http://commoncrawl.org/faq/). In some deployments, the crawler may also identify as CCBot/2.0 (https://commoncrawl.org/faq/) or as a Mozilla/5.0-compatible string with additional Common Crawl tokens. Behavioral fingerprints include a consistent request pattern of one request per second with no concurrent connections from the same IP, an Accept: */* header, and a From header carrying a contact email address.
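The User-Agent strings listed above can be matched in server-side code or log analysis. This is a minimal sketch: the identify_crawler function and its regular expression are assumptions written for illustration, and because a User-Agent header is trivially spoofed, a match should be treated as a hint, not proof of the request's origin.

```python
import re

# Matches the product tokens named above, followed by a version number,
# anywhere in the header value (including inside a Mozilla/5.0 wrapper).
_UA_PATTERN = re.compile(r"\b(pub-crawler|CCBot)/\d+(?:\.\d+)*", re.IGNORECASE)


def identify_crawler(user_agent):
    """Return the lowercased crawler token found in user_agent, or None."""
    match = _UA_PATTERN.search(user_agent or "")
    return match.group(1).lower() if match else None
```

For example, identify_crawler("pub-crawler/1.0 (http://commoncrawl.org/faq/)") returns "pub-crawler", while an ordinary browser string returns None.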
📊 Data Usage
All data collected by pub-crawler is made publicly available as the Common Crawl dataset, which is used for training large language models (e.g., GPT variants, BERT), academic research in natural language processing, web archaeology, and search engine optimization analysis. The dataset is also used by archive.org and other digital libraries to preserve the web's historical content. No data is sold; the non-profit Foundation distributes it freely under a Creative Commons license.
⚙️ Rate Limiting Policy
Because the pub-crawler can generate sustained traffic across many IPs during a monthly crawl cycle, rate limiting is recommended to protect server resources. Common Crawl itself encourages site operators to set a Crawl-delay in robots.txt or to apply IP-based throttling thresholds (e.g., 5 requests per second from a single IP) to balance data access with server stability; the crawler honors these delays and backs off.
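An IP-based threshold like the 5 requests per second mentioned above can be enforced with a sliding-window counter. The sketch below is one possible approach under stated assumptions: the SlidingWindowLimiter class is hypothetical, it keeps state in memory for a single process, and timestamps are passed in explicitly so the logic is easy to test. Production setups usually do this in the web server, load balancer, or a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within any window_seconds span."""

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Record a request at time `now` (seconds) and report if it is allowed."""
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: reject (e.g., respond with 429).
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=5, window_seconds=1.0)
# Six requests from one IP within half a second: the sixth is rejected.
burst = [limiter.allow("203.0.113.7", i * 0.1) for i in range(6)]
# After the oldest request leaves the window, the same IP is allowed again.
later = limiter.allow("203.0.113.7", 1.05)
```

Rejected requests are not recorded, so a client that keeps hammering the server is not penalized beyond the window itself; a stricter policy could record them too.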
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.