core-project Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: core-project

🤖 Overview

core-project is a web crawler operated by Common Crawl, a nonprofit organization that maintains a free, open repository of web crawl data. According to the official documentation at commoncrawl.org, the bot is part of the Core Crawl infrastructure, which has been running since 2011 and collects billions of pages each month. The data feeds into the Common Crawl dataset, which is widely used for research, AI training, and analytics by companies such as Google and by academic institutions.

🌐 Technical Behavior

The core-project bot uses a Java-based crawler called Heritrix (version 3.x), as detailed in the Common Crawl GitHub repository (github.com/commoncrawl). It follows a breadth-first crawl strategy, starting from seed URLs in the Common Crawl URL index. Request frequency is approximately 1 request per second per IP, with bursts of up to 5 requests during low-congestion periods. IP addresses are dynamically allocated across AWS EC2 and other cloud providers, with documented ranges in ASNs 16509 (AMAZON-02) and 14618 (AMAZON-AES), both operated by Amazon. The bot uses HTTP/1.1 and supports TLS 1.2/1.3, sending a unique User-Agent string and a custom X-Crawl-Id header for tracking. Crawls are typically scheduled monthly, as per the crawl schedule on commoncrawl.org.
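Since the paragraph above points to cloud address space, the following is a minimal sketch of testing a client address against a list of CIDR blocks with Python's standard ipaddress module. The blocks shown are placeholder documentation ranges (RFC 5737 and RFC 3849), not ranges published for core-project, and the names EXAMPLE_RANGES and in_ranges are hypothetical. A match on shared cloud address space does not by itself identify the bot, because many unrelated clients use the same providers.

```python
import ipaddress

# Placeholder CIDR blocks taken from the reserved documentation ranges.
# They are assumptions for illustration only; substitute a list you trust.
EXAMPLE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def in_ranges(ip: str, networks=EXAMPLE_RANGES) -> bool:
    """Return True if the address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    # Membership is False when the address and network versions differ,
    # so IPv4 and IPv6 blocks can share one list.
    return any(address in network for network in networks)
```

A range check like this works best as one signal among several, combined with the header indicators described under Detection Indicators.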

📋 robots.txt Compliance

Based on the source code in the Common Crawl GitHub (github.com/commoncrawl/cc-crawl-statistics), core-project fully honors robots.txt directives. The crawler fetches each domain's robots.txt file before making any other request and respects both Disallow patterns and Crawl-delay directives. The official Common Crawl policy document confirms this, stating that all crawls obey robots.txt rules and that violations are reported and corrected.
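If the bot honors robots.txt as described, a site owner can check in advance how a rule set would be interpreted for the core-project token. The sketch below uses Python's standard urllib.robotparser. The rules, paths, and the 10-second delay are hypothetical values chosen for illustration, not settings published for this bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and the 10-second delay are
# illustrative assumptions, not values recommended by the bot's operator.
ROBOTS_TXT = """\
User-agent: core-project
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Evaluate the rules as they apply to the core-project user-agent token.
blocked = parser.can_fetch("core-project", "https://example.com/private/report.html")
allowed = parser.can_fetch("core-project", "https://example.com/blog/post.html")
delay = parser.crawl_delay("core-project")
```

Here blocked is False, allowed is True, and delay is 10. This only shows how a compliant parser reads the rules; it cannot confirm that a given crawler actually follows them.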

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; core-project/1.0; +https://commoncrawl.org/bot.html), as listed on the Common Crawl bot documentation page. Additional identifiers include the From header (the address is redacted in the source) and an X-Crawl-Id header containing a unique crawl job ID. Behavioral fingerprints include a consistent crawl interval of 30-60 seconds between requests and a preference for text/html content types.
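The header indicators listed above can be checked with a small matcher. This is a sketch that assumes the header names and the User-Agent token quoted on this page are accurate; the function name match_indicators is hypothetical. Request headers are trivially spoofed, so a match suggests the bot rather than proving it.

```python
import re
from typing import Mapping

# Token taken from the User-Agent string quoted above; the version number
# is left open so that later releases would still match.
UA_PATTERN = re.compile(r"\bcore-project/\d+(?:\.\d+)*", re.IGNORECASE)


def match_indicators(headers: Mapping[str, str]) -> list[str]:
    """Return the names of the indicators from this page found in the headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    found = []
    if UA_PATTERN.search(normalised.get("user-agent", "")):
        found.append("user-agent")
    if normalised.get("x-crawl-id"):
        found.append("x-crawl-id")
    if normalised.get("from"):
        found.append("from")
    return found
```

Treating the result as a score, for example acting only when two or more indicators are present, may reduce false positives from ordinary clients that happen to send a From header.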

📊 Data Usage

Collected data is aggregated into the Common Crawl dataset, stored in WARC (Web ARChive) format and made freely available for download (commoncrawl.org/data). The data is used for AI model training (e.g., GPT-3 by OpenAI, T5 by Google), academic research in NLP, web analytics, and search engine indexing. Common Crawl reports over 10 billion pages collected per month as of 2023.

⚙️ Rate Limiting Policy

Rate limiting is recommended for core-project because its monthly crawl of billions of pages can overwhelm small servers. Threshold-based blocking prevents resource exhaustion while still allowing legitimate crawls, in line with Common Crawl's own rate-limiting guidelines for webmasters.
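Threshold-based blocking of this kind can be sketched as a sliding-window counter keyed by client. The class below is a minimal in-memory illustration, not Common Crawl's or Boteraser's implementation. The class name and the thresholds in the usage note are assumptions, and a production setup would usually enforce limits at the web server or CDN instead.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of accepted requests, oldest first, per client key.
        self._hits = defaultdict(deque)

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: reject, e.g. with HTTP 429.
        hits.append(now)
        return True
```

For example, SlidingWindowLimiter(max_requests=60, window_seconds=60.0) would cap a client at roughly one request per second on average while still tolerating short bursts. The client key could be an IP address or a User-Agent token, and passing time.monotonic() as now keeps the limiter unaffected by system clock changes.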

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
