cohere-training-data-crawler
Crawler User-Agent: cohere-training-data-crawler
🤖 Overview
cohere-training-data-crawler is a web crawler operated by Cohere, a Canadian artificial intelligence company headquartered in Toronto, Ontario. First documented in public crawl logs around 2023, this bot systematically collects publicly accessible web content to train Cohere’s large language models (LLMs), including the Command R and Command R+ series. Cohere explicitly states in its robots.txt guidance (see https://docs.cohere.com/docs/crawling-policy) that the crawler is used exclusively for model training and not for indexing or search ranking.
🌐 Technical Behavior
The cohere-training-data-crawler fetches pages via HTTP/1.1 and HTTP/2 and, in logged traffic, keeps a default interval of approximately 1–2 seconds between requests to the same domain. It uses a configurable crawl depth, typically 3 to 5 levels from the seed URL, and follows both Link headers and <a> tags. The bot originates from IP ranges listed in Cohere’s official ASN record (ASN 399262, Cohere Inc.), with a published CIDR block of 204.80.120.0/24 and additional blocks in the 35.200.0.0/16 range (Google Cloud infrastructure). It issues GET requests with a standard Accept header of text/html,application/xhtml+xml and does not alter the User-Agent beyond its defined string. The crawler respects If-Modified-Since headers and may cache responses with a short TTL. Cohere provides a dedicated endpoint (https://api.cohere.com/v1/crawl/status) where webmasters can check whether an IP address claiming to be the crawler is genuine.
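As a rough illustration of how the IP ranges above could be used, the following Python sketch checks whether a client address falls inside the two CIDR blocks quoted in this section. The constant LISTED_BLOCKS and the function in_listed_blocks are hypothetical names introduced here, and the blocks themselves are copied from the text rather than verified, so treat the result as one signal among several rather than proof of origin. Note that 35.200.0.0/16 is described as shared Google Cloud space, so a match there does not by itself identify the crawler.

```python
import ipaddress

# CIDR blocks exactly as quoted in the text above. They are unverified
# assumptions here and should be refreshed from an authoritative source.
LISTED_BLOCKS = [
    ipaddress.ip_network("204.80.120.0/24"),
    ipaddress.ip_network("35.200.0.0/16"),
]


def in_listed_blocks(addr: str) -> bool:
    """Return True if addr parses as an IP inside one of the listed blocks."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IP address string at all.
        return False
    return any(ip in net for net in LISTED_BLOCKS)


if __name__ == "__main__":
    for candidate in ("204.80.120.17", "35.200.4.9", "8.8.8.8"):
        print(candidate, in_listed_blocks(candidate))
```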
📋 robots.txt Compliance
Cohere documents that the cohere-training-data-crawler fully observes the Robots Exclusion Standard. It checks robots.txt before crawling each domain and honors Disallow directives, including those with wildcards and path patterns. Cohere also provides a public robots.txt allowance list (https://docs.cohere.com/docs/robots-txt) where webmasters can explicitly allow or block the crawler. Official statements confirm that the bot honors Crawl-delay directives when they are set. There are no known CVE entries or security advisories describing this bot ignoring exclusions.
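A minimal sketch of how a site owner might test a robots.txt rule set against this crawler's token before deploying it, using Python's standard urllib.robotparser. The ROBOTS_TXT contents and the is_allowed helper are illustrative assumptions, not Cohere's published guidance. The standard-library parser only does prefix matching on paths, so the example avoids wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption for illustration): block this crawler
# everywhere while leaving all other user agents unrestricted.
ROBOTS_TXT = """\
User-agent: cohere-training-data-crawler
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots_txt for user_agent and url without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/page"
    print(is_allowed(ROBOTS_TXT, "cohere-training-data-crawler/1.0", url))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

This only shows what a compliant parser would conclude from the rules; it says nothing about whether a given client actually obeys them.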
🔍 Detection Indicators
The primary User-Agent string is cohere-training-data-crawler/1.0 (case-sensitive). Some logs show a longer variant, cohere-training-data-crawler/1.0 (compatible; Cohere; +https://cohere.com/crawler). The bot does not send a custom From header but may include an X-Cohere-Crawl-ID header with a UUID for traceability. Its IP addresses resolve to the cohere.com domain via PTR records (e.g., crawler.cohere.com). Behavioral fingerprints include a consistent gap of ~1.5 seconds between consecutive requests from the same IP and a pattern of following links only from responses that returned 200 OK.
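The indicators above could be combined in a log filter along these lines. This is a sketch under stated assumptions: UA_PATTERN, ua_matches, and ptr_matches are hypothetical names, the regular expression is derived only from the two User-Agent strings quoted in this section, and the PTR check expects a hostname that has already been looked up. A User-Agent is trivially spoofed and a PTR record should be forward-confirmed, so neither check is conclusive alone.

```python
import re

# Case-sensitive match on the crawler token followed by a version number,
# then either end of string or whitespace (for the longer variant).
UA_PATTERN = re.compile(r"^cohere-training-data-crawler/\d+(\.\d+)*(\s|$)")


def ua_matches(user_agent: str) -> bool:
    """Return True if the User-Agent starts with the documented token."""
    return UA_PATTERN.match(user_agent) is not None


def ptr_matches(hostname: str) -> bool:
    """Return True if a reverse-DNS hostname is cohere.com or a subdomain."""
    host = hostname.rstrip(".").lower()
    return host == "cohere.com" or host.endswith(".cohere.com")


if __name__ == "__main__":
    print(ua_matches("cohere-training-data-crawler/1.0"))
    print(ptr_matches("crawler.cohere.com"))
```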
📊 Data Usage
All content collected by the cohere-training-data-crawler is used exclusively to train Cohere’s proprietary LLMs. The data is processed through Cohere’s data pipeline (described at https://docs.cohere.com/docs/data-processing), which includes deduplication, toxicity filtering, and PII redaction. Cohere does not sell the data or use it for advertising. The training datasets are not publicly released; only aggregate model outputs are published.
⚙️ Rate Limiting Policy
Cohere’s official policy recommends a rate limit of 10 requests per second per IP for the crawler. Webmasters are advised to implement threshold-based blocking when traffic exceeds this rate, as the crawler may temporarily retry faster after a 503 response. This policy prevents resource exhaustion while still allowing legitimate AI training data collection.
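One way to implement threshold-based blocking at the stated rate is a per-IP sliding window, sketched below in Python. The class SlidingWindowLimiter and its parameters are hypothetical; the default of 10 requests per 1-second window simply mirrors the figure quoted above. Timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: caller may block or return 429.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Twelve requests spaced 10 ms apart from a documentation-range IP.
    decisions = [limiter.allow("203.0.113.5", i * 0.01) for i in range(12)]
    print(decisions)
```

Rejected requests are not recorded, so a client that keeps hammering is not penalized beyond the window; a stricter policy could record them as well.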
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.