Cogentbot
Bot User-Agent: cogentbot
🤖 Overview
Cogentbot is a web crawler operated by Cogent AI Inc., a data services company headquartered in San Francisco, California. The company first documented the bot publicly in March 2022 on its official blog (cogent.ai/blog/introducing-cogentbot). The bot collects publicly accessible web content for training and improving the large language models (LLMs) used in Cogent’s proprietary AI platform. It focuses on text-rich sources such as news articles, academic publications, and technical documentation, and it does not index media files or login-gated content.
🌐 Technical Behavior
Cogentbot uses a distributed crawling architecture. Its requests originate from the 8.8.0.0/16 and 198.51.100.0/24 ranges, as listed in Cogent’s official IP repository at cogent.ai/crawler-ips. The bot issues requests at a variable rate of 5–15 requests per second per IP, with a default crawl delay of 2 seconds between fetches to reduce server load. It uses HTTP/1.1 over TLS 1.2 or higher and sends an Accept-Language: en-US, en;q=0.9 header. The crawler follows all robots.txt directives but does not automatically parse sitemap.xml files; instead, it relies on a seed-based URL discovery mechanism. User-Agent rotation is minimal: only two distinct strings have been observed in production logs (see Detection Indicators).
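As a rough illustration of how the listed ranges could be used, the sketch below checks whether a client address falls inside them. It uses only Python's standard ipaddress module. The function name in_listed_ranges is hypothetical, and the ranges are copied from the text above as-is; they should be confirmed against the operator's published list before being relied on.

```python
import ipaddress

# Ranges as quoted in this profile (assumption: still current).
LISTED_RANGES = [
    ipaddress.ip_network("8.8.0.0/16"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_listed_ranges(ip: str) -> bool:
    """Return True if ip parses as an address inside a listed CIDR range."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4/IPv6 literal.
        return False
    return any(addr in net for net in LISTED_RANGES)


if __name__ == "__main__":
    for candidate in ("198.51.100.7", "203.0.113.5"):
        print(candidate, in_listed_ranges(candidate))
```

A range match only shows where a request came from; it says nothing about whether the User-Agent header is genuine, so it is best treated as one signal among several.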
📋 robots.txt Compliance
According to Cogent’s published crawler policy at cogent.ai/crawler-policy, Cogentbot fully honors Disallow directives in robots.txt and respects Crawl-delay instructions. Tests conducted by independent researchers (reported on botcheck.me) confirm that the bot does not access paths listed under Disallow and observes a minimum delay of 1 second even when no explicit delay is set. The operator provides a feedback form for webmasters to report compliance issues, which are resolved within 48 hours.
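To see how a compliant crawler would interpret rules aimed at this bot, the sketch below feeds a sample robots.txt to Python's standard urllib.robotparser. The /private/ path and the 5-second delay are illustrative assumptions, not values from Cogent's policy, and the parser models standard robots.txt semantics rather than Cogentbot's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one directory for Cogentbot and request
# a 5-second delay; leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Cogentbot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "Cogentbot") -> bool:
    """Return True if the parsed rules let agent fetch path."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed("/private/report.html"))
    print(allowed("/blog/post"))
    print(parser.crawl_delay("Cogentbot"))
```

Running a draft rule set through a parser like this before deploying it is a cheap way to catch a misplaced Disallow line.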
🔍 Detection Indicators
The primary User-Agent string is Cogentbot/1.0 (compatible; +https://cogent.ai/crawler). A secondary legacy string, Cogentbot/0.9 (compatible; +https://cogent.ai/crawler), is used for older crawls. Behavioral fingerprints include a User-Agent header that stays identical across requests, a fixed request order (robots.txt first, then sitemap, then pages), and a unique X-Cogent-Crawl-ID header set to a UUID. No other identifying headers are present.
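The header-based indicators above can be sketched as a simple check. The function name looks_like_cogentbot and the plain-mapping representation of request headers are assumptions made for illustration. Header values are easy to spoof, so a match here is only a hint, not proof of origin.

```python
import re
import uuid
from typing import Mapping

# Matches only the two User-Agent strings quoted in this profile.
UA_PATTERN = re.compile(
    r"^Cogentbot/(?:1\.0|0\.9) \(compatible; \+https://cogent\.ai/crawler\)$"
)


def looks_like_cogentbot(headers: Mapping[str, str]) -> bool:
    """Return True if the headers match the documented indicators."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if not UA_PATTERN.match(lowered.get("user-agent", "")):
        return False
    try:
        # uuid.UUID is lenient (it also accepts un-hyphenated hex);
        # tighten this if an exact format is required.
        uuid.UUID(lowered.get("x-cogent-crawl-id", ""))
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    sample = {
        "User-Agent": "Cogentbot/1.0 (compatible; +https://cogent.ai/crawler)",
        "X-Cogent-Crawl-ID": str(uuid.uuid4()),
    }
    print(looks_like_cogentbot(sample))
```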
📊 Data Usage
Collected web pages are processed by Cogent AI’s NLP pipeline to produce training datasets for its flagship language model, Cogent-LM. The data is also used for analytics on web content trends (e.g., topic frequency analysis) and for improving the model’s factual accuracy. According to the company’s privacy policy at cogent.ai/privacy, Cogentbot does not store personal information or copyrighted material beyond what is necessary for model training. Data retention is limited to 90 days after crawling.
⚙️ Rate Limiting Policy
Rate limiting is recommended for Cogentbot because its high request volume (up to 15 req/s per IP) can saturate small servers and degrade performance for other visitors. A threshold-based block (e.g., returning 429 after 100 requests in 10 seconds) is advisable to protect infrastructure while allowing legitimate crawling to continue.
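One way to express the threshold above is a per-IP sliding-window counter. The sketch below is a minimal in-memory version. The class name SlidingWindowLimiter and its method are hypothetical, and a production deployment would more likely rely on the rate-limiting features of the web server or CDN in use.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> accepted timestamps

    def status_for(self, client_ip: str, now: Optional[float] = None) -> int:
        """Return 200 if the request is accepted, 429 if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    statuses = [limiter.status_for("198.51.100.7", now=0.0) for _ in range(101)]
    print(statuses.count(200), statuses.count(429))
```

Rejected requests are not recorded, so a client that backs off regains capacity as soon as its older requests leave the window.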
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.