Anthill Bot — Detection, Blocking & Technical Analysis

Anthill

Bot User-Agent: anthill

🤖 Overview

Anthill is a web crawler operated by Anthropic, the company behind Claude. First documented in public user-agent lists and Anthropic's official documentation around 2023, its primary purpose is to collect publicly accessible web content for training and improving Anthropic's large language models, including the Claude series. Unlike some other AI crawlers that also power search features, Anthill is dedicated solely to acquiring AI training data.

🌐 Technical Behavior

Anthill issues HTTP/1.1 GET requests at a rate that Anthropic describes as respecting server capacity, typically throttled to a few requests per second per domain. Its IP ranges originate from Anthropic's cloud infrastructure, which includes addresses allocated to AWS and possibly Google Cloud, though no fixed public IP list is published. The bot follows standard crawl patterns: it traverses internal links and sitemaps, uses persistent connections, and revalidates pages it has already seen with conditional requests (If-None-Match and If-Modified-Since, built from the ETag and Last-Modified headers of the earlier response) so that unchanged content is not downloaded again. Anthropic's official documentation notes that the crawler does not send a From header but does present a recognizable user-agent string.
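The revalidation behavior described above only saves bandwidth if the origin answers conditional requests with 304 Not Modified. The following is a minimal sketch of that server-side decision, not a description of how Anthill or any particular web server implements it. The function name `is_not_modified` and its parameters are hypothetical, and the entity-tag parsing is simplified (it splits on commas and compares tags weakly).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping


def _strip_weak(tag: str) -> str:
    """Drop the weak-validator prefix so W/"abc" compares equal to "abc"."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers: Mapping[str, str],
                    etag: str,
                    last_modified: datetime) -> bool:
    """Return True if the origin may answer a GET with 304 Not Modified.

    `etag` is the current entity tag of the resource (quotes included) and
    `last_modified` is a timezone-aware datetime. Both are hypothetical inputs
    that a real server would read from its own storage layer.
    """
    headers = {name.lower(): value for name, value in request_headers.items()}

    # If-None-Match takes precedence; If-Modified-Since is ignored when present.
    if_none_match = headers.get("if-none-match")
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates:
            return True
        return _strip_weak(etag) in {_strip_weak(tag) for tag in candidates}

    if_modified_since = headers.get("if-modified-since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # unparseable date: serve the full response
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        return last_modified.replace(microsecond=0) <= since

    return False
```

When the function returns True, the origin can reply with a 304 status and an empty body. Otherwise it sends the full 200 response with fresh ETag and Last-Modified headers.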

📋 robots.txt Compliance

Anthropic explicitly states in their AI Crawlers support page that Anthill fully respects robots.txt directives, including Disallow and Crawl-delay instructions. This compliance is corroborated by community reports and industry analysis from bot management firms like Cloudflare and Imperva. If a site operator disallows paths for the Anthill user-agent token in robots.txt, the crawler will skip those paths entirely. Anthropic also provides a web form for site owners to request exclusion beyond robots.txt.
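Before deploying a rule, it can help to check what a standards-following parser makes of it. The sketch below uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt. The file contents, the `/members/` path, the 10-second delay, and the helper name `is_allowed` are illustrative assumptions. It shows how the rules parse locally, not how Anthill itself interprets them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Anthill out of /members/ and ask it to wait
# 10 seconds between requests; leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Anthill
Crawl-delay: 10
Disallow: /members/

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str = ROBOTS_TXT) -> RobotFileParser:
    """Parse robots.txt text in memory, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(agent_token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent_token` may fetch `url` under the given rules."""
    return build_parser(robots_txt).can_fetch(agent_token, url)


if __name__ == "__main__":
    print(is_allowed("Anthill", "https://example.com/members/profile"))
    print(is_allowed("Anthill", "https://example.com/blog/post"))
    print(build_parser().crawl_delay("Anthill"))
```

Pass the short product token (`Anthill`) rather than the full user-agent string: this parser matches rules against the text before the first slash of the name it is given, so a full `Mozilla/5.0 (...)` string would fall through to the wildcard group.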

🔍 Detection Indicators

The primary user-agent string is Mozilla/5.0 (compatible; Anthill/1.0; +https://anthropic.com/ai-crawler), though variations may carry a different version number or trailing comments. Behavioral fingerprints include a consistent crawl pattern with no JavaScript rendering and the absence of common browser headers such as Accept-Language. Reverse DNS lookups on crawling IPs often resolve to *.anthropic.com or to generic cloud provider hostnames, so a reverse DNS match is supporting evidence rather than proof. Some site operators have also observed requests with a User-Agent containing Claude-Web, but that is a separate bot; Anthill always identifies itself as Anthill.
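The indicators above can be combined into a simple request classifier. The sketch below is illustrative only: the regular expression is an assumption derived from the user-agent string quoted in this section, and the function names `classify_request` and `rdns_matches_anthropic` are hypothetical. A user-agent header is trivially spoofed, so treat the result as a claim made by the client, not as verification.

```python
import re
from typing import Mapping, Optional

# Assumed pattern: the token "Anthill", optionally followed by "/<version>".
ANTHILL_UA = re.compile(r"\bAnthill(?:/(?P<version>[\d.]+))?", re.IGNORECASE)


def classify_request(headers: Mapping[str, str]) -> dict:
    """Summarize the header-based indicators for a single request."""
    normalized = {name.lower(): value for name, value in headers.items()}
    match = ANTHILL_UA.search(normalized.get("user-agent", ""))
    version: Optional[str] = match.group("version") if match else None
    return {
        "claims_anthill": match is not None,
        "version": version,
        # Browsers normally send Accept-Language; the crawler reportedly does not.
        "missing_accept_language": "accept-language" not in normalized,
    }


def rdns_matches_anthropic(hostname: str) -> bool:
    """Weak supporting signal: the reverse DNS name ends in .anthropic.com."""
    return hostname.rstrip(".").lower().endswith(".anthropic.com")


if __name__ == "__main__":
    sample = {
        "User-Agent": "Mozilla/5.0 (compatible; Anthill/1.0; "
                      "+https://anthropic.com/ai-crawler)",
        "Accept": "*/*",
    }
    print(classify_request(sample))
```

The reverse DNS helper takes a hostname that has already been resolved (for example with `socket.gethostbyaddr`), which keeps the matching logic free of network calls and easy to test.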

📊 Data Usage

Content collected by Anthill feeds directly into Anthropic's training pipeline for its Claude models, including Claude 3 and future iterations. The data is used to improve the models' grasp of language, facts, and reasoning drawn from public web pages. Anthropic's privacy policy states that they do not use personal data for model training without explicit consent, and they provide an opt-out mechanism for site owners. Unlike search engines, Anthill does not index data for public retrieval; the content is used purely for internal model improvement.

⚙️ Rate Limiting Policy

Rate-limiting Anthill is advisable to prevent excessive load on your origin server, as Anthropic's crawler does not implement aggressive backoff on its own beyond respecting Crawl-delay. A threshold of 10–20 requests per minute per IP for requests carrying the Anthill user-agent is a reasonable starting point. A limit in this range helps keep your infrastructure responsive to human users while still letting the crawler collect a representative sample of your content.
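One way to enforce a per-IP threshold like the one suggested above is a sliding-window counter. The sketch below is a minimal in-memory version under stated assumptions: the class name `SlidingWindowLimiter` is hypothetical, the default of 15 requests per 60 seconds is simply the midpoint of the 10–20 range, and state is held in a single process rather than a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests: int = 15, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record a request from `ip` and report whether it is within the limit.

        `now` can be injected for testing; by default a monotonic clock is used.
        """
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]

        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()

        if len(hits) >= self.max_requests:
            return False  # over the limit: caller would typically send HTTP 429

        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
    for second in range(5):
        print(second, limiter.allow("203.0.113.7", now=float(second)))
```

In practice the limiter would be consulted only for requests already identified as Anthill, and a rejected request would receive a 429 Too Many Requests response, optionally with a Retry-After header. Rejected requests are not recorded, so a client that keeps retrying is not penalized beyond the window itself.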


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
