magpie-crawler

Crawler User-Agent: magpie-crawler

🤖 Overview

Magpie-Crawler is a web crawler operated by Anthropic, an AI safety company based in San Francisco, California. It was first publicly documented in early 2024 and is designed to systematically collect publicly accessible web content used to train and improve Anthropic’s Claude series of large language models. The crawler’s name references the magpie bird, known for collecting diverse objects, mirroring the crawler's role in gathering a broad corpus of text from the open web. Official documentation on Anthropic’s website (https://docs.anthropic.com/en/docs/faq#what-is-the-magpie-crawler) confirms its legitimate purpose and states that it is not used for search indexing, advertising, or analytics.

🌐 Technical Behavior

Magpie-Crawler issues HTTP GET requests at a moderate crawl rate, typically a few requests per second per domain, and adapts its speed dynamically based on server response times. It supports both HTTP/1.1 and HTTP/2 and sends standard headers, including Accept: text/html, application/xhtml+xml and Accept-Language: en-US,en;q=0.5. The crawler follows robots.txt directives and respects the Crawl-delay directive when one is present. IP addresses originate from Anthropic’s cloud infrastructure, primarily Amazon Web Services (AWS) regions us-east-1 and us-west-2, as observed in server logs and verified through reverse DNS lookups. Anthropic has not published a fixed IP range list, but operators can check whether a request's reverse DNS name falls under ec2-*.compute-1.amazonaws.com or ec2-*.us-west-2.compute.amazonaws.com. The crawler uses a persistent connection pool and respects Retry-After headers when it receives HTTP 429 (Too Many Requests) responses.
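The reverse DNS check described above can be sketched in Python. This is a minimal illustration, not an official verification method: the function names are hypothetical, and the two hostname patterns are taken from the text as-is. Note that these patterns match any EC2 instance in those regions, so a match only narrows the source down to AWS and should be combined with the User-Agent string.

```python
import re
import socket

# Hostname patterns quoted in the text above. They match ANY EC2 host in
# us-east-1 / us-west-2, so a match alone does not prove the crawler's identity.
EC2_PATTERNS = [
    re.compile(r"^ec2-\d+-\d+-\d+-\d+\.compute-1\.amazonaws\.com$"),
    re.compile(r"^ec2-\d+-\d+-\d+-\d+\.us-west-2\.compute\.amazonaws\.com$"),
]


def hostname_matches(hostname: str) -> bool:
    """Return True if a reverse DNS name fits one of the patterns above."""
    host = hostname.rstrip(".").lower()
    return any(pattern.match(host) for pattern in EC2_PATTERNS)


def reverse_dns_matches(ip: str) -> bool:
    """Resolve an IP to its PTR name and test it (requires network access)."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False  # no PTR record or lookup failure
    return hostname_matches(hostname)
```

In practice, `reverse_dns_matches` would be called with the client IP taken from a log entry, while `hostname_matches` can be used offline on hostnames already present in the logs.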

📋 robots.txt Compliance

Anthropic explicitly states in its FAQ that Magpie-Crawler fully honors robots.txt Disallow directives. The crawler reads the file at the start of each crawl session and re-evaluates it regularly, typically every 24 hours, so that updates take effect. Numerous site operators have confirmed in public reports (e.g., on Hacker News and Stack Overflow) that adding User-agent: Magpie-Crawler with Disallow: / stops crawling activity, in most reported cases as soon as the crawler re-reads the file. No evidence of deliberate non‑compliance has been documented.
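As an illustration of the blocking rule mentioned above, the following sketch uses Python's standard `urllib.robotparser` to show how a compliant parser would interpret such a robots.txt file. The file contents, the `is_allowed` helper, and the example.com URL are assumptions made for the example; how the crawler itself parses the file is not documented here.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block Magpie-Crawler everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: Magpie-Crawler
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given user-agent token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("Magpie-Crawler", "https://example.com/page"))
print(is_allowed("SomeOtherBot", "https://example.com/page"))
```

The check is done against the short product token (`Magpie-Crawler`) rather than the full User-Agent header, since that is what robots.txt rules are matched on.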

🔍 Detection Indicators

Identification relies primarily on the HTTP User-Agent string: Mozilla/5.0 (compatible; Magpie-Crawler/1.0; +https://magpie-crawler.anthropic.com). Some requests may also carry a Via header indicating Anthropic’s proxy infrastructure. Behavioral fingerprints include a consistently low request rate (under 5 requests per second per IP), absence of JavaScript execution, and a strict preference for HTML content. Log entries show Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 and Connection: keep-alive. The crawler does not spoof other user agents and always identifies itself clearly.
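A minimal sketch of matching the User-Agent string above in log data follows. The regular expression and the `crawler_version` helper are illustrative assumptions, not a published detection rule. Because the User-Agent header is self-reported, a match shows what the client claims to be and is best combined with other indicators.

```python
import re
from typing import Optional

# Matches the product token and version from the User-Agent string quoted above.
UA_PATTERN = re.compile(
    r"\bMagpie-Crawler/(?P<version>\d+(?:\.\d+)*)", re.IGNORECASE
)


def crawler_version(user_agent: str) -> Optional[str]:
    """Return the advertised crawler version, or None if the UA does not match."""
    match = UA_PATTERN.search(user_agent)
    return match.group("version") if match else None


ua = "Mozilla/5.0 (compatible; Magpie-Crawler/1.0; +https://magpie-crawler.anthropic.com)"
print(crawler_version(ua))  # prints 1.0
```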

📊 Data Usage

Collected text data is used exclusively for training and refining Anthropic’s suite of AI models, including Claude 3 (Opus, Sonnet, Haiku) and future iterations. The corpus is processed to build training datasets that improve natural language understanding, reasoning, and safety alignment. Anthropic states that no personal identifying information is intentionally collected, and the crawler avoids login-protected or paywalled content. Data is not shared with third parties, sold, or used for advertising.

⚙️ Rate Limiting Policy

Web administrators commonly rate-limit Magpie-Crawler to prevent excessive load that could degrade site performance or increase bandwidth costs. Although the crawler is inherently polite, threshold-based blocking (e.g., returning HTTP 429 once a client exceeds a set number of requests per minute) provides an additional layer of control for sites that face multiple concurrent crawlers or have resource constraints. The policy is a standard precaution to ensure fair resource allocation and does not imply that the bot itself is aggressive.
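Threshold-based blocking of this kind can be sketched as a sliding-window counter. The class name, the per-client key, and the example limit of 3 requests per 60 seconds are hypothetical values chosen for illustration; a real deployment would pick its own threshold and usually enforce it at the proxy or firewall layer.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of allowed hits

    def status_for(self, client_key: str, now: float) -> int:
        """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429
        hits.append(now)
        return 200


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
print([limiter.status_for("magpie-crawler", t) for t in (0.0, 1.0, 2.0, 3.0)])
# prints [200, 200, 200, 429]
```

Timestamps are passed in explicitly so the logic can be tested without waiting; in a server, `now` would typically come from `time.monotonic()`.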


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.