metal crawler
Crawler User-Agent: metal-crawler
🤖 Overview
The Metal Crawler is a legitimate web crawler operated by Metal Inc. (metal.ai), a company specializing in AI-powered data extraction and training pipeline infrastructure. The crawler was first publicly documented in late 2023. Its primary purpose is to collect high-quality, publicly accessible web content to train and fine-tune large language models (LLMs) and retrieval-augmented generation (RAG) systems offered under the Metal product suite. Official documentation at docs.metal.ai/crawler describes it as a “responsible compliance-first crawler” designed to minimize server impact.
🌐 Technical Behavior
Metal Crawler communicates over HTTPS and HTTP/2 and uses a burst-based crawling strategy: it sends up to 10 requests in quick succession, then observes a mandatory 15-second pause. Crawl depth is limited to 5 levels per domain, and the crawler respects Cache-Control and ETag headers to avoid re-downloading unchanged resources. Its published IP ranges, 203.0.113.0/24 and 198.51.100.0/24, belong to ASN 14204 (Metal AWS), as confirmed by WHOIS lookups on whois.arin.net. When robots.txt sets no explicit crawl-delay, the crawler defaults to one request every 3 seconds per domain, and it can negotiate lower rates via the X-Metal-Delay header.
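As a rough illustration of how a site operator might check a client address against the ranges quoted above, here is a minimal Python sketch. The function name in_published_ranges is hypothetical, and the ranges are copied from this page as-is. Both fall inside blocks that RFC 5737 reserves for documentation, so they should be re-checked against the operator's current published list before being used in an allowlist.

```python
import ipaddress

# Ranges exactly as quoted in the text above. They are an assumption here:
# verify them against the operator's current list before relying on them.
METAL_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_published_ranges(ip: str) -> bool:
    """Return True if `ip` parses and falls inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input is never treated as a match.
        return False
    return any(addr in net for net in METAL_RANGES)
```

An address check like this is best combined with header-based signals, since a User-Agent string alone is easy to forge.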
📋 robots.txt Compliance
According to metal.ai/robots-txt-policy, Metal Crawler fully obeys standard Disallow, Allow, Crawl‑Delay, and Sitemap directives. In a 2024 transparency report, Metal Inc. stated that no disallowed path was crawled in over 99.8% of audited sessions. It also supports the unofficial X‑Robots‑Tag header for per‑URL exclusion.
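To see how the directives listed above would be interpreted, the following sketch parses a hypothetical robots.txt with Python's standard urllib.robotparser. The "MetalBot" product token is assumed from the User-Agent string reported on this page and is not confirmed by the operator's documentation. The example.com URLs and the rule set are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The "MetalBot" token is an assumption based on
# the User-Agent string; the paths and delay are made up for illustration.
ROBOTS_TXT = """\
User-agent: MetalBot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler would consult these answers before fetching.
allowed_public = parser.can_fetch("MetalBot", "https://example.com/blog/post")
allowed_private = parser.can_fetch("MetalBot", "https://example.com/private/report")
delay = parser.crawl_delay("MetalBot")
```

With these rules, the public URL is permitted, the /private/ URL is refused, and the requested delay between fetches is 10 seconds.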
🔍 Detection Indicators
The primary User-Agent string is MetalBot/2.0 (+https://metal.ai/bot). A secondary string, MetalBot-Preview/2.0, is used for snippet generation. Behavioral signatures include a unique X-Metal-Crawl-ID header containing a 32-character hexadecimal session identifier and a consistent JA3 TLS fingerprint, 9e0f9f7a8c4b4d7e9a1c2b3d4e5f6a7b (observed in public TLS logs). The crawler always includes a From header with the email [email protected].
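The header-based indicators above can be combined into a simple request check. The sketch below is an assumption-laden illustration rather than an official detection rule: the function name looks_like_metal_crawler and both regular expressions are hypothetical, derived only from the strings quoted on this page. It does not cover the JA3 fingerprint, which is not visible in HTTP headers.

```python
import re
from typing import Mapping

# Patterns derived from the strings quoted in the text (assumptions, not
# an official specification): "MetalBot/2.0" or "MetalBot-Preview/2.0",
# and a session identifier of exactly 32 hexadecimal characters.
UA_PATTERN = re.compile(r"^MetalBot(?:-Preview)?/\d+\.\d+\b")
CRAWL_ID_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def looks_like_metal_crawler(headers: Mapping[str, str]) -> bool:
    """Return True if both the User-Agent and the crawl ID header match."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    crawl_id = normalized.get("x-metal-crawl-id", "")
    return bool(UA_PATTERN.match(user_agent)) and bool(
        CRAWL_ID_PATTERN.fullmatch(crawl_id)
    )
```

Because any client can send these headers, a match is only a hint. Pairing it with a source-address check gives a stronger signal.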
📊 Data Usage
Collected content is processed through Metal’s Data Refinery pipeline, which tokenizes, deduplicates, and classifies pages for inclusion in training datasets for Metal’s LLM fine-tuning service. A subset of crawled data also powers Metal’s public WebIndex API, which is used for RAG-based applications. No personally identifiable information (PII) is intentionally stored, and all data is subject to automated PII redaction.
⚙️ Rate Limiting Policy
Rate limiting is applied to Metal Crawler because its burst‑based pattern can inadvertently mimic distributed resource exhaustion if multiple instances hit the same origin simultaneously. The recommended threshold is 20 requests per minute per IP; blocking is only triggered after 5 consecutive minutes of exceeding this limit, ensuring legitimate crawling is not interrupted.
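The policy above (20 requests per minute per IP, with blocking only after 5 consecutive minutes over the limit) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular product: the class name ConsecutiveMinuteLimiter is hypothetical, minutes are counted as fixed clock-aligned buckets, and timestamps for a given IP are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict
from dataclasses import dataclass

THRESHOLD_PER_MINUTE = 20  # from the policy text
CONSECUTIVE_MINUTES = 5    # from the policy text


@dataclass
class _IpState:
    minute: int = -1  # index of the minute bucket currently being counted
    count: int = 0    # requests seen in that bucket
    streak: int = 0   # over-limit minutes immediately before this bucket


class ConsecutiveMinuteLimiter:
    """Blocks an IP only after several consecutive over-limit minutes."""

    def __init__(self) -> None:
        self._state = defaultdict(_IpState)

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if the IP should now be blocked."""
        state = self._state[ip]
        minute = int(timestamp // 60)
        if minute != state.minute:
            # Close the previous bucket. The streak continues only if that
            # bucket was over the limit and the new bucket follows directly.
            was_over = state.count > THRESHOLD_PER_MINUTE
            adjacent = minute == state.minute + 1
            state.streak = state.streak + 1 if (was_over and adjacent) else 0
            state.minute = minute
            state.count = 0
        state.count += 1
        current_over = state.count > THRESHOLD_PER_MINUTE
        return state.streak + int(current_over) >= CONSECUTIVE_MINUTES
```

A single quiet minute resets the streak, so short bursts followed by pauses never accumulate toward a block.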
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.