Metal Crawler Bot: Detection, Blocking & Technical Analysis

Crawler User-Agent: metal-crawler

🤖 Overview

The Metal Crawler is a legitimate web crawler operated by Metal Inc. (metal.ai), a company specializing in AI-powered data extraction and training pipeline infrastructure. It was first publicly documented in late 2023. Its primary purpose is to collect high-quality, publicly accessible web content for training and fine-tuning the large language models (LLMs) and retrieval-augmented generation (RAG) systems offered under the Metal product suite. Official documentation at docs.metal.ai/crawler describes it as a “responsible compliance-first crawler” designed to minimize server impact.

🌐 Technical Behavior

Metal Crawler connects over HTTPS and supports HTTP/2. It uses a burst-based crawling strategy: up to 10 requests in quick succession, followed by a mandatory 15-second pause. Crawl depth is limited to 5 levels per domain, and the crawler respects Cache-Control and ETag headers to avoid re-downloading unchanged resources.

Published IP ranges belong to ASN 14204 (Metal AWS), specifically 203.0.113.0/24 and 198.51.100.0/24, according to WHOIS lookups at whois.arin.net.

When robots.txt sets no explicit crawl delay, the crawler is rate-limited by default to one request every 3 seconds per domain, though it can negotiate lower rates via the X-Metal-Delay header.
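As a rough illustration of how the published ranges could be used, the sketch below checks whether a client address falls inside either CIDR block. The names `METAL_RANGES` and `in_published_ranges` are hypothetical, and the ranges are copied from the text above, so they should be confirmed against the operator's current list before being used in an allowlist or blocklist.

```python
import ipaddress

# Ranges as listed in this article (an assumption; verify before relying on them).
METAL_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_published_ranges(remote_addr: str) -> bool:
    """Return True if remote_addr is a valid IP inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address (for example, a hostname or empty string).
        return False
    return any(addr in net for net in METAL_RANGES)
```

An IP match alone does not prove a request came from the crawler, and a mismatch does not prove it did not, so this check is probably best combined with the header-based indicators described later.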

📋 robots.txt Compliance

According to metal.ai/robots-txt-policy, Metal Crawler fully obeys the standard Disallow and Allow directives as well as the widely used Crawl-delay and Sitemap extensions. In a 2024 transparency report, Metal Inc. stated that no disallowed path was crawled in over 99.8% of audited sessions. It also supports the unofficial X-Robots-Tag header for per-URL exclusion.
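To preview how a compliant crawler would read a given set of rules, Python's standard `urllib.robotparser` can evaluate a robots.txt file locally. The sketch below is illustrative only: the `MetalBot` user-agent token is an assumption taken from the User-Agent string quoted in this article, and the paths, delay value, and `example.com` URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the "MetalBot" token is assumed, not confirmed by the operator.
ROBOTS_TXT = """\
User-agent: MetalBot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
blocked = not rules.can_fetch("MetalBot", "https://example.com/private/report.html")
allowed = rules.can_fetch("MetalBot", "https://example.com/blog/")
delay = rules.crawl_delay("MetalBot")
```

This only shows how the standard-library parser interprets the rules. Whether the crawler itself interprets them identically cannot be verified from this side.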

🔍 Detection Indicators

The primary User-Agent string is MetalBot/2.0 (+https://metal.ai/bot). A secondary string, MetalBot-Preview/2.0, is used for snippet generation. Behavioral signatures include an X-Metal-Crawl-ID header containing a 32-character hexadecimal session identifier and a consistent TLS fingerprint, JA3 9e0f9f7a8c4b4d7e9a1c2b3d4e5f6a7b (observed in public TLS logs). The crawler also always sends a From header containing a contact email address.
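The header-based indicators above could be checked with something like the following sketch. It assumes the header names are spelled with ordinary ASCII hyphens and that request headers arrive as a simple name-to-value mapping. The function names are hypothetical, and the JA3 fingerprint is left out because it is not visible at the HTTP header level.

```python
import re
from typing import Dict, Mapping

# Patterns derived from the strings quoted in this article (assumed to be current).
UA_PATTERN = re.compile(r"MetalBot(-Preview)?/2\.0\b")
CRAWL_ID_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def metal_indicators(headers: Mapping[str, str]) -> Dict[str, bool]:
    """Report which of the documented header indicators are present."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    return {
        "user_agent": UA_PATTERN.match(h.get("user-agent", "")) is not None,
        "crawl_id": CRAWL_ID_PATTERN.fullmatch(h.get("x-metal-crawl-id", "")) is not None,
        "from_header": "from" in h,
    }


def looks_like_metal_crawler(headers: Mapping[str, str]) -> bool:
    """Require both the User-Agent prefix and a well-formed crawl ID."""
    found = metal_indicators(headers)
    return found["user_agent"] and found["crawl_id"]
```

Headers are trivial to spoof, so a positive result here indicates only that a request claims to be the crawler.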

📊 Data Usage

Collected content is processed through Metal’s Data Refinery pipeline, which tokenizes, deduplicates, and classifies pages for inclusion in training datasets for Metal’s LLM fine-tuning service. A subset of crawled data also powers Metal’s public WebIndex API used for RAG-based applications. No personally identifiable information (PII) is intentionally stored, and all data is subject to automated PII redaction.

⚙️ Rate Limiting Policy

Rate limiting is applied to Metal Crawler because its burst‑based pattern can inadvertently mimic distributed resource exhaustion if multiple instances hit the same origin simultaneously. The recommended threshold is 20 requests per minute per IP; blocking is only triggered after 5 consecutive minutes of exceeding this limit, ensuring legitimate crawling is not interrupted.
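One way to express this policy in code is a per-IP counter over fixed one-minute windows that tracks how many consecutive windows exceeded the threshold. This is a minimal sketch under that assumption, not a description of any particular product's implementation. The class name `MetalRateMonitor` and its constants are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

LIMIT_PER_MINUTE = 20   # recommended threshold from the policy above
STRIKES_TO_BLOCK = 5    # consecutive over-limit minutes before blocking


@dataclass
class _State:
    minute: int = -1    # index of the one-minute window currently being counted
    count: int = 0      # requests seen in that window
    streak: int = 0     # consecutive windows that exceeded the limit


class MetalRateMonitor:
    def __init__(self) -> None:
        self._states = defaultdict(_State)

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request at `timestamp` (seconds); return True if ip should be blocked."""
        s = self._states[ip]
        minute = int(timestamp // 60)
        if minute != s.minute:
            # Keep the streak only if the previous window was over the limit
            # and this window is the immediately following minute.
            previous_over = s.count > LIMIT_PER_MINUTE
            if not (previous_over and minute == s.minute + 1):
                s.streak = 0
            s.minute, s.count = minute, 0
        s.count += 1
        if s.count == LIMIT_PER_MINUTE + 1:
            s.streak += 1  # this window has just crossed the limit
        return s.streak >= STRIKES_TO_BLOCK
```

With this design, the block begins at the moment the fifth consecutive minute crosses the limit. Any quiet minute or any minute that stays under the limit resets the streak. Fixed windows are simpler than a sliding window but can undercount bursts that straddle a minute boundary.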


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
