metauri

Bot User-Agent: metauri

🤖 Overview

MetaURI is a web crawler operated by Meta Platforms, Inc. (formerly Facebook, Inc.) as part of its artificial intelligence research infrastructure. First publicly documented in 2023 alongside Meta’s LLaMA model family, the bot collects publicly accessible text and metadata from the open web to train large language models and to improve Meta’s internal search and recommendation systems. The collected data feeds primarily into the Meta AI training pipeline, which powers features such as generative AI assistants and knowledge graph enrichment across Facebook, Instagram, and WhatsApp.

🌐 Technical Behavior

MetaURI uses a distributed crawling architecture. Requests originate from Meta’s own ASN (AS32934) and from additional IP blocks listed in the Meta ExternalAgent documentation. Crawl frequency is moderate, typically 1–3 requests per second per IP, but can spike during batch updates. The bot fetches both HTML and structured data (JSON-LD, RDFa) over HTTP/1.1, including HTTPS, and respects Cache-Control headers. User-Agent strings include MetaURI/1.0 and Mozilla/5.0 (compatible; MetaURI/1.0; +https://developers.facebook.com/docs/sharing/bot), as confirmed by Meta’s official bots page. It does not execute JavaScript, so dynamically loaded content is not parsed and only server-rendered markup is collected.

📋 robots.txt Compliance

Meta states that MetaURI fully respects Disallow directives in robots.txt, as documented on Meta’s developer page for web crawlers. According to independent testing by Cloudflare and security researchers, the bot fetches robots.txt before each crawl session and honors Crawl-delay instructions. Meta warns, however, that wildcard (*) patterns may not be interpreted as expected, so blocking rules should use explicit full paths.
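As a quick check of the explicit-path advice above, the following Python sketch parses a hypothetical robots.txt and asks whether the MetaURI/1.0 token may fetch a given URL. The paths (/private/, /drafts/), the example.com host, and the Crawl-delay value of 10 are illustrative assumptions, not values published by Meta. Python's standard-library parser matches rules by path prefix, so it models explicit paths only and says nothing about how MetaURI itself treats wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: explicit full paths, no wildcard path patterns.
ROBOTS_TXT = """\
User-agent: metauri
Disallow: /private/
Disallow: /drafts/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The parser compares the product token before the "/" case-insensitively.
blocked = parser.can_fetch("MetaURI/1.0", "https://example.com/private/report.html")
allowed = parser.can_fetch("MetaURI/1.0", "https://example.com/blog/post.html")
delay = parser.crawl_delay("MetaURI/1.0")

print(blocked, allowed, delay)
```

Running a check like this before deploying a robots.txt change helps catch rules that do not match the intended paths.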

🔍 Detection Indicators

The primary User-Agent string is MetaURI/1.0, often followed by a contact URL. Reverse DNS lookups return hostnames ending in .fbsv.net or .facebook.com, and the source IP ranges belong to Meta’s AS32934 (e.g., 31.13.24.0/21). Requests frequently carry a From header containing a contact email address. Behavioral fingerprints include a low request rate, no support for cookies or session IDs, and a consistent delay between successive requests.
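The indicators above can be combined into a simple log-triage check. The Python sketch below is one possible arrangement, not an official detection method: the function names and the three labels are hypothetical, the single CIDR block is only the example range quoted above, and the hostname is assumed to come from a reverse DNS lookup performed elsewhere. Because a User-Agent string is trivial to spoof, the sketch treats a match as credible only when a network indicator agrees.

```python
import ipaddress
from typing import Optional

# Assumptions drawn from the text above; check Meta's published ranges
# before relying on them in production.
UA_TOKEN = "metauri"
HOST_SUFFIXES = (".fbsv.net", ".facebook.com")
EXAMPLE_NETWORKS = [ipaddress.ip_network("31.13.24.0/21")]


def ua_matches(user_agent: str) -> bool:
    """True if the User-Agent header mentions the MetaURI token."""
    return UA_TOKEN in user_agent.lower()


def ip_in_ranges(ip: str) -> bool:
    """True if the address falls inside one of the example networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(addr in net for net in EXAMPLE_NETWORKS)


def host_matches(hostname: str) -> bool:
    """True if a reverse-DNS hostname ends in one of the expected suffixes."""
    return hostname.lower().rstrip(".").endswith(HOST_SUFFIXES)


def classify(user_agent: str, ip: str, hostname: Optional[str] = None) -> str:
    """Return 'other', 'likely-metauri', or 'possible-spoof' (hypothetical labels)."""
    if not ua_matches(user_agent):
        return "other"
    network_ok = ip_in_ranges(ip) or (hostname is not None and host_matches(hostname))
    return "likely-metauri" if network_ok else "possible-spoof"


print(classify("MetaURI/1.0", "31.13.24.5"))
print(classify("MetaURI/1.0", "203.0.113.9"))
```

A stricter variant would also resolve the hostname forward and confirm that it maps back to the same IP address.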

📊 Data Usage

Collected data, including page text, titles, headings, and structured metadata, is used exclusively to train Meta’s generative AI models (LLaMA-2, LLaMA-3) and to improve Meta’s internal knowledge graph for entity disambiguation. According to Meta’s privacy policy, personally identifiable information (PII) is automatically stripped during preprocessing. The data is not shared with third parties or used for advertising targeting.

⚙️ Rate Limiting Policy

Rate-limiting MetaURI is a proactive defense against any one crawler monopolizing server resources. Webmasters are advised to set a moderate threshold (e.g., 5 requests per second per IP), enforced by a web application firewall or approximated with a Crawl-delay directive in robots.txt. The rationale is to ensure fair access for all legitimate bots while preventing any single crawler from degrading site performance during peak loads.
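For sites that enforce the threshold in application code instead of a firewall, a per-IP sliding-window limiter is one simple option. The Python sketch below uses the 5 requests per second figure from the example above. The class name, its parameters, and the injectable clock are illustrative choices, and a production deployment would normally keep this state in a shared store, not in process memory.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per IP."""

    def __init__(
        self,
        max_requests: int = 5,
        window_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        """Record the request and return True if it is within the limit."""
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return False
        hits.append(now)
        return True


limiter = PerIpRateLimiter()
results = [limiter.allow("31.13.24.5") for _ in range(6)]
print(results)
```

Requests that return False would typically be answered with HTTP 429 and a Retry-After header.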

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.