robozilla
Bot User-Agent: robozilla
🤖 Overview
Robozilla is a web crawler operated by the Mozilla Foundation, originally developed around 2010 to test the performance and rendering capabilities of Firefox by crawling a large sample of the web. The crawler later became a primary contributor to the Common Crawl open repository of web crawl data, which is widely used by researchers, data scientists, and AI training pipelines. Mozilla publicly documents Robozilla on its official wiki, and the crawler is listed in Common Crawl’s official bot registry as a legitimate, non‑commercial crawler whose output is freely available under the Common Crawl license.
🌐 Technical Behavior
Robozilla performs deep, breadth‑first crawls with an emphasis on capturing full page content, including HTML, CSS, JavaScript, and images. By default the crawler limits itself to approximately 1 request per second per domain, though the delay between requests may be reduced during large‑scale Common Crawl snapshots that span billions of pages. Requests are made over HTTP/1.1 and support gzip compression. The originating IP addresses belong to Mozilla’s autonomous system (AS36844) and are geolocated primarily in the United States and Europe. Robozilla uses a custom, multithreaded crawling engine that stores raw content in WARC (Web ARChive) files, which are then publicly hosted on Amazon S3. The crawler does not execute JavaScript and retrieves only server‑rendered responses, making it similar in behavior to other archive‑oriented bots such as Heritrix.
📋 robots.txt Compliance
According to Mozilla’s official documentation and Common Crawl guidelines, Robozilla fully honors robots.txt directives, including Disallow, Crawl‑Delay, and Allow rules. The Common Crawl team also provides a dedicated opt‑out mechanism via a robots.txt entry specifically targeting “robozilla” to allow webmasters to exclude their sites from the crawl. Public audits have shown Robozilla’s compliance record to be among the highest of all large‑scale crawlers.
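As a rough illustration of the opt-out described above, the sketch below checks a robots.txt file with Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the name `SomeOtherBot` are assumptions made up for the example. It shows only how a compliant parser would read such rules, not how Robozilla itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: exclude "robozilla" entirely and ask every
# other crawler to wait 5 seconds between requests.
ROBOTS_TXT = """\
User-agent: robozilla
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The robozilla group blocks every path for that user agent.
robozilla_allowed = parser.can_fetch("robozilla", "https://example.com/index.html")

# Other agents fall through to the "*" group.
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/index.html")
other_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/a.html")
other_delay = parser.crawl_delay("SomeOtherBot")

print(robozilla_allowed, other_allowed, other_private, other_delay)
```

The parser matches the user-agent token case-insensitively, so a rule written for `robozilla` also applies to a client announcing itself as `Robozilla/2.0`.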
🔍 Detection Indicators
Robozilla identifies itself with a User‑Agent that includes the substring “Robozilla”. The primary strings are Robozilla/1.0 and Robozilla/2.0, and some older versions use Robozilla/0.9. Requests are often accompanied by a contact email such as crawler‑[email protected] in the HTTP From header. The crawler may also send a custom X‑Robots‑Tag header set to “noarchive” when instructed, but this is rare. Server logs can be filtered by AS number 36844 or by reverse DNS lookups that return a hostname matching “crawl*.mozilla.org”.
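The indicators above could be combined into a simple log classifier along the following lines. This is a minimal sketch: the function names and the three labels are hypothetical, and it assumes “crawl*.mozilla.org” is meant as a glob over the whole reverse DNS hostname. The actual DNS lookup is left out so the example runs offline. The hostname is passed in as a string.

```python
import fnmatch
from typing import Optional

# Assumption: the reverse DNS pattern is a glob over the full hostname.
RDNS_GLOB = "crawl*.mozilla.org"


def claims_robozilla(user_agent: str) -> bool:
    """True if the User-Agent contains the substring "Robozilla" (any case)."""
    return "robozilla" in user_agent.lower()


def rdns_matches(hostname: str) -> bool:
    """True if a reverse DNS hostname matches the crawl*.mozilla.org glob."""
    host = hostname.rstrip(".").lower()  # tolerate a trailing root dot
    return fnmatch.fnmatchcase(host, RDNS_GLOB)


def classify(user_agent: str, rdns_hostname: Optional[str] = None) -> str:
    """Label a request as 'other', 'claimed', 'verified', or 'suspect'."""
    if not claims_robozilla(user_agent):
        return "other"
    if rdns_hostname is None:
        return "claimed"  # User-Agent only, nothing to cross-check
    return "verified" if rdns_matches(rdns_hostname) else "suspect"


print(classify("Robozilla/2.0", "crawl-07.mozilla.org"))
print(classify("Robozilla/1.0", "host.example.net"))
```

Because a User-Agent string is trivial to spoof, the sketch treats the substring match alone as a claim and only upgrades it when the network-level indicator agrees.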
📊 Data Usage
The primary purpose of Robozilla’s crawl data is to produce the Common Crawl dataset, a massive, publicly available corpus of web pages used for academic research, natural language processing, and machine learning model training. Organizations such as OpenAI, Google, and Hugging Face have used Common Crawl snapshots for training large language models. The data is also used by Mozilla internally to benchmark Firefox performance on real‑world web content. No personal information is collected; only publicly accessible pages are saved.
⚙️ Rate Limiting Policy
Robozilla is rate‑limited because its crawl volumes (often millions of pages per day) can temporarily overwhelm smaller servers or shared hosting environments. Webmasters are advised to set a Crawl‑Delay directive in robots.txt (2–5 seconds is recommended) to reduce load. If the crawler ignores the directive, a server‑side limit of 10 requests per minute per IP is a reasonable threshold.
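One way to enforce the fallback threshold of 10 requests per minute per IP is a sliding-window counter. The class below is an illustrative sketch, not a production limiter: the class name and its parameters are assumptions, state is kept in memory only, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP timestamps of accepted requests, oldest first.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record and allow the request, or reject it if the IP is over budget."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # rejected requests are not recorded
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=10, window_seconds=60.0)
# Twelve requests from one IP, one per second: the last two are rejected.
results = [limiter.allow("203.0.113.5", float(t)) for t in range(12)]
print(results)
```

In a real deployment the `now` argument would typically come from `time.monotonic()`, and the same policy is often configured in the web server or CDN instead of application code.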
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.