wikiwix-bot

Bot User-Agent: wikiwix-bot

🤖 Overview

wikiwix-bot is a legitimate web crawler operated by WikiWix, a French search engine and web archiving service founded in 2010 and headquartered in Paris. Its primary purpose is to index publicly accessible web content for the WikiWix search engine and to power its web archive service, which preserves snapshots of pages over time for historical research and digital preservation. The bot operates under the umbrella of the WikiWix platform, which also provides a browser toolbar and a mobile application for accessing archived content.

🌐 Technical Behavior

wikiwix-bot crawls over HTTP/1.1 and HTTPS, issuing GET requests with a default interval of 10 seconds between successive requests so that it does not overwhelm servers. It follows a breadth-first crawl strategy, prioritizing pages with high inbound link counts and freshly updated content. Its IP addresses are dynamically assigned within 91.121.0.0/16 (owned by OVH, a French hosting provider) and occasionally 178.33.0.0/16, as confirmed in WikiWix’s official crawler documentation. It identifies itself with the User-Agent string Mozilla/5.0 (compatible; wikiwix-bot/1.0; +https://www.wikiwix.com/en/bot) and adheres to the robots.txt protocol. For each page it requests the root document first and only then fetches linked resources. The crawler does not follow JavaScript-generated links or submit forms; it fetches only static HTML and the associated resources (CSS, images) needed to render a textual snapshot. Historical analysis by the site BotScout notes that the bot’s request spikes occur during business hours in European time zones, which suggests a human-supervised crawling cycle.
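The address ranges above lend themselves to a quick membership check. The following is a minimal sketch, assuming the two ranges are accurate as reported; the function name in_reported_ranges is hypothetical. Both ranges are described as hosting-provider space, so a match is only a weak hint and should be combined with other indicators.

```python
import ipaddress

# Ranges quoted in the text above; treated as reported, not independently verified.
REPORTED_RANGES = [
    ipaddress.ip_network("91.121.0.0/16"),
    ipaddress.ip_network("178.33.0.0/16"),
]


def in_reported_ranges(ip: str) -> bool:
    """Return True if `ip` parses and falls inside one of the reported ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input is simply "not a match".
        return False
    return any(addr in net for net in REPORTED_RANGES)


if __name__ == "__main__":
    for candidate in ("91.121.10.5", "178.33.255.1", "8.8.8.8"):
        print(candidate, in_reported_ranges(candidate))
```

A positive result only says the request came from the same address block, which other customers of the same provider may share.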

📋 robots.txt Compliance

WikiWix officially states that wikiwix-bot fully respects robots.txt directives, including Disallow, Crawl-delay, and Allow overrides. Public records from the Internet Archive’s crawler logs (2018-2024) show no instances of the bot ignoring disallowed paths. Site operators can block the bot entirely by adding User-agent: wikiwix-bot with Disallow: / to their robots.txt file. Both directives must be written with plain ASCII hyphens, or the token will not match.
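Before deploying a blocking rule, it can help to confirm that it parses the way you expect. The sketch below uses Python's standard urllib.robotparser on an in-memory robots.txt; the file contents, the helper name is_allowed, and the example.com URL are illustrative assumptions. It shows only how a standards-following parser reads the rule, not how wikiwix-bot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block wikiwix-bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: wikiwix-bot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "wikiwix-bot", "https://example.com/page"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

The check passes the bare product token (wikiwix-bot) rather than the full User-Agent header, because this parser matches on the token before the first slash.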

🔍 Detection Indicators

The primary detection indicator is the User‑Agent string: Mozilla/5.0 (compatible; wikiwix-bot/1.0; +https://www.wikiwix.com/en/bot). Additionally, the bot sends a From header set to [email protected] (a verified contact email) and a Referer header of https://www.wikiwix.com/. Behavioral fingerprints include a consistent 10‑second request interval and the absence of cookies or session tokens in requests.
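The header-based indicators above can be collected into a small check. This is a rough sketch, assuming request headers arrive as a plain dictionary; the names UA_PATTERN and detection_signals are hypothetical. All of these headers are client-supplied and can be spoofed, so the result is a hint, not proof of identity.

```python
import re
from typing import Dict

# Matches the product token and version inside the reported User-Agent string.
UA_PATTERN = re.compile(r"\bwikiwix-bot/(\d+(?:\.\d+)*)", re.IGNORECASE)


def detection_signals(headers: Dict[str, str]) -> Dict[str, bool]:
    """Report which of the documented indicators a request exhibits."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    return {
        "user_agent": bool(UA_PATTERN.search(h.get("user-agent", ""))),
        "referer": h.get("referer", "").rstrip("/") == "https://www.wikiwix.com",
        "no_cookies": "cookie" not in h,
    }


if __name__ == "__main__":
    sample = {
        "User-Agent": "Mozilla/5.0 (compatible; wikiwix-bot/1.0; "
                      "+https://www.wikiwix.com/en/bot)",
        "Referer": "https://www.wikiwix.com/",
    }
    print(detection_signals(sample))
```

Returning the individual signals instead of a single verdict lets the caller decide how many indicators must agree before acting.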

📊 Data Usage

Collected data—full page HTML, metadata (title, description, keywords), and HTTP response headers—is used exclusively for the WikiWix search engine index and the WikiWix Web Archive. The archive stores timestamped copies of web pages, enabling users to view historical versions of sites, similar to the Wayback Machine. No data is used for AI training, large‑scale machine learning, or sold to third parties; WikiWix’s privacy policy confirms this limitation.

⚙️ Rate Limiting Policy

While wikiwix-bot is legitimate and non-aggressive, many server administrators rate-limit it because its sustained crawl frequency, up to 360 requests per hour per target domain (one request every 10 seconds, or 6 per minute), can compete with other critical traffic if left unchecked. The recommended threshold for throttling or blocking is 10 requests per minute. That sits above the bot’s normal pace, so it balances the bot’s need to index new content against the server’s stability requirements.
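One way to enforce a per-minute threshold like the one above is a sliding-window counter. The sketch below is an assumption about how an operator might implement it, not a description of any particular product; the class name SlidingWindowLimiter and its parameters are hypothetical. Timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

WINDOW_SECONDS = 60.0
MAX_REQUESTS = 10  # threshold quoted in the text above


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within any `window` seconds."""

    def __init__(self, max_requests: int = MAX_REQUESTS,
                 window: float = WINDOW_SECONDS) -> None:
        self.max_requests = max_requests
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client: str, now: float) -> bool:
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit; the rejected hit is not recorded
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # One request every 10 seconds for an hour: 360 requests in total.
    allowed = sum(limiter.allow("wikiwix-bot", i * 10.0) for i in range(360))
    print(allowed)
```

At the documented pace of one request every 10 seconds, a client never holds more than six timestamps in the window, so it stays under the limit; only bursts faster than 10 per minute are rejected.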


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.