wpbot

Bot User-Agent: wpbot

🤖 Overview

wpbot is a legitimate web crawler operated by the Wikimedia Foundation, the nonprofit organization behind Wikipedia and its sister projects. Its primary purpose is to mirror and archive public web content for inclusion in Wikimedia's offline datasets, such as the Wikipedia dumps used by researchers, educators, and the Kiwix project for offline browsing. First documented in 2021, wpbot is not a search engine crawler but a targeted archival agent that collects pages relevant to improving Wikipedia citations or enriching Wikimedia Commons. According to Wikimedia’s official documentation on the User-Agent policy (found at meta.wikimedia.org/wiki/User-Agent_policy), all bots operated by the foundation must identify themselves clearly, and wpbot complies fully with this requirement.

🌐 Technical Behavior

wpbot performs single-threaded, low-frequency crawls designed to avoid server overload. It uses the HTTP/1.1 protocol with persistent connections and sends a User-Agent string of wpbot/1.0 (or similar versioned variants). Its crawl rate is deliberately set to one request every 5–10 seconds per domain, and it does not follow redirect chains deeper than two hops. IP addresses used by wpbot are drawn from the Wikimedia IP pool, which is announced under autonomous system AS14907 (registered to the Wikimedia Foundation). The bot respects robots.txt Crawl-delay directives if present; when none is set, it falls back to its own self-throttling mechanism. It processes only responses with a 200 OK status and ignores all non‑HTML responses (PDFs, images, etc.) unless they are explicitly allowed in the robots.txt file. wpbot does not crawl login‑protected areas or any path containing /wp-admin, /login, or /session. If a domain returns 503 Service Unavailable or 429 Too Many Requests, the bot avoids that domain for at least 24 hours.
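The IP-pool claim above can be checked against server logs with a simple prefix test. The following is a minimal sketch, not an official tool: the function name `ip_in_prefixes` is hypothetical, and the prefixes shown are placeholders taken from the reserved documentation ranges. They would need to be replaced with the prefixes actually announced by AS14907.

```python
import ipaddress

# ASSUMPTION: placeholder prefixes from the reserved documentation ranges.
# Replace them with the prefixes actually announced by AS14907.
ASSUMED_PREFIXES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_prefixes(ip, prefixes=ASSUMED_PREFIXES):
    """Return True if `ip` falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    for prefix in prefixes:
        network = ipaddress.ip_network(prefix)
        # Only compare addresses of the same IP version.
        if addr.version == network.version and addr in network:
            return True
    return False
```

A request whose source address fails this test while claiming the wpbot User-Agent would not match the network origin described above.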

📋 robots.txt Compliance

Based on Wikimedia’s official Bot Policy (available at meta.wikimedia.org/wiki/Bot_policy), wpbot is programmed to strictly honor all Disallow and Allow directives in robots.txt. The bot checks for updates to robots.txt before each crawl session and does not cache the file for more than one hour. There is no documented evidence of wpbot ignoring robots.txt; it is considered one of the most compliant archival crawlers in operation.
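To see how Disallow, Allow, and Crawl-delay rules would apply to this User-Agent, a site owner can test a robots.txt file locally. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the paths in it are invented for illustration, and the parser reflects Python's rule matching, which may differ in detail from whatever the bot itself implements.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a hypothetical robots.txt; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: wpbot
Crawl-delay: 10
Disallow: /private/
Allow: /

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def wpbot_may_fetch(path, user_agent="wpbot/1.0"):
    """Return True if the parsed rules allow `user_agent` to fetch `path`."""
    return parser.can_fetch(user_agent, path)


def wpbot_crawl_delay(user_agent="wpbot/1.0"):
    """Return the Crawl-delay (in seconds) that applies to `user_agent`."""
    return parser.crawl_delay(user_agent)
```

With these rules, `wpbot_may_fetch("/articles/1")` is allowed, `wpbot_may_fetch("/private/report")` is refused, and `wpbot_crawl_delay()` reports the 10-second delay.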

🔍 Detection Indicators

The primary identifying User‑Agent string for wpbot is wpbot/1.0; older versions may also appear as wpbot/0.9 or WikipediaBot/1.0. The bot does not spoof the X‑Forwarded‑For header, and it includes a From header containing a contact email address. Reverse DNS lookups on its IPs resolve to *.wikimedia.org. Behavioral fingerprints include a consistent request interval of 5–10 seconds and a concurrency of one simultaneous request.
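Because a User-Agent string is trivial to forge, the indicators above are more useful in combination. The following sketch pairs a User-Agent match with a forward-confirmed reverse DNS check. The function names are hypothetical, the regular expression covers only the three forms listed above, and the default resolver used for the forward lookup handles IPv4 only.

```python
import re
import socket

# User-Agent forms named in the text: wpbot/1.0, wpbot/0.9, WikipediaBot/1.0.
UA_PATTERN = re.compile(r"^(wpbot|WikipediaBot)/\d+\.\d+")


def ua_matches(user_agent):
    """Return True if the User-Agent starts with a listed wpbot form."""
    return UA_PATTERN.match(user_agent) is not None


def hostname_in_domain(hostname, domain="wikimedia.org"):
    """Return True if `hostname` is a subdomain of `domain`."""
    host = hostname.rstrip(".").lower()
    return host.endswith("." + domain)


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve `ip`, check the domain, then confirm the hostname
    resolves back to the same address (IPv4 with the default resolvers)."""
    try:
        hostname = reverse(ip)[0]
        if not hostname_in_domain(hostname):
            return False
        return ip in forward(hostname)[2]
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

The `reverse` and `forward` parameters exist so the check can be exercised with stub resolvers instead of live DNS.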

📊 Data Usage

Collected web pages are used exclusively for offline mirroring and archival preservation. The data feeds into the Wikipedia Database Dumps (regularly released at dumps.wikimedia.org) and the Kiwix ZIM archive files. No data is used for AI training, advertising, or commercial analytics. Wikimedia’s privacy policy (foundation.wikimedia.org/wiki/Privacy_policy) confirms that bot‑collected data is never sold or shared with third parties for monetization.

⚙️ Rate Limiting Policy

Although wpbot is inherently self‑limiting, many webmasters still apply threshold‑based rate limiting (e.g., block after 10 requests in 60 seconds) to prevent any accidental resource exhaustion. This is a conservative safety measure because even a well‑behaved bot can be problematic on shared hosting or during peak traffic hours. Rate limiting with a generous allowance ensures the bot’s legitimate archival mission is not disrupted while protecting server stability.
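A threshold rule of this kind can be sketched as a sliding-window counter. The class below is an illustrative assumption, not the implementation of any particular firewall or plugin; its name and parameters are invented for the example. Note that a client sending one request every 5 seconds issues 12 requests per 60 seconds, so the example threshold of 10 would block wpbot at the fastest rate stated for it; a generous allowance would sit above that figure.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests=10, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> accepted timestamps

    def allow(self, client_id, now):
        """Record a request at time `now` (seconds); return False to block."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# One request every 5 seconds for a minute, against two thresholds.
strict = SlidingWindowLimiter(max_requests=10, window_seconds=60.0)
generous = SlidingWindowLimiter(max_requests=15, window_seconds=60.0)
strict_results = [strict.allow("wpbot", float(t)) for t in range(0, 60, 5)]
generous_results = [generous.allow("wpbot", float(t)) for t in range(0, 60, 5)]
```

Here `strict_results` ends with two refusals (the 11th and 12th requests), while `generous_results` accepts all twelve.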



ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.