Citoid

Bot User-Agent: citoid

🤖 Overview

Citoid is a web crawler and metadata-harvesting service operated by the Wikimedia Foundation, developed as part of the VisualEditor and the Citation tool ecosystem. It is designed to automatically retrieve structured bibliographic metadata (e.g., author, title, date, publisher, DOI) from URLs inserted by editors on Wikipedia and other Wikimedia projects, enabling the automatic generation of formatted citations. Citoid was publicly announced in 2014 and is primarily used to streamline the citation workflow in MediaWiki-based sites, reducing manual data entry while maintaining citation accuracy.

🌐 Technical Behavior

Citoid sends HTTP GET requests to URLs supplied by users, typically through the VisualEditor’s citation dialog. It first attempts to extract metadata from embedded JSON-LD, microdata, RDFa, or Open Graph tags; if these are absent, it falls back to parsing HTML meta tags or applying heuristic extraction rules. The crawler does not follow links recursively: it fetches only the single URL requested. Request frequency depends on editor activity rather than a fixed schedule. Individual requests are generally spaced out, but because many editors may trigger Citoid simultaneously, it can generate bursts of requests to a single domain. Requests originate from IP ranges belonging to Wikimedia Foundation servers (e.g., 91.198.174.0/24, 208.80.152.0/22, and others announced by AS14907). Citoid communicates exclusively over HTTPS and does not support session cookies or authentication.
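The fallback order described above can be illustrated with a minimal sketch. It is not Citoid's actual implementation: the class and function names are hypothetical, only the Open Graph and plain meta-tag steps are shown (JSON-LD, microdata, RDFa, and heuristics are omitted), and the meta names checked (citation_title, dc.title, title) are assumed common conventions rather than a confirmed list.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Hypothetical helper: collects Open Graph and plain <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.plain_meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return
        prop = attributes.get("property") or ""
        name = attributes.get("name") or ""
        if prop.startswith("og:"):
            # Keep the first occurrence of each Open Graph property.
            self.open_graph.setdefault(prop[3:], content)
        elif name:
            self.plain_meta.setdefault(name.lower(), content)


def extract_title(html_text):
    """Return (title, source), preferring Open Graph over plain meta tags."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    if "title" in collector.open_graph:
        return collector.open_graph["title"], "open_graph"
    # Assumed fallback names; a real extractor would check many more.
    for key in ("citation_title", "dc.title", "title"):
        if key in collector.plain_meta:
            return collector.plain_meta[key], "meta"
    return None, "none"
```

A page that exposes both og:title and citation_title yields the Open Graph value; a page with neither yields (None, "none"), which is where heuristic extraction would take over.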

📋 robots.txt Compliance

According to official Wikimedia documentation (available at mediawiki.org/wiki/Citoid), Citoid honors robots.txt Disallow directives and will not fetch a URL if robots.txt blocks the User-Agent string Citoid. The service checks the remote server’s robots.txt before every request and respects the Crawl-delay directive if specified. Site operators can block Citoid entirely by adding a User-agent: Citoid line followed by a Disallow: / line to their robots.txt.
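Site operators who want to confirm that such a rule behaves as intended can test it locally. The sketch below uses Python's standard urllib.robotparser; the sample robots.txt content, the is_allowed helper, and the example.org URL are illustrative assumptions, and the result reflects how Python's parser reads the file, not necessarily how Citoid itself does.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block Citoid, allow everyone else.
ROBOTS_TXT = """\
User-agent: Citoid
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, is_allowed(ROBOTS_TXT, "Citoid", "https://example.org/article") is False, while the same call for any other user agent is True.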

🔍 Detection Indicators

The primary identifier is the User-Agent header, which appears as Citoid/1.0 (https://www.mediawiki.org/wiki/Citoid). Older versions may use variations such as Citoid/0.9. The crawler does not send a From header (X-Robots-Tag is a response header, so it never appears in crawler requests), but it does include a standard Accept: text/html,application/xhtml+xml,application/xml;q=0.9 header. Behavioral fingerprints include single-URL fetches without deep crawling and the absence of follow-up requests for embedded resources (images, scripts). The source IP always belongs to a Wikimedia Foundation-owned subnet.
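These indicators can be combined into a simple server-side check. The following is a sketch under stated assumptions: looks_like_citoid is a hypothetical helper, and WIKIMEDIA_NETWORKS contains only the two example ranges quoted on this page, not the full set of prefixes announced by AS14907.

```python
import ipaddress

# Only the example ranges quoted on this page; not an exhaustive list.
WIKIMEDIA_NETWORKS = [
    ipaddress.ip_network("91.198.174.0/24"),
    ipaddress.ip_network("208.80.152.0/22"),
]


def looks_like_citoid(user_agent, remote_ip):
    """Return True if the User-Agent claims Citoid and the IP is in a listed range."""
    if not user_agent.lower().startswith("citoid/"):
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in WIKIMEDIA_NETWORKS)
```

Checking both signals matters because the User-Agent string alone is trivial to spoof; a request that claims to be Citoid but arrives from an unrelated address fails the check.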

📊 Data Usage

Citoid’s collected metadata is used exclusively to generate on-demand citations within Wikimedia editing interfaces. The fetched data is cached temporarily (typically for up to one week) in the Wikimedia infrastructure to improve response times for repeated requests. The crawled content is not stored permanently, and the metadata is not used for AI training, search indexing, or analytics. The only products consuming Citoid data are the MediaWiki VisualEditor and related citation tools.

⚙️ Rate Limiting Policy

Because many editors can trigger Citoid within a short time, it is subject to rate limiting to avoid accidentally overloading target sites. The recommended policy for webmasters is to set a Crawl-delay of at least 1 second in robots.txt, or to block the bot entirely if its aggregate request rate exceeds acceptable thresholds. Rate limiting is justified by Citoid’s bursty nature, not by malicious intent, and threshold-based blocking ensures the service remains non-disruptive.
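One way a webmaster might implement threshold-based limiting on their own server is a sliding-window counter. The sketch below is an assumption about how such a policy could be built, not something Citoid or Wikimedia prescribes; the SlidingWindowLimiter class and its parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to reason about.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allows at most max_requests within any window_seconds interval."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now):
        """Record a request at time `now` (seconds) and report whether to serve it."""
        # Discard requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True
```

With a limit of two requests per second, a burst of three requests within 0.2 seconds would see the third rejected, while a request arriving after the oldest one leaves the window is accepted again. In practice `now` would come from time.monotonic(), and one limiter would be kept per identified bot.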

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.