Discoverybot
Bot User-Agent: discoverybot
🤖 Overview
Discoverybot is a web crawler operated by Discovery, Inc. (now part of Warner Bros. Discovery). It was originally developed to index public web content for Discovery Search, the company’s internal search engine and content aggregation platform. The bot was first documented in the early 2000s and primarily served to populate Discovery’s media library and AI-driven recommendation systems for its network of websites, including Discovery.com, AnimalPlanet.com, and ScienceChannel.com. Over time, its role expanded to include crawling third-party sites for licensed content and audience analytics.
🌐 Technical Behavior
Discoverybot employs a breadth‑first crawl strategy, starting from a seed list of Discovery‑owned domains and periodically expanding to external sources. Official documentation from Discovery’s webmaster guidelines (archived at web.archive.org) indicates a default crawl frequency of 1 request per 2 seconds per host, with bursts of up to 5 requests per second during initial indexing. The bot uses IPv4 ranges primarily from AS20068 (Discovery Communications), along with some AWS‑hosted IPs (AS16509) for distributed crawling. It fully supports HTTP/1.1 and HTTP/2 and respects the `Cache-Control` header, but it does not support `If-Modified-Since` for revalidation. Discoverybot sends only GET requests, follows `302` redirects, and does not execute JavaScript or load images when indexing.
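To illustrate the crawl pattern described above, the following is a minimal sketch of a breadth‑first frontier with a per‑host delay of 2 seconds. It is not Discoverybot's actual implementation: the `LINKS` graph, the `bfs_schedule` function, and the example URLs are hypothetical, and no network requests are made.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory link graph standing in for real HTTP fetches.
LINKS = {
    "https://a.example/": ["https://a.example/x", "https://b.example/"],
    "https://a.example/x": ["https://a.example/y"],
    "https://b.example/": ["https://a.example/x"],
    "https://a.example/y": [],
}

PER_HOST_DELAY = 2.0  # seconds between requests to one host, as stated in the text


def bfs_schedule(seeds, links, delay=PER_HOST_DELAY):
    """Return (url, earliest_send_time) pairs in breadth-first discovery order."""
    queue = deque(seeds)
    seen = set(seeds)
    next_slot = {}  # host -> earliest time the next request may be sent
    plan = []
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        start = next_slot.get(host, 0.0)
        plan.append((url, start))
        next_slot[host] = start + delay
        # Enqueue newly discovered links; already-seen URLs are skipped.
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return plan


if __name__ == "__main__":
    for url, when in bfs_schedule(["https://a.example/"], LINKS):
        print(f"t={when:>4.1f}s  {url}")
```

Each host gets its own schedule, so requests to different hosts can overlap in time while requests to the same host stay at least 2 seconds apart.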
📋 robots.txt Compliance
Discoverybot honors robots.txt directives as documented in its official operator guidelines. It reads the file at the root of every domain before starting a crawl and enforces both `Disallow` and `Crawl-Delay` directives. However, security researchers on the SANS Internet Storm Center noted in a 2021 diary that the bot occasionally ignored `Disallow` rules for pages that also carried `noindex` meta tags, even though `robots.txt` is meant to be the authoritative control. Discovery’s own support pages confirm that `robots.txt` takes absolute precedence and state that the violations were patched in 2022.
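For webmasters who want to check how a rule set would apply to this bot, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser`. The rules, the `/private/` path, the 5‑second delay, and the `www.example.com` host are illustrative assumptions. Python's parser may also differ in detail from how Discoverybot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /private/ for discoverybot and ask for a 5 s delay.
ROBOTS_TXT = """\
User-agent: discoverybot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="discoverybot"):
    """Report whether `agent` may fetch `path` under the sample rules."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


if __name__ == "__main__":
    print(allowed("/shows/index.html"))
    print(allowed("/private/report.html"))
    print(parser.crawl_delay("discoverybot"))
```

The token after `User-agent:` is matched case-insensitively against the product name, so the same group applies to both `Discoverybot/2.0` and `Discoverybot/3.0` in this parser.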
🔍 Detection Indicators
The primary User‑Agent strings are Discoverybot/2.0 and Discoverybot/3.0, with the latter appending a contact email ([email protected]). Behavioral fingerprints include a consistent 2‑second delay between requests and a preference for `.html` and `.xml` extensions. The bot sets the `From` header to the same email that appears in the User‑Agent string, and its `Accept` header is `text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8`. Reverse DNS entries resolve to `crawler-*.discovery.com`.
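The indicators above can be combined into a simple header check. The following is a sketch only: the `indicators` function, the regular expressions, and the placeholder address `crawler@example.invalid` are assumptions for illustration, and the `crawler-*` wildcard is assumed to stand for a single DNS label. A User‑Agent match alone is easy to spoof, so the reverse DNS result should carry the most weight.

```python
import re

# Assumed patterns derived from the indicators listed in the text.
UA_PATTERN = re.compile(r"\bDiscoverybot/(2\.0|3\.0)\b", re.IGNORECASE)
RDNS_PATTERN = re.compile(r"^crawler-[a-z0-9-]+\.discovery\.com\.?$", re.IGNORECASE)
EXPECTED_ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"


def indicators(headers, rdns_host=None):
    """Return which of the documented indicators a request matches.

    `headers` is a mapping of request header names to values; `rdns_host`
    is the PTR name already looked up for the client IP, if any.
    """
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "")
    from_addr = h.get("from", "")
    return {
        "user_agent": bool(UA_PATTERN.search(ua)),
        # The From header should repeat the email embedded in the User-Agent.
        "from_in_user_agent": bool(from_addr) and from_addr in ua,
        "accept": h.get("accept", "").replace(" ", "") == EXPECTED_ACCEPT,
        "reverse_dns": bool(rdns_host and RDNS_PATTERN.match(rdns_host)),
    }


if __name__ == "__main__":
    request = {
        "User-Agent": "Discoverybot/3.0 (crawler@example.invalid)",  # placeholder email
        "From": "crawler@example.invalid",
        "Accept": EXPECTED_ACCEPT,
    }
    print(indicators(request, rdns_host="crawler-12.discovery.com"))
```

In production, the PTR lookup would normally be followed by a forward lookup to confirm that the hostname resolves back to the same IP address.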
📊 Data Usage
Collected data is used for three primary purposes: (1) training Discovery’s internal recommendation algorithms (based on a 2020 patent US20200342288A1), (2) building a metadata index for Discovery Search, and (3) providing aggregate audience analytics for content licensing decisions. The bot does not store raw page content after indexing; it retains only structured metadata and embeddings, a practice detailed in Discovery’s privacy policy.
⚙️ Rate Limiting Policy
While legitimate, Discoverybot can be aggressive on newly discovered domains, making rate‑limiting advisable for webmasters who wish to preserve bandwidth. A threshold of 10 requests per minute per IP is recommended, because the bot often spawns multiple workers from different IPs within the same /24 subnet, which can overwhelm smaller servers without proactive throttling.
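One way to apply the suggested threshold is a sliding‑window counter, sketched below. The `SlidingWindowLimiter` class and its `group_by_24` option are hypothetical. The option counts all IPv4 addresses in one /24 together, which is one possible response to the multi‑worker behavior described above. Timestamps are passed in explicitly so the logic can be tested without waiting.

```python
import ipaddress
from collections import defaultdict, deque

LIMIT = 10      # requests allowed per window, per the recommendation in the text
WINDOW = 60.0   # window length in seconds


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit=LIMIT, window=WINDOW, group_by_24=False):
        self.limit = limit
        self.window = window
        self.group_by_24 = group_by_24
        self._hits = defaultdict(deque)  # key -> timestamps of allowed requests

    def _key(self, ip):
        if self.group_by_24:
            # Collapse an IPv4 address to its enclosing /24 network.
            return str(ipaddress.ip_network(f"{ip}/24", strict=False))
        return ip

    def allow(self, ip, now):
        hits = self._hits[self._key(ip)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(group_by_24=True)
    for i in range(12):
        print(i, limiter.allow(f"192.0.2.{i + 1}", now=float(i)))
```

With `group_by_24=False` the limiter follows the per‑IP threshold exactly as stated; enabling grouping is stricter and may also throttle unrelated clients that share the subnet.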
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.