Diffbot
Bot User-Agent: diffbot
🤖 Overview
Diffbot is a web crawler operated by Diffbot Inc., a company founded in 2009 by Mike Tung that specializes in AI-driven web data extraction. The bot is designed to crawl and parse web pages into structured data using computer vision and natural language processing, feeding into Diffbot’s Knowledge Graph API, which powers semantic search, analytics, and AI training datasets. According to Diffbot’s official documentation at docs.diffbot.com, the crawler indexes billions of pages to populate their graph of entities, relationships, and facts, serving customers in finance, e-commerce, and research.
🌐 Technical Behavior
Diffbot’s crawler is often identified by the User-Agent string “Mozilla/5.0 (compatible; Diffbot/1.0; +http://www.diffbot.com)”. It typically sends requests at a moderate rate, but during large-scale batch indexing jobs it can exceed 100 requests per minute per IP, well above typical human browsing. Based on Diffbot’s published IP ranges (e.g., 45.33.32.0/20 and 72.14.192.0/18 as listed in their support articles), the bot uses both IPv4 and IPv6 addresses. It makes requests over HTTP/1.1, with or without TLS (HTTPS), and uses a combination of GET and POST requests when interacting with APIs. The crawler respects standard robots exclusion rules.
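To check whether a client reaches the figure of more than 100 requests per minute per IP mentioned above, requests can be bucketed by IP and minute. The following is a minimal sketch, assuming access-log entries have already been parsed into (ip, timestamp) pairs; the function names and the default threshold are illustrative and not part of any Diffbot tooling.

```python
from collections import Counter
from datetime import datetime


def requests_per_minute(entries):
    """Count requests per (ip, minute) bucket.

    `entries` is an iterable of (ip, datetime) pairs, assumed to be
    parsed from an access log beforehand.
    """
    counts = Counter()
    for ip, ts in entries:
        # Truncate the timestamp to the start of its minute.
        minute = ts.replace(second=0, microsecond=0)
        counts[(ip, minute)] += 1
    return counts


def bursty_ips(entries, threshold=100):
    """Return the sorted IPs that exceeded `threshold` requests in any minute."""
    counts = requests_per_minute(entries)
    return sorted({ip for (ip, _minute), n in counts.items() if n > threshold})
```

Fixed one-minute buckets keep the sketch simple, but a burst that straddles a minute boundary is split across two buckets and may go unnoticed. A sliding window is more accurate at the cost of more bookkeeping.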
📋 robots.txt Compliance
Diffbot officially states in its support documentation (diffbot.com/support) that it honors Disallow directives found in robots.txt files, provided the directives are syntactically correct and the file is served from the site root (/robots.txt). However, the company notes that because the bot is used for commercial data extraction, sites with restrictive robots.txt files may see only partial compliance, and it offers a custom robots.txt override service for enterprise customers. In practice, Diffbot’s crawler checks robots.txt at the beginning of each crawl session and caches the file for up to 24 hours.
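Before relying on a Disallow rule, it can help to confirm that the rule parses the way you expect. The sketch below uses Python's standard urllib.robotparser against a hypothetical robots.txt; the /private/ path and the example.com URLs are assumptions for illustration. It shows how a standards-following parser reads the file and does not prove how Diffbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Diffbot from /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the product token ("Diffbot"), not the full User-Agent header:
    # the parser only compares the text before the first "/".
    return parser.can_fetch(user_agent, url)
```

For example, is_allowed(ROBOTS_TXT, "Diffbot", "https://example.com/private/report.html") returns False, while the same URL is allowed for any other agent.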
🔍 Detection Indicators
The primary detection indicator is the User-Agent string: “Diffbot/1.0” or variations like “Diffbot/2.0”, often accompanied by a reference URL (http://www.diffbot.com). Behavioral fingerprints include rapid sequential requests to the same domain within minutes, a request mix dominated by HTML with only occasional fetches of CSS, JavaScript, and image files, and few browser-like headers (e.g., missing Accept-Language or Referer headers). Diffbot’s IP blocks can be identified by reverse DNS entries under diffbot.com or via WHOIS lookups listing Diffbot Inc.
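The User-Agent and header indicators above can be checked with a few lines of code. This is a minimal sketch, assuming request headers are available as a plain dictionary; the function names are illustrative. Since a User-Agent string is trivially spoofed, a match is best treated as a hint to be confirmed against IP or reverse DNS evidence.

```python
import re

# Matches the "Diffbot" token with an optional version, e.g. "Diffbot/1.0".
DIFFBOT_UA = re.compile(r"\bdiffbot\b(?:/\d+(?:\.\d+)*)?", re.IGNORECASE)


def _normalize(headers):
    """Lower-case header names, since HTTP header names are case-insensitive."""
    return {name.lower(): value for name, value in headers.items()}


def looks_like_diffbot(headers):
    """Return True if the User-Agent header contains a Diffbot token."""
    user_agent = _normalize(headers).get("user-agent", "")
    return DIFFBOT_UA.search(user_agent) is not None


def missing_browser_headers(headers):
    """List the browser-typical headers that are absent from the request."""
    present = _normalize(headers)
    return [h for h in ("accept-language", "referer") if h not in present]
```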
📊 Data Usage
Collected data is used to build and maintain Diffbot’s Knowledge Graph, a semantic database that extracts entities, relationships, and attributes from web pages. This graph is utilized for AI training (e.g., fine-tuning language models on structured facts), real-time analytics for e-commerce product categorization, and enrichment of business intelligence platforms. Diffbot also offers a Search API that allows third parties to query the extracted data, as documented in their API reference (docs.diffbot.com/docs/api/search).
⚙️ Rate Limiting Policy
Because Diffbot’s crawler can dispatch hundreds of requests per minute during scheduled jobs, site operators commonly rate-limit it to prevent resource exhaustion. Threshold-based blocking (e.g., limiting a single IP to 10 requests per second) is recommended to maintain site availability while still allowing legitimate indexing. Diffbot’s terms of service at diffbot.com/terms acknowledge that customers are expected to configure crawl rates responsibly.
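A per-IP threshold such as the 10 requests per second mentioned above can be enforced with a sliding-window counter. The sketch below is illustrative only: the class name is hypothetical, state is kept in memory for a single process, and the caller supplies the current time in seconds (for example from time.monotonic()). Production setups more often apply such limits in the web server or a reverse proxy.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per IP within any `window_seconds` span."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Return True and record the request if `ip` is under the limit."""
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps hammering the server is admitted again as soon as its earlier accepted requests leave the window. Recording rejections as well would penalize persistent offenders for longer.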
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.