dwaar

Bot User-Agent: dwaar

🤖 Overview

Dwaar is a web crawler operated by Diffbot, a company specializing in AI-powered data extraction and knowledge graph construction. First deployed around 2015, Dwaar systematically indexes publicly accessible web pages to populate Diffbot’s Knowledge Graph and to support its Extract and Analyze APIs, which enable developers to retrieve structured data from raw HTML. Diffbot describes Dwaar as a “high-frequency” crawler designed to keep its knowledge base current by revisiting pages daily or weekly.

🌐 Technical Behavior

Dwaar uses a multi‑threaded, distributed crawling architecture hosted primarily on Amazon Web Services (AWS) EC2 instances; its IP ranges change frequently and span multiple AWS regions. According to Diffbot’s operational documentation, the bot can issue hundreds of requests per second to a single domain, making it one of the most aggressive legitimate crawlers. It speaks HTTP/1.1 over both plain HTTP and HTTPS, uses Last‑Modified and ETag validators to make conditional requests, and sends a From header containing [email protected] for contact. Crawling prioritizes links discovered via sitemaps, anchor tags, and external references, and focuses on text‑heavy, publicly accessible pages rather than multimedia or login‑gated content.
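Because the crawler is described as making conditional requests, a server that answers them correctly can skip re-sending unchanged pages. The following is a minimal sketch of that decision on the server side, not a description of how Dwaar or any particular web server implements it. The function name conditional_status and the plain-dict header format are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag: str) -> str:
    """Drop the W/ prefix so weak and strong forms of a tag compare equal."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers: dict, etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200.

    Hypothetical helper: header names are looked up exactly as written,
    so the caller is assumed to have normalized their capitalization.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is present it takes precedence over
        # If-Modified-Since. Splitting on commas is a simplification.
        candidates = [_strip_weak(t.strip()) for t in if_none_match.split(",")]
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the header
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200
```

A 304 response carries no body, so honoring these validators reduces the bandwidth cost of frequent revisits even when the request rate itself stays high.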

📋 robots.txt Compliance

Diffbot officially states that Dwaar fully respects robots.txt directives, including Disallow rules, and applies its own interpretation of the non‑standard Crawl‑delay directive. However, independent reports (e.g., WebmasterWorld threads from 2017–2023) note that Dwaar may ignore Disallow rules on its first pass over a site before adjusting, and that it does not honor an Allow directive that conflicts with a Disallow pattern. The company recommends explicit Disallow entries for sensitive paths and advises contacting [email protected] if compliance issues arise.
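The recommendation to use explicit Disallow entries can be checked before deployment. The sketch below parses a sample robots.txt with Python's standard urllib.robotparser and confirms which paths a dwaar agent would be denied. The paths, the Crawl-delay value, and the example.com URLs are illustrative assumptions, and the Python parser only shows how a compliant client would read the file, not how Dwaar itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: only Disallow lines are used for the dwaar group,
# since Allow overrides are reported as unreliable for this crawler.
ROBOTS_TXT = """\
User-agent: dwaar
Crawl-delay: 5
Disallow: /private/
Disallow: /admin/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for path in ("/private/report", "/admin/", "/blog/post-1"):
        url = "https://example.com" + path
        print(path, parser.can_fetch("Dwaar/1.0", url))
    print("crawl delay:", parser.crawl_delay("dwaar"))
```

Listing every sensitive prefix as its own Disallow line avoids depending on Allow precedence rules, which is the case the reports above flag as unreliable.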

🔍 Detection Indicators

The primary User‑Agent strings are Dwaar and Dwaar/1.0; older versions also used Mozilla/5.0 (compatible; Dwaar/1.0; +https://www.diffbot.com/crawler). Distinctive headers include From: [email protected], and the current User‑Agent strings contain no browser‑like tokens. Reverse‑DNS lookups on Dwaar IPs typically resolve to *.compute.amazonaws.com or *.ec2.internal. Log analysis may reveal bursts of requests with near‑identical TCP connection timings and runs of HTTP 200 responses to requests that carry no Referer header.
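These indicators can be combined into a simple log classifier. The sketch below is a heuristic under stated assumptions: the Request record, the signal names, and the rule that a User-Agent match plus an AWS reverse-DNS match counts as a hit are all hypothetical choices, and the reverse-DNS hostname is assumed to have been resolved elsewhere. The From header is left out because the contact address is not recoverable here.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches "Dwaar", "Dwaar/1.0", and the legacy "compatible; Dwaar/1.0" form.
UA_PATTERN = re.compile(r"\bdwaar\b", re.IGNORECASE)
# Reverse-DNS suffixes listed in the indicators above.
RDNS_SUFFIXES = (".compute.amazonaws.com", ".ec2.internal")


@dataclass
class Request:
    """Hypothetical, already-parsed access-log entry."""
    user_agent: str
    rdns_host: str = ""  # reverse-DNS name of the client IP, if known
    referer: str = ""


def dwaar_signals(req: Request) -> List[str]:
    """Return the names of the indicators this request matches."""
    signals = []
    if UA_PATTERN.search(req.user_agent):
        signals.append("user-agent")
    if req.rdns_host.lower().rstrip(".").endswith(RDNS_SUFFIXES):
        signals.append("aws-rdns")
    if not req.referer:
        signals.append("no-referer")
    return signals


def looks_like_dwaar(req: Request) -> bool:
    """Require both the User-Agent and the AWS reverse-DNS signal."""
    signals = dwaar_signals(req)
    return "user-agent" in signals and "aws-rdns" in signals
```

Neither signal is proof on its own: a User-Agent string can be forged, and AWS hostnames are shared by many unrelated tenants, so a match is best treated as a reason to look closer rather than as a verdict.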

📊 Data Usage

Collected data is ingested into Diffbot’s proprietary Knowledge Graph, which underpins their AI‑driven Extract and Analyze APIs. Content is parsed, tagged, and stored as structured entities (e.g., articles, products, discussions) used for natural language processing, semantic search, and training Diffbot’s custom language models. Diffbot publicly states that no personal or copyright‑protected material is retained beyond what is publicly visible, and that data is not resold as raw datasets.

⚙️ Rate Limiting Policy

Dwaar is typically rate‑limited because its aggressive crawl rate, reported at hundreds of requests per second, can degrade server performance for smaller websites. The recommended policy is threshold‑based throttling (e.g., a limit of 10 requests per second per IP): requests above the threshold are blocked, while the bot is still allowed through at lower speeds, so legitimate data collection continues without harming site availability.
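One common way to implement a per-IP threshold of this kind is a token bucket. The sketch below assumes the 10 requests per second figure from the example above; the class name PerIpTokenBucket, the burst size, and the in-memory dictionaries are illustrative choices rather than a prescribed implementation, and a production setup would more likely enforce the limit at the proxy or firewall layer.

```python
import time
from typing import Callable, Dict


class PerIpTokenBucket:
    """Allow up to `rate` requests per second per IP, with a small burst."""

    def __init__(self, rate: float = 10.0, burst: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens an IP can accumulate
        self.clock = clock    # injectable so the limiter can be tested
        self._tokens: Dict[str, float] = {}
        self._stamp: Dict[str, float] = {}

    def allow(self, ip: str) -> bool:
        """Return True if the request may proceed, False if it should be blocked."""
        now = self.clock()
        tokens = self._tokens.get(ip, self.burst)
        last = self._stamp.get(ip, now)
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        self._stamp[ip] = now
        if tokens >= 1.0:
            self._tokens[ip] = tokens - 1.0
            return True
        self._tokens[ip] = tokens
        return False


if __name__ == "__main__":
    limiter = PerIpTokenBucket(rate=10, burst=10)
    decisions = [limiter.allow("198.51.100.7") for _ in range(15)]
    print(decisions.count(True), "allowed of", len(decisions))
```

A blocked request would normally receive HTTP 429 so that a well-behaved crawler can back off. Because the bucket refills continuously, the bot keeps access at or below the threshold instead of being banned outright.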


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.