IDBot

Bot User-Agent: idbot

🤖 Overview

IDBot is a web crawler operated by ID Intelligence Inc., a company specializing in large-scale web data collection for training proprietary AI models and enriching enterprise knowledge graphs. First publicly documented in early 2024, the bot indexes publicly accessible text, images, and metadata from news, academic, and e-commerce domains. According to the company’s official blog (id-intelligence.com/blog/idbot-announcement), IDBot supports the development of its ID-Core language model and associated retrieval-augmented generation (RAG) pipelines.

🌐 Technical Behavior

IDBot sends HTTP/1.1 requests that identify the crawler with the product token IDBot/1.0 in the User-Agent header. It waits between 5 and 30 seconds between requests, adjusting the delay to server response times. Its IP ranges are allocated from ASN 398751 (ID-Data) and ASN 41283 (ID-Cloud), with addresses disclosed in the IDBot IP list (github.com/ID-Intelligence/idbot-ips). To reduce redundant fetches, the crawler makes conditional requests: it revalidates stored ETag values (sent in the If-None-Match request header) and sends If-Modified-Since. It also accepts gzip-encoded responses. Analysis of server logs (sourced from the same GitHub repository) shows that IDBot typically crawls at most 50 pages per domain per day and throttles aggressively when it receives 429 Too Many Requests responses.
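
On the server side, conditional requests only save bandwidth if the origin answers them with 304 Not Modified. The following is a minimal sketch of that decision, not IDBot's code or any particular framework's API. The function name is_not_modified, the header dictionary, and the sample ETag are hypothetical and chosen for illustration. Splitting If-None-Match on commas is a simplification that assumes entity tags contain no commas.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def _opaque(tag: str) -> str:
    """Drop the weak-validator prefix so tags can be compared weakly."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers: dict[str, str], etag: str,
                    last_modified: datetime) -> bool:
    """Return True when a 304 Not Modified response is appropriate.

    Assumes last_modified is a timezone-aware UTC datetime.
    """
    headers = {k.lower(): v for k, v in request_headers.items()}

    if_none_match = headers.get("if-none-match")
    if if_none_match is not None:
        # When If-None-Match is present, If-Modified-Since is ignored.
        candidates = [t.strip() for t in if_none_match.split(",")]
        return "*" in candidates or any(
            _opaque(t) == _opaque(etag) for t in candidates
        )

    since = headers.get("if-modified-since")
    if since:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            return False  # Unparseable date: serve the full response.
        if since_dt.tzinfo is None:
            since_dt = since_dt.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        return last_modified.replace(microsecond=0) <= since_dt

    return False


# Illustrative values only.
changed_at = datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
revalidation = {"If-None-Match": 'W/"abc123"'}
print(is_not_modified(revalidation, '"abc123"', changed_at))
print(format_datetime(changed_at, usegmt=True))
```

A handler would return 304 with no body when the function returns True and the full 200 response otherwise.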

📋 robots.txt Compliance

IDBot respects all Disallow directives found in robots.txt, according to the company’s technical documentation (id-intelligence.com/docs/robots-policy). The bot also obeys Crawl-delay directives, applying a minimum delay of 10 seconds, and it honors Allow overrides as written rather than attempting to bypass them. The company reports that its own testing with a public validation tool (github.com/ID-Intelligence/robots-validator) found zero violations across 10,000 randomly selected domains.
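
Site operators who want to check how a rule set would apply to this crawler can try it locally first. The sketch below uses Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the 10-second delay are illustrative assumptions, and the idbot token is the user-agent listed at the top of this page. It shows how a standard parser reads the rules, not how IDBot itself parses them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; paths and delay are assumptions.
ROBOTS_TXT = """\
User-agent: idbot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
private_ok = rules.can_fetch("idbot", "https://example.com/private/report.html")
news_ok = rules.can_fetch("idbot", "https://example.com/news/today.html")
delay = rules.crawl_delay("idbot")
print(private_ok, news_ok, delay)
```

To keep the crawler out of the whole site, the idbot group would use Disallow: / instead of a single path.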

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; IDBot/1.0; +https://id-intelligence.com/bot). Secondary variations include IDBot-Image/1.0 for image fetches. Behavioral fingerprints include a consistent pattern of exactly 3 simultaneous connections per host and a mandatory Accept-Language: en-US,en;q=0.9 header. The bot always includes the custom header X-ID-Crawl-ID, which carries a UUID for traceability.
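
The header-based indicators above can be combined into a simple check. The following is a rough sketch with hypothetical names (idbot_signals, UA_PATTERN) that assumes request headers are available as a plain dictionary. Because any client can copy these headers, a match is best treated as a hint and confirmed against the published IP ranges.

```python
import re
import uuid

# Matches the IDBot/1.0 and IDBot-Image/1.0 tokens listed above.
# Allowing other version numbers is an assumption.
UA_PATTERN = re.compile(r"\bIDBot(?:-Image)?/\d+\.\d+", re.IGNORECASE)


def is_valid_uuid(value: str) -> bool:
    """Return True if value parses as a UUID."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def idbot_signals(headers: dict[str, str]) -> dict[str, bool]:
    """Report which of the documented header indicators a request shows."""
    h = {k.lower(): v for k, v in headers.items()}
    return {
        "user_agent": bool(UA_PATTERN.search(h.get("user-agent", ""))),
        "crawl_id": is_valid_uuid(h.get("x-id-crawl-id", "")),
        "accept_language": h.get("accept-language", "") == "en-US,en;q=0.9",
    }


# Sample request headers; the UUID is made up for the example.
sample = {
    "User-Agent": "Mozilla/5.0 (compatible; IDBot/1.0; +https://id-intelligence.com/bot)",
    "Accept-Language": "en-US,en;q=0.9",
    "X-ID-Crawl-ID": "123e4567-e89b-12d3-a456-426614174000",
}
signals = idbot_signals(sample)
print(signals)
```

Returning the individual signals rather than a single verdict lets the caller decide how many indicators must agree before acting.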

📊 Data Usage

Collected data is used exclusively for training ID-Core language models, improving the ID-Graph enterprise knowledge-base service, and generating synthetic benchmarks for RAG accuracy testing. No personal or copyrighted content is stored beyond 30 days after model training, per the company’s privacy policy (id-intelligence.com/privacy).

⚙️ Rate Limiting Policy

Rate-limiting IDBot is recommended because even compliant bots can overwhelm small servers when multiple instances crawl simultaneously. The rationale is a fair-share model: capping requests at 100 per hour per IP preserves server resources while still allowing legitimate indexing.
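
A minimal sketch of a per-IP cap of 100 requests per hour, assuming a sliding one-hour window. The class name SlidingWindowLimiter and the address 203.0.113.7 (a documentation-only address, not an IDBot IP) are hypothetical. Timestamps are passed in explicitly so the behavior is easy to test; a live server would more likely pass time.monotonic().

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit: int = 100, window_seconds: float = 3600.0):
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # Caller would answer with 429 Too Many Requests.
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
# 101 requests from one address, one second apart.
results = [limiter.allow("203.0.113.7", now=float(i)) for i in range(101)]
print(results.count(True), results[-1])
```

A rejected request would normally receive a 429 response, optionally with a Retry-After header stating when the client may try again.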

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.