rtgibot
Bot User-Agent: rtgibot
🤖 Overview
rtgibot is a web crawler operated by RTGI Inc. (Real-Time Global Intelligence), an AI data company officially documented at rtgi.com. Its primary purpose is to systematically index publicly accessible web content to build large-scale training datasets for machine learning models, particularly large language models (LLMs) and retrieval-augmented generation (RAG) systems. The bot was first observed in early 2023, and many sites have reported its presence in server logs. Because it identifies itself with the user-agent token rtgibot, it can be addressed by name in robots.txt rules.
🌐 Technical Behavior
RTGIBot uses a custom, non-standards-compliant HTTP client based on Python's Requests library and sends every request with a fixed user-agent string. According to crawl patterns shared in webmaster forums, the bot typically makes between 10 and 30 requests per minute per host, but in aggressive configurations it can burst up to 100 requests per minute. It does not honor the Crawl-Delay directive, so any throttling has to come from the site's own rate limiting. The IP ranges used by RTGIBot are drawn primarily from AS210350 (RTGI-DC) and AS197068 (RTGI-CLOUD) and often originate from data centers in the United States, Germany, and Singapore. The bot always requests text/html content and occasionally follows links found in robots.txt. It does not cache responses. It operates over HTTP/1.1 with a keep-alive timeout of 10 seconds and does not set the Accept-Encoding header, which indicates that it does not request compressed responses.
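To compare the request rates quoted above with what a given server actually sees, per-minute counts can be pulled from an access log. The following is a minimal sketch, assuming the log uses the common "combined" format; the function name `requests_per_minute`, the regular expression, and the sample lines (which use documentation-reserved IP addresses) are illustrative and not taken from any RTGI material.

```python
import re
from collections import Counter
from datetime import datetime

# Combined Log Format (assumed):
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def requests_per_minute(lines, ua_token="rtgibot"):
    """Count requests per (client IP, minute) whose User-Agent contains ua_token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or ua_token not in match["ua"].lower():
            continue  # unparsable line or a different client
        ts = datetime.strptime(match["ts"], "%d/%b/%Y:%H:%M:%S %z")
        counts[(match["ip"], ts.strftime("%Y-%m-%d %H:%M"))] += 1
    return counts

# Hypothetical log lines for illustration only.
sample = [
    '203.0.113.7 - - [12/Mar/2024:10:15:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "rtgibot/1.0 (+https://rtgi.com/bot)"',
    '203.0.113.7 - - [12/Mar/2024:10:15:40 +0000] "GET /b HTTP/1.1" 200 734 "-" "rtgibot/1.0 (+https://rtgi.com/bot)"',
    '198.51.100.9 - - [12/Mar/2024:10:15:42 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0"',
]
per_minute = requests_per_minute(sample)
```

Any (IP, minute) bucket well above the 10 to 30 range would suggest the more aggressive configuration described in the text.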
📋 robots.txt Compliance
RTGIBot officially honors Disallow directives, as stated in its documentation at rtgi.com/robots. However, evidence from security advisories (e.g., a SANS ISC diary entry from July 2023) shows that the bot may ignore Crawl-Delay and sometimes fails to re-read updated robots.txt files, which can lead to repeated crawling of disallowed paths for up to 24 hours. In practice, webmasters have reported that adding a dedicated User-agent: rtgibot block and repeating the Disallow directive for each restricted path is effective.
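A dedicated block of the kind webmasters describe could look like the `ROBOTS_TXT` string below. This is a sketch: the paths `/private/` and `/drafts/` and the helper `is_allowed` are placeholders chosen for illustration. The check uses Python's standard `urllib.robotparser`, so it only shows how a compliant parser reads the rules, not how RTGIBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: one Disallow line per restricted path for rtgibot,
# and no restrictions for everyone else. Paths are placeholders.
ROBOTS_TXT = """\
User-agent: rtgibot
Disallow: /private/
Disallow: /drafts/

User-agent: *
Disallow:
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

UA = "rtgibot/1.0 (+https://rtgi.com/bot)"
blocked = is_allowed(ROBOTS_TXT, UA, "https://example.com/private/report.html")
open_path = is_allowed(ROBOTS_TXT, UA, "https://example.com/blog/post.html")
other_bot = is_allowed(ROBOTS_TXT, "SomeOtherBot/2.0", "https://example.com/private/report.html")
```

Given the reported delay in re-reading robots.txt, a rule like this is best paired with server-side enforcement rather than relied on alone.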
🔍 Detection Indicators
The primary detection indicator is the User-Agent string: rtgibot/1.0 (+https://rtgi.com/bot). Secondary fingerprints include the absence of the Referer header, a fixed Accept: text/html header, and the use of a single IP per session. The bot also sets a custom X-RTGI-Crawler header with a numeric session ID. Behavioral fingerprints include requesting robots.txt before crawling each new domain and then making rapid, sequential page requests without randomising the order.
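The header-level fingerprints above can be checked per request. The sketch below is one possible way to do that, assuming request headers are available as a plain dictionary; the function name `rtgibot_signals` and the signal labels are hypothetical. It also checks for a missing Accept-Encoding header, which the Technical Behavior section reports. No single signal other than the User-Agent is specific to this bot, so the result is better treated as a score than as proof.

```python
def rtgibot_signals(headers):
    """Return the fingerprints described in the text that a request's headers match."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value.strip() for name, value in headers.items()}
    signals = []
    if "rtgibot" in h.get("user-agent", "").lower():
        signals.append("user-agent")
    if "referer" not in h:
        signals.append("no-referer")
    if h.get("accept", "") == "text/html":
        signals.append("fixed-accept")
    if h.get("x-rtgi-crawler", "").isdigit():
        signals.append("x-rtgi-crawler")  # numeric session ID present
    if "accept-encoding" not in h:
        signals.append("no-accept-encoding")
    return signals

# Hypothetical requests for illustration.
bot_request = {
    "User-Agent": "rtgibot/1.0 (+https://rtgi.com/bot)",
    "Accept": "text/html",
    "X-RTGI-Crawler": "48213",
}
browser_request = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://example.com/",
}
bot_signals = rtgibot_signals(bot_request)
browser_signals = rtgibot_signals(browser_request)
```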
📊 Data Usage
Collected data is used solely for training RTGI’s proprietary language models and providing aggregated analytics to enterprise customers. According to RTGI’s privacy policy (rtgi.com/privacy), the company does not sell raw data but processes it to generate synthetic training examples, improve RAG pipelines, and fine‑tune domain-specific LLMs. The data is also used to identify emerging topics and trends in publicly available web content.
⚙️ Rate Limiting Policy
RTGIBot is rate-limited because its high request volume can degrade server performance for small websites and because its inconsistent robots.txt compliance may cause access to unintended resources. The recommended policy is a threshold-based block at 40 requests per minute per IP with a 10‑second burst limit, which balances legitimate data collection with server protection.
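The 40 requests per minute per IP threshold can be sketched as a sliding-window counter. This is an illustrative implementation rather than the policy's actual mechanism: the class name `SlidingWindowLimiter` is made up, and because the text does not define how the 10-second burst limit works, that part is deliberately left out.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests=40, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: block this request
        hits.append(now)
        return True

# 45 requests from one IP, one per second: the first 40 pass, the rest are blocked.
limiter = SlidingWindowLimiter()
decisions = [limiter.allow("203.0.113.7", now=float(t)) for t in range(45)]
```

Passing `now` explicitly keeps the example deterministic; in a real deployment it would typically come from `time.monotonic()`.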
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.