AIWebIndex
Bot User-Agent: aiwebindex
🤖 Overview
AIWebIndex is a web crawler operated by AI Web Index Ltd., a UK‑based company established in 2022. It systematically indexes publicly accessible web pages to train large language models and build domain‑specific knowledge corpora. According to the official documentation at aiwebindex.com/docs, the bot feeds data into the company’s proprietary AI training pipeline, which powers its “IndexGPT” product series.
🌐 Technical Behavior
The crawler runs on a distributed fleet hosted primarily on AWS EC2 instances in the us‑east‑1 and eu‑west‑1 regions, with IP ranges announced under AS16509 and AS14618. AIWebIndex uses at most 4 concurrent threads per IP and respects the Crawl‑Delay directive in robots.txt, with a default delay of 10 seconds. It sends HTTP requests with Accept‑Encoding: gzip, deflate and uses TLS 1.3 for all connections. The bot rotates user‑agent strings every 1,000 requests to avoid fingerprinting and follows a breadth‑first crawl strategy, revisiting unchanged pages every 24 hours.
📋 robots.txt Compliance
AIWebIndex fully honors Disallow directives, as verified by independent testing published on GitHub (github.com/ai‑web‑index/robots‑tester). It caches robots.txt for up to one hour and re‑fetches it upon any 301 redirect. The only known exception is that it ignores Allow directives that would override a broader Disallow rule. This is consistent with the original robots.txt specification, which defined no Allow directive.
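A site operator can check how a given robots.txt would apply to this crawler before deploying it. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `/private/` path, and the `example.com` URLs are hypothetical, and the standard-library parser only approximates how a compliant crawler reads the file; it does not reproduce AIWebIndex's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths and the 10-second delay are illustrative only.
ROBOTS_TXT = """\
User-agent: aiwebindex
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The short product token is used here because urllib.robotparser matches on
# the part of the user-agent string before the first "/".
agent = "AIWebIndex/2.0"
private_ok = parser.can_fetch(agent, "https://example.com/private/report.html")
blog_ok = parser.can_fetch(agent, "https://example.com/blog/post.html")
delay = parser.crawl_delay(agent)

print(private_ok, blog_ok, delay)
```

With this file, the `/private/` URL is reported as not fetchable, the blog URL as fetchable, and the crawl delay as 10 seconds.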
🔍 Detection Indicators
The primary User‑Agent strings are AIWebIndex/2.0 and Mozilla/5.0 (compatible; AIWebIndex/2.0; +https://aiwebindex.com/crawler). The bot also sends a custom HTTP header, X‑Crawler‑ID: aiwebindex‑v2, and a From header containing [email protected]. Behavioral fingerprints include a consistent request interval of 10–15 seconds and the absence of Referer headers on initial crawls.
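The header-based indicators above can be combined into a simple check. This is a sketch only: the function name `looks_like_aiwebindex` is hypothetical, header names and values are assumed to use plain ASCII hyphens on the wire, and any client can spoof these headers, so a match is a hint and not proof of origin.

```python
from typing import Mapping

# Values taken from the indicators listed above, written with ASCII hyphens.
UA_TOKEN = "aiwebindex"
CRAWLER_ID_HEADER = "x-crawler-id"
CRAWLER_ID_VALUE = "aiwebindex-v2"


def looks_like_aiwebindex(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers carry an AIWebIndex indicator.

    Header names are compared case-insensitively, as HTTP requires.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()
    crawler_id = normalized.get(CRAWLER_ID_HEADER, "").strip().lower()
    return UA_TOKEN in user_agent or crawler_id == CRAWLER_ID_VALUE


bot_request = {
    "User-Agent": "Mozilla/5.0 (compatible; AIWebIndex/2.0; +https://aiwebindex.com/crawler)",
}
browser_request = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0",
    "Referer": "https://example.com/",
}

print(looks_like_aiwebindex(bot_request))
print(looks_like_aiwebindex(browser_request))
```

Checking the custom header as well as the User-Agent is a deliberate choice here, since the crawler is described as rotating its user-agent strings.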
📊 Data Usage
All collected text, markup, and metadata are processed through an internal deduplication pipeline and then fed into the training of IndexGPT‑4, a large language model fine‑tuned for factual question‑answering. According to the company’s privacy policy (aiwebindex.com/privacy), no personally identifiable information is retained beyond 30 days, and all data is stored in encrypted S3 buckets with access controls audited quarterly.
⚙️ Rate Limiting Policy
Rate limiting is recommended because AIWebIndex can generate up to 10 requests per second per source IP during initial crawls, which may degrade server performance for smaller sites. A threshold of 100 requests per 60 seconds per IP with a 429 response is a standard protective measure, as advised in the official bot documentation under the “Webmaster Best Practices” section.
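The threshold described above (100 requests per 60 seconds per IP, answered with 429) can be sketched as a sliding-window counter. The class name `SlidingWindowLimiter` and its method are hypothetical, state is kept in process memory, and a production setup would more likely enforce the limit in a reverse proxy or shared store; this only illustrates the logic.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each IP."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # IP -> timestamps of accepted requests

    def status_for(self, ip: str, now: Optional[float] = None) -> int:
        """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429
        hits.append(now)
        return 200


limiter = SlidingWindowLimiter(limit=100, window_seconds=60.0)

# Simulate a burst from one IP at t=0, using explicit timestamps for clarity.
burst = [limiter.status_for("203.0.113.7", now=0.0) for _ in range(100)]
over_limit = limiter.status_for("203.0.113.7", now=1.0)
after_window = limiter.status_for("203.0.113.7", now=60.0)

print(set(burst), over_limit, after_window)
```

Rejected requests are not recorded, so a client that keeps hammering the server regains access once its earlier accepted requests age out of the window.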
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.