AIWebIndex

Bot User-Agent: aiwebindex

🤖 Overview

AIWebIndex is a web crawler operated by AI Web Index Ltd., a UK-based company established in 2022. It systematically indexes publicly accessible web pages to train large language models and to build domain-specific knowledge corpora. According to the official documentation at aiwebindex.com/docs, the bot feeds data into the company’s proprietary AI training pipeline, which powers its “IndexGPT” product series.

🌐 Technical Behavior

The crawler runs as a distributed fleet hosted primarily on AWS EC2 instances in the us-east-1 and eu-west-1 regions, with source IP ranges announced under Amazon's AS16509 and AS14618. AIWebIndex limits itself to 4 concurrent threads per IP and respects the Crawl-Delay directive in robots.txt, using a default delay of 10 seconds. It sends HTTP requests with Accept-Encoding: gzip, deflate and uses TLS 1.3 for all connections. The bot rotates user-agent strings every 1,000 requests to avoid fingerprinting, follows a breadth-first crawl strategy, and revisits unchanged pages every 24 hours.
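Because the fleet is described as running on AWS, one way to narrow down candidate traffic is to test whether a client address falls inside a known set of CIDR ranges. The sketch below uses only Python's standard ipaddress module. The ranges shown are reserved documentation prefixes used as placeholders, not real AIWebIndex or AWS ranges, and the function name ip_in_ranges is hypothetical. A match against AWS address space only shows that a request came from AWS, not that it came from this crawler.

```python
import ipaddress
from typing import Iterable

# Placeholder CIDR blocks (reserved documentation prefixes), NOT real crawler ranges.
# In practice you would load the ranges you actually want to match against.
PLACEHOLDER_RANGES = ["203.0.113.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == network.version and addr in network:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("203.0.113.42", "198.51.100.1", "2001:db8::1"):
        print(candidate, ip_in_ranges(candidate, PLACEHOLDER_RANGES))
```

A range check like this works best combined with the header indicators listed under Detection Indicators, since neither signal is conclusive on its own.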

📋 robots.txt Compliance

AIWebIndex fully honors Disallow directives, as verified by independent testing published on GitHub (github.com/ai-web-index/robots-tester). It caches robots.txt for up to one hour and re-fetches it whenever it encounters a 301 redirect. The one known caveat is that it ignores Allow directives that would override a broader Disallow rule. This matches the original robots.txt convention, which defined no Allow directive, but differs from RFC 9309, under which the most specific (longest) matching rule wins.
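To preview how a set of rules would apply to this bot, a robots.txt file can be evaluated locally with Python's standard urllib.robotparser. The sketch below is illustrative only: the robots.txt content and the example.com URLs are assumptions, and the product token aiwebindex is taken from the Bot User-Agent line at the top of this page. Python's parser applies rules in file order, so its result may not match the crawler's own evaluation in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; adapt the paths to your own site.
ROBOTS_TXT = """\
User-agent: aiwebindex
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# can_fetch() matches on the product token, not the full User-Agent header.
blocked = parser.can_fetch("aiwebindex", "https://example.com/private/report.html")
allowed = parser.can_fetch("aiwebindex", "https://example.com/blog/post.html")
delay = parser.crawl_delay("aiwebindex")

print(blocked, allowed, delay)
```

To block the bot from the entire site, the Disallow line in the aiwebindex group would be changed to a single slash.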

🔍 Detection Indicators

The bot identifies itself with two User-Agent strings: AIWebIndex/2.0 and Mozilla/5.0 (compatible; AIWebIndex/2.0; +https://aiwebindex.com/crawler). It also sends a custom HTTP header, X-Crawler-ID: aiwebindex-v2, and a From header containing [email protected]. Behavioral fingerprints include a consistent request interval of 10–15 seconds and the absence of Referer headers on initial crawls.
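The header indicators above can be turned into a simple request classifier. The following is a minimal sketch, assuming request headers are available as a plain string mapping. The function name looks_like_aiwebindex and the matching rules (any AIWebIndex version token, any X-Crawler-ID value starting with aiwebindex-) are assumptions chosen for illustration. Headers are trivial to spoof or omit, so treat a match as a hint rather than proof.

```python
import re
from typing import Mapping

# Matches the product token with any version number, e.g. "AIWebIndex/2.0".
UA_PATTERN = re.compile(r"\bAIWebIndex/\d+(\.\d+)*", re.IGNORECASE)

# Assumed prefix of the custom header value ("aiwebindex-v2" in the text above).
CRAWLER_ID_PREFIX = "aiwebindex-"


def looks_like_aiwebindex(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers carry a self-declared AIWebIndex marker."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    crawler_id = normalized.get("x-crawler-id", "")
    return bool(UA_PATTERN.search(user_agent)) or crawler_id.lower().startswith(
        CRAWLER_ID_PREFIX
    )


if __name__ == "__main__":
    sample = {
        "User-Agent": "Mozilla/5.0 (compatible; AIWebIndex/2.0; +https://aiwebindex.com/crawler)"
    }
    print(looks_like_aiwebindex(sample))
```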

📊 Data Usage

All collected text, markup, and metadata are processed through an internal deduplication pipeline and then fed into the training of IndexGPT‑4, a large language model fine‑tuned for factual question‑answering. According to the company’s privacy policy (aiwebindex.com/privacy), no personally identifiable information is retained beyond 30 days, and all data is stored in encrypted S3 buckets with access controls audited quarterly.

⚙️ Rate Limiting Policy

Rate limiting is recommended because AIWebIndex can generate up to 10 requests per second per source IP during initial crawls, which may degrade server performance for smaller sites. A threshold of 100 requests per 60 seconds per IP with a 429 response is a standard protective measure, as advised in the official bot documentation under the “Webmaster Best Practices” section.
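The threshold described above (100 requests per 60 seconds per IP, answered with a 429) can be sketched as a sliding-window limiter. This is an in-memory illustration under stated assumptions, not a production component: the class name SlidingWindowLimiter and the helper status_for are hypothetical, timestamps are passed in explicitly to keep the logic easy to test, and a real deployment would normally enforce the limit at the web server, CDN, or WAF layer.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP queue of timestamps for requests that were allowed.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def status_for(limiter: SlidingWindowLimiter, ip: str, now: float) -> int:
    """Map the limiter decision to an HTTP status code."""
    return 200 if limiter.allow(ip, now) else 429


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Simulate a burst of 10 requests per second from one address for 12 seconds.
    codes = [status_for(limiter, "203.0.113.7", i * 0.1) for i in range(120)]
    print(codes.count(200), codes.count(429))
```

At the burst rate of 10 requests per second cited above, this limiter starts returning 429 after the first 100 requests and admits new ones only as earlier requests fall out of the 60-second window.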

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.