byindia
Bot User-Agent: byindia
🤖 Overview
ByIndia is a web crawler operated by ByIndia Technologies Pvt Ltd, an Indian artificial intelligence startup based in Bangalore. First publicly documented in March 2024, its primary purpose is to collect publicly available web content for training large language models and building a domain-specific search index for Indic languages. According to the official ByIndia Bot Policy (byindia.ai/bot-policy), the crawler is designed to support the company’s BharatGPT product, which focuses on multilingual AI for Indian regional languages. The bot was introduced alongside a transparency report published on the company’s GitHub repository (github.com/byindia/crawler).
🌐 Technical Behavior
ByIndia requests pages at a default rate of 5 requests per second per IP, with bursts of up to 10 requests per second where the site's robots.txt Crawl-Delay directive permits. The crawler uses HTTP/1.1 and HTTP/2, and respects Last-Modified and ETag headers to reduce load on origin servers. Its IP addresses are allocated from ASN AS152194 (ByIndia Tech), covering a single /22 subnet (103.102.192.0/22), reportedly verified against the RIPE and ARIN databases. The bot follows a breadth-first crawl strategy, prioritizing pages whose Content-Language headers indicate Indian languages (Hindi, Tamil, Bengali, etc.). It also parses JSON-LD and microdata to extract structured data. Requests carry a custom Accept-Language: hi,en;q=0.9 header to signal language preference.
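As a minimal sketch of how a site operator might test a client address against the range quoted above, the following uses Python's standard `ipaddress` module. The function name `is_in_published_range` is hypothetical, and the /22 is copied from the text rather than independently verified, so it should be checked against current WHOIS data before use.

```python
import ipaddress

# Range quoted in the text above (assumption: still current; confirm via WHOIS).
BYINDIA_NETWORK = ipaddress.ip_network("103.102.192.0/22")


def is_in_published_range(client_ip: str) -> bool:
    """Return True if client_ip is a valid address inside the quoted /22."""
    try:
        return ipaddress.ip_address(client_ip) in BYINDIA_NETWORK
    except ValueError:
        # Malformed input is treated as "not in range" rather than raising.
        return False
```

A /22 spans 103.102.192.0 through 103.102.195.255, so `is_in_published_range("103.102.193.17")` is true while `is_in_published_range("103.102.196.1")` is false.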
📋 robots.txt Compliance
According to the official documentation at byindia.ai/crawler, ByIndia fully honors Disallow directives in robots.txt. The bot also supports the Crawl-Delay directive and adjusts its request rate accordingly. A February 2025 audit by the Internet Archive reportedly ranked ByIndia among the top 5% of crawlers for compliance, with no reported violations. The crawler additionally respects X-Robots-Tag headers and noindex meta tags.
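The directives described above can be checked locally before deploying a robots.txt. This sketch uses Python's standard `urllib.robotparser`; the agent token `ByIndiaBot` is an assumption taken from the User-Agent string listed on this page, and the rules, paths, and delay value are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt; the "ByIndiaBot" token is an assumption.
ROBOTS_TXT = """\
User-agent: ByIndiaBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

UA = "ByIndiaBot/1.0"
blocked = parser.can_fetch(UA, "https://example.com/private/page.html")  # False
allowed = parser.can_fetch(UA, "https://example.com/blog/")              # True
delay = parser.crawl_delay(UA)                                           # 10
```

This only shows how a standards-following parser reads the file; whether the crawler itself interprets the rules identically is not something the code can confirm.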
🔍 Detection Indicators
The primary User-Agent string is ByIndiaBot/1.0 (compatible; byindia.ai/crawler). A secondary string, ByIndiaBot-Image/1.0, is used for image fetching. Behavioral fingerprints include the custom Accept-Language header that lists Hindi first and a From header containing a contact email address. The bot also sets a timestamp in the X-ByIndia-Crawl header for internal tracking. Official documentation suggests that site owners verify the requesting IP against the published range via a WHOIS lookup.
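The indicators above could be combined into a simple header check, sketched below. The function name `byindia_signals` and the signal labels are hypothetical. Headers are trivially spoofed, so a match here is only a hint and should be paired with an IP range check.

```python
import re
from typing import List, Mapping

# Matches ByIndiaBot/1.0 and ByIndiaBot-Image/1.0 (any numeric version).
UA_PATTERN = re.compile(r"\bByIndiaBot(?:-Image)?/\d+(?:\.\d+)*", re.IGNORECASE)


def byindia_signals(headers: Mapping[str, str]) -> List[str]:
    """Return the names of the documented indicators present in the headers."""
    h = {name.lower(): value for name, value in headers.items()}
    signals = []
    if UA_PATTERN.search(h.get("user-agent", "")):
        signals.append("user-agent")
    if "x-byindia-crawl" in h:
        signals.append("x-byindia-crawl")
    # First language tag in Accept-Language, with any ;q= weight stripped.
    first_lang = h.get("accept-language", "").split(",")[0].split(";")[0].strip().lower()
    if first_lang == "hi":
        signals.append("accept-language")
    return signals
```

A request carrying all three indicators returns `["user-agent", "x-byindia-crawl", "accept-language"]`; an ordinary browser request with `Accept-Language: en-US,en;q=0.9` returns an empty list.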
📊 Data Usage
Collected web content is used exclusively to train and fine-tune BharatGPT models and to build a search index for the ByIndia Search engine (beta). According to the company’s privacy policy (byindia.ai/privacy), raw page text is stored for up to 180 days, after which only aggregate statistics and model weights are retained. ByIndia also publishes a quarterly transparency report detailing the volume and sources of crawled data.
⚙️ Rate Limiting Policy
ByIndia is rate-limited because it maintains a consistent, high-volume crawl rate that can strain smaller web servers, especially those with limited bandwidth. The recommended threshold for rate-limiting is 20 requests per second per IP; blocking should be applied only after repeated violations of the robots.txt Crawl-Delay setting, as documented in the official admin guide (byindia.ai/admin-rate-limit).
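One way to apply the 20 requests per second per IP threshold is a sliding-window counter, sketched here. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and the in-memory store is an assumption that suits a single process only; this is not the method described in the admin guide.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed hits

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        # `now` can be injected for testing; defaults to a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

With the defaults, the 21st request from one IP inside the same second is rejected, and requests are accepted again once the earlier timestamps age out of the window.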
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.