alkalinebot

Bot User-Agent: alkalinebot

🤖 Overview

Alkalinebot is a web crawler operated by Alkaline, a company focused on AI training data acquisition. Its primary purpose is to collect publicly accessible web content for training large language models and other machine learning systems. The data feeds into Alkaline's proprietary model training pipelines, which are used for natural language understanding, generation, and summarization tasks. According to official documentation on the Alkaline website, the bot is designed to be non-aggressive and respect webmaster preferences.

🌐 Technical Behavior

Alkalinebot sends HTTP/1.1 requests over both HTTP and HTTPS, with a default crawl delay of 2 to 5 seconds per domain that can be adjusted through the robots.txt Crawl-Delay directive. It uses a rotating pool of IP addresses from multiple cloud providers, including AWS and Google Cloud, with ranges documented in the Alkaline IP list published at https://alkaline.ai/ips.txt. The bot does not execute JavaScript, load images, or follow links marked nofollow; it retrieves only static HTML content. Crawl order is breadth-first, prioritizing pages with high inlink counts. Each request carries a custom X-Alkaline-Bot header set to true for identification.
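
Because the User-Agent alone can be spoofed, a request claiming to be Alkalinebot can be cross-checked against the published IP ranges. The following is a minimal sketch, assuming the list is plain text with one CIDR range per line (the actual format of ips.txt is not described here). The ranges shown are reserved documentation blocks used as placeholders, not Alkaline's real addresses, and the function names are illustrative.

```python
import ipaddress

# Placeholder ranges taken from reserved documentation blocks.
# These are NOT Alkaline's real ranges; substitute the published list.
SAMPLE_IP_LIST = """\
# hypothetical contents of ips.txt (assumed: one CIDR per line)
192.0.2.0/24
198.51.100.0/24
2001:db8::/32
"""


def parse_ranges(text):
    """Parse one CIDR per line, skipping blank lines and '#' comments."""
    networks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def ip_in_ranges(client_ip, networks):
    """Return True if client_ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(client_ip)
    # Compare only against networks of the same IP version.
    return any(addr.version == net.version and addr in net for net in networks)


RANGES = parse_ranges(SAMPLE_IP_LIST)
```

A request whose User-Agent names Alkalinebot but whose source address fails `ip_in_ranges` would then be a candidate for closer inspection rather than an automatic pass.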

📋 robots.txt Compliance

Alkalinebot fully honors all Disallow directives as documented in its official robots.txt policy at https://alkaline.ai/robots.txt. It checks robots.txt before every crawl session and respects Crawl-Delay values. No known incidents of non-compliance have been reported in security advisories or webmaster forums.
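
As an illustration of the directives mentioned above, the sketch below feeds a small robots.txt to Python's standard `urllib.robotparser` and checks how the rules are read. The paths, the example.com host, and the delay value are assumptions chosen for the example, and the result reflects how Python's parser interprets the file, which may differ in detail from Alkalinebot's own parser.

```python
from urllib.robotparser import RobotFileParser

# Example rules a site owner might publish (paths and delay are illustrative).
ROBOTS_TXT = """\
User-agent: alkalinebot
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A URL under the disallowed prefix, and one outside it.
blocked = rules.can_fetch("alkalinebot", "https://example.com/private/report.html")
allowed = rules.can_fetch("alkalinebot", "https://example.com/blog/post.html")

# The crawl delay, in seconds, that applies to this user agent.
delay = rules.crawl_delay("alkalinebot")
```

To exclude the bot from an entire site, the same group would use `Disallow: /` instead of a path prefix.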

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Alkalinebot/1.0; +https://alkaline.ai/bot). Behavioral fingerprints include a consistent request to /robots.txt before any other resource, a low request rate (typically under 10 requests per minute per IP), and the absence of a Referer header. The bot also sends Accept-Encoding: gzip and Connection: keep-alive.
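
The header-level indicators can be combined into a simple check. This is a rough sketch rather than a definitive detector: the function name and the signal labels are hypothetical, the regular expression is an assumption about how future version numbers might look, and it also tests the X-Alkaline-Bot header described under Technical Behavior. All of these headers are client-supplied, so a match suggests, but does not prove, that a request comes from Alkalinebot.

```python
import re

# Matches "Alkalinebot/<version>" anywhere in the User-Agent (assumed pattern).
UA_PATTERN = re.compile(r"\bAlkalinebot/\d+(\.\d+)*", re.IGNORECASE)


def alkalinebot_signals(headers):
    """Return the set of Alkalinebot indicators found in request headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    signals = set()
    if UA_PATTERN.search(h.get("user-agent", "")):
        signals.add("user-agent")
    if h.get("x-alkaline-bot", "").strip().lower() == "true":
        signals.add("x-alkaline-bot")
    if "referer" not in h:
        signals.add("no-referer")
    return signals


bot_request = {
    "User-Agent": "Mozilla/5.0 (compatible; Alkalinebot/1.0; +https://alkaline.ai/bot)",
    "X-Alkaline-Bot": "true",
    "Accept-Encoding": "gzip",
    "Connection": "keep-alive",
}
browser_request = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "https://example.com/",
}
```

A missing Referer on its own is weak evidence, since many ordinary clients omit it; it is most useful alongside the other two signals.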

📊 Data Usage

Collected content is used exclusively for training AI models, including generative language models and text classification systems. Alkaline's privacy policy states that data is anonymized and not sold to third parties. The company publishes quarterly transparency reports detailing crawl volumes and domain coverage.

⚙️ Rate Limiting Policy

Although Alkalinebot is designed to be non-aggressive, its broad crawl scope can still consume significant server resources. Rate limiting is recommended to prevent service degradation; threshold-based blocking is justified only if the bot exceeds a reasonable request rate (e.g., 50 requests per minute) despite its built-in delays.
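
One way to apply such a threshold is a sliding-window counter keyed by client. The sketch below is illustrative only: the class name is hypothetical, the default of 50 requests per 60 seconds simply mirrors the example figure above, and state is held in memory for a single process. A production setup would more likely rely on the rate limiting built into the web server or CDN.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests=50, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client key -> request timestamps

    def allow(self, client_key):
        """Record a request and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()  # 50 requests per 60 seconds per client
first_request_allowed = limiter.allow("192.0.2.17")
```

The client key could be the source IP or the User-Agent, depending on whether the limit should apply per address or to the bot as a whole across its rotating IP pool.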

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.