blaiz-bee
Bot User-Agent: blaiz-bee
🤖 Overview
Blaiz-Bee is a web crawler operated by Blaiz Inc., a company specializing in providing high-quality, structured training data for proprietary large language models. First publicly documented in early 2024, the bot collects publicly accessible web content to feed into Blaiz’s internal AI training pipelines and is not associated with any public search engine or consumer product.
🌐 Technical Behavior
The crawler uses a single-threaded, sequential crawl pattern with a default delay of 2 seconds between requests, as documented in Blaiz’s official crawler documentation at docs.blaiz.com/crawlers. It fetches both HTML pages and sitemap XML files, prefers HTTPS connections, and supports HTTP/2. The listed IP range is 203.0.113.0/24, announced via ASN 64500. Both values are reserved for documentation (203.0.113.0/24 is TEST-NET-3 under RFC 5737, and AS64500 falls in the documentation range under RFC 5398), so they are placeholders rather than routable addresses and should not be relied on alone for allowlisting or blocking. Each request carries a stable, fixed User-Agent and always includes the Accept: text/html,application/xhtml+xml header. The bot follows at most 2 redirect hops and respects Cache-Control headers.
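As a rough illustration of checking a client address against the listed block, the sketch below uses Python's standard ipaddress module. The function name in_listed_range is hypothetical, and because 203.0.113.0/24 is a documentation range, a match here shows only the mechanics of the check, not proof that a request came from Blaiz.

```python
import ipaddress

# The block listed for Blaiz-Bee in this entry. It is a documentation
# range, so treat this value as a placeholder, not a verified source range.
LISTED_RANGE = ipaddress.ip_network("203.0.113.0/24")


def in_listed_range(remote_addr: str) -> bool:
    """Return True if remote_addr parses as an IP inside LISTED_RANGE."""
    try:
        return ipaddress.ip_address(remote_addr) in LISTED_RANGE
    except ValueError:
        # Not a valid IP address literal at all.
        return False


print(in_listed_range("203.0.113.42"))  # True
print(in_listed_range("198.51.100.7"))  # False
```

In practice this kind of check would be combined with other signals, since an address match alone is easy to misread and a User-Agent alone is easy to spoof.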
📋 robots.txt Compliance
According to the official Blaiz crawler policy page, Blaiz-Bee fully honors robots.txt directives, including wildcard and path-specific Disallow rules. It also supports the Crawl-delay directive, which lets webmasters set a minimum interval between requests. Public webmaster reports confirm the bot adheres to these settings without exception.
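To preview how a rule set aimed at this crawler would be interpreted, the sketch below parses a robots.txt with Python's standard urllib.robotparser. The robots.txt content, the /private/ path, the example.com URLs, and the 10-second delay are all assumptions chosen for illustration. The standard-library parser does simple prefix matching and does not implement wildcard paths, so it says nothing about how Blaiz-Bee itself evaluates wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the path and the delay value are assumptions.
ROBOTS_TXT = """\
User-agent: Blaiz-Bee
Crawl-delay: 10
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
agent = "Blaiz-Bee/1.0"

print(parser.can_fetch(agent, "https://example.com/private/report.html"))  # False
print(parser.can_fetch(agent, "https://example.com/blog/post.html"))       # True
print(parser.crawl_delay(agent))                                           # 10
```

Running a check like this before deploying a robots.txt change helps confirm the rules say what was intended. Whether the crawler obeys them still has to be verified from server logs.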
🔍 Detection Indicators
The primary User-Agent string is Blaiz-Bee/1.0 (compatible; Blaiz; +https://blaiz.com/bot). A secondary string Blaiz-Bee-Mobile/1.0 is used for mobile-optimized pages. The bot also sends a custom X-Blaiz-Crawl-ID header containing a UUID for request tracking. Behavioral fingerprints include a lack of JavaScript execution and a uniform request interval of exactly 2 seconds.
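The header-based indicators above can be combined into a simple check. The sketch below is a minimal example, assuming request headers arrive as a plain dict with exactly the header names shown; the function name looks_like_blaiz_bee and the regular expression are hypothetical. Both headers can be forged by any client, so a match is a hint and not an identity check.

```python
import re
import uuid

# Matches the two User-Agent prefixes listed in this entry,
# e.g. "Blaiz-Bee/1.0 ..." and "Blaiz-Bee-Mobile/1.0".
UA_PATTERN = re.compile(r"^Blaiz-Bee(-Mobile)?/\d+\.\d+\b")


def looks_like_blaiz_bee(headers: dict) -> bool:
    """Return True if the headers match the documented Blaiz-Bee indicators."""
    if not UA_PATTERN.match(headers.get("User-Agent", "")):
        return False
    try:
        # uuid.UUID is lenient about formatting (it also accepts the hex
        # digits without hyphens), which is acceptable for a heuristic.
        uuid.UUID(headers.get("X-Blaiz-Crawl-ID", ""))
    except ValueError:
        return False
    return True


sample = {
    "User-Agent": "Blaiz-Bee/1.0 (compatible; Blaiz; +https://blaiz.com/bot)",
    "X-Blaiz-Crawl-ID": "123e4567-e89b-12d3-a456-426614174000",
}
print(looks_like_blaiz_bee(sample))  # True
```

Real frameworks usually expose headers through a case-insensitive mapping, so the lookups would need adjusting to match.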
📊 Data Usage
Collected data is processed to extract text corpora for training Blaiz’s internal language models, focusing on diverse domains such as news, forums, and academic articles. The company states that no personal data is retained beyond anonymized tokens, and all data is deleted from raw storage after 90 days per their privacy policy at blaiz.com/privacy.
⚙️ Rate Limiting Policy
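Blaiz-Bee is described as a high-volume data collector that can saturate small servers despite its 2-second delay, so rate limiting is recommended at 10 requests per second per IP. Note that the documented crawl pace of one request every 2 seconds works out to 0.5 requests per second, which means compliant Blaiz-Bee traffic sits far below this threshold; the limit acts as a ceiling against bursts rather than a throttle on normal crawling. The bot does not bypass robots.txt, and it does not alter its rate in response to HTTP 429 status codes, so the limit has to be enforced on the server side rather than left to the crawler.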
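One way to enforce a 10 requests per second per IP ceiling is a sliding-window counter. The sketch below is a minimal in-memory example, not a production limiter: the class name SlidingWindowLimiter is hypothetical, timestamps are passed in explicitly to keep the logic easy to follow, and state is never shared across processes or evicted for idle clients.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the caller would typically respond with HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
# A client pacing itself at one request every 2 seconds never hits the limit.
print(all(limiter.allow("203.0.113.42", t) for t in (0.0, 2.0, 4.0)))  # True
```

A burst of more than 10 requests from one address inside a single second is rejected from the 11th request onward until older requests age out of the window.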
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.