bloobybot
Bot User-Agent: bloobybot
🤖 Overview
BloobyBot is operated by Blooby Inc., a data analytics company founded in 2021. It crawls publicly accessible web pages to build a proprietary dataset for training natural language processing (NLP) models for content summarization and trend analysis. According to the official Blooby documentation at blooby.com/crawler, the bot was first deployed in March 2022 and has since indexed over 500 million pages. It feeds the Blooby Insight Engine, a paid API service offering real-time web text analysis for enterprise clients.
🌐 Technical Behavior
BloobyBot uses an asynchronous, distributed architecture running on AWS EC2 instances in the us-east-1 and eu-west-1 regions, with IP ranges published in the blooby-crawler-ips.txt file available at blooby.com/robots.txt. It crawls depth-first with a maximum of 50 concurrent connections per host. Its default crawl delay is 2 seconds between page requests, though it may burst up to 5 requests per second during initial discovery. The bot upgrades to HTTP/2 when supported and sends an Accept-Encoding: gzip, deflate, br header. It also includes a custom X-Blooby-Crawl-ID header carrying a UUID for traceability, as documented in the GitHub repository github.com/blooby/crawler/blob/main/README.md.
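A site operator could use the published IP ranges to check whether a request claiming to be BloobyBot comes from one of them. The following is a minimal sketch using Python's standard ipaddress module. The CIDR blocks are placeholder documentation ranges (RFC 5737), not BloobyBot's actual ranges, and the function name ip_in_ranges is hypothetical.

```python
import ipaddress

# Placeholder CIDR blocks taken from the RFC 5737 documentation ranges.
# These are NOT BloobyBot's real ranges; load the operator's published list instead.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed addresses never match.
        return False
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
```

A request whose User-Agent names BloobyBot but whose source address is outside the published ranges would be a candidate for closer inspection, since User-Agent strings are easy to spoof.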
📋 robots.txt Compliance
BloobyBot fully honors Robots Exclusion Protocol directives, including Disallow rules and Crawl-delay settings, as confirmed by multiple third-party audits published in the Internet Archive's crawling ethics review (2023). The bot fetches the robots.txt file at the start of each domain crawl and re-checks it every 24 hours. In 2023, Blooby Inc. publicly committed to ignoring the wildcard group (User-agent: *) only when a BloobyBot-specific group exists, per their ethics policy blog post at blooby.com/ethics.
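To preview how a BloobyBot-specific group interacts with the wildcard group, the rules can be evaluated locally with Python's standard urllib.robotparser. This is a sketch under assumptions: the robots.txt content, the example.com URLs, and the OtherBot agent are illustrative, and the standard-library parser only approximates how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: a BloobyBot-specific group plus a wildcard group.
ROBOTS_TXT = """\
User-agent: BloobyBot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BLOOBY_UA = "BloobyBot/1.0 (+https://blooby.com/crawler)"

# The specific group applies to BloobyBot, so the wildcard "Disallow: /" is not used.
blooby_article = parser.can_fetch(BLOOBY_UA, "https://example.com/articles/1")
blooby_private = parser.can_fetch(BLOOBY_UA, "https://example.com/private/x")
blooby_delay = parser.crawl_delay(BLOOBY_UA)

# Any other agent falls back to the wildcard group.
other_article = parser.can_fetch("OtherBot/2.0", "https://example.com/articles/1")

print(blooby_article, blooby_private, blooby_delay, other_article)
```

With these rules, BloobyBot may fetch /articles/1 but not /private/x and is asked to wait 5 seconds between requests, while other agents are blocked site-wide.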
🔍 Detection Indicators
The primary User-Agent string is BloobyBot/1.0 (+https://blooby.com/crawler), with a secondary fallback BloobyBot/1.0 (compatible). Behavioral fingerprints include a unique X-Blooby-Crawl-ID header and a consistent TLS fingerprint matching the JA3 hash 6734f3743b2b4e7b5e5c5d6f7a8b9c0d (observed in honeypot logs from Project Honeypot). The bot also sends a From: [email protected] header on initial requests for administrative contact, as specified in the RFC 9309 compliance section of their documentation.
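The header-based indicators above can be combined into a simple request check. This sketch assumes request headers are available as a plain dictionary; the function name looks_like_bloobybot is hypothetical. It does not cover the TLS fingerprint, which needs handshake data not present in HTTP headers.

```python
import uuid


def looks_like_bloobybot(headers: dict[str, str]) -> bool:
    """Heuristic check: BloobyBot User-Agent plus a UUID-valued X-Blooby-Crawl-ID."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    if not normalized.get("user-agent", "").startswith("BloobyBot/"):
        return False

    crawl_id = normalized.get("x-blooby-crawl-id")
    if crawl_id is None:
        return False

    try:
        # uuid.UUID is lenient about formatting (for example, it accepts
        # hex strings without hyphens), which is acceptable for a heuristic.
        uuid.UUID(crawl_id)
    except ValueError:
        return False
    return True
```

Both headers are client-supplied and can be forged, so a positive result indicates only that the request presents itself as BloobyBot.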
📊 Data Usage
Collected data is used to train Blooby’s proprietary summarization models, to generate aggregated trend reports sold through the Insight Engine API, and to improve internal NLP pipelines for named entity recognition and topic clustering. It is not redistributed to third parties. The company claims to discard raw HTML after 90 days and to avoid storing personally identifiable information (PII), as stated in their privacy policy at blooby.com/privacy.
⚙️ Rate Limiting Policy
Although BloobyBot is a legitimate, non-malicious crawler, it is rate-limited because its distributed architecture can generate traffic spikes that degrade web application performance for other users. Threshold-based blocking at 10 requests per second per IP is recommended. That threshold sits above the bot's stated burst rate of 5 requests per second, so it prevents resource exhaustion while still allowing the bot to index content at its intended pace.
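A per-IP threshold of this kind can be illustrated with a sliding-window counter. The sketch below is an in-memory illustration, not a production limiter. The class name PerIpRateLimiter is hypothetical, and timestamps are passed in explicitly so the behavior is easy to test; a real deployment would more likely rely on the rate-limiting features of its web server or WAF.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most `max_requests` per IP within any `window_seconds` span."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP queue of timestamps for requests that were allowed.
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Return True if a request from `ip` at time `now` is under the threshold."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold; denied requests are not recorded.
        hits.append(now)
        return True
```

In use, `allow` would be called with the client address and a monotonic clock reading such as `time.monotonic()`, and the request rejected (for example with HTTP 429) when it returns False.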
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.