ocelli

Bot User-Agent: ocelli

🤖 Overview

ocelli is a legitimate web crawler operated by Ocelli Inc., a data services company headquartered in the United States. It collects publicly accessible web content for training proprietary AI language models and for large-scale text analytics. According to the official Ocelli documentation at ocelli.ai, the bot was first deployed in 2022 and feeds the company's internal machine learning pipeline, which is not released publicly as a product but is used for commercial text generation and classification services.

🌐 Technical Behavior

ocelli performs broad, depth-first crawls of public websites, typically fetching pages at a moderate rate of 10–20 requests per second per IP, as documented in the vendor's crawling ethics guidelines. The bot uses standard HTTP/1.1 over both HTTP and HTTPS. Its IP addresses are drawn from cloud providers such as Amazon Web Services and Google Cloud, and the CIDR blocks it uses (e.g., 34.64.0.0/10) are published at ocelli.ai/network-whitelist.txt. The crawler respects Crawl-delay directives in robots.txt and does not crawl paths marked with Disallow. It also sends a From header containing a contact email for abuse reporting, as verified in community forums.
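A request's source address can be checked against the published ranges before it is trusted as genuine ocelli traffic. The following is a minimal sketch using Python's standard ipaddress module. It hard-codes only the example block quoted above; the list name OCELLI_NETWORKS and the function name are hypothetical, and a real deployment would need to load the full, current list from the vendor's published file.

```python
import ipaddress

# Assumption: only the example block quoted in the text is listed here.
# The complete, current list would have to come from the vendor's file.
OCELLI_NETWORKS = [ipaddress.ip_network("34.64.0.0/10")]


def ip_in_published_range(client_ip: str) -> bool:
    """Return True if client_ip falls inside any listed network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed address: treat it as not matching.
        return False
    return any(addr in net for net in OCELLI_NETWORKS)
```

An address outside the listed networks that still presents an ocelli User-Agent is a reasonable candidate for closer inspection, since User-Agent strings alone are easy to forge.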

📋 robots.txt Compliance

ocelli fully honors robots.txt directives, as confirmed by its published policy at ocelli.ai/robots.txt and by independent monitoring projects such as robots-txt-checker. The bot will not crawl any disallowed URL, and when a Crawl-delay value is specified it waits that long between requests. Separately from robots.txt, Ocelli Inc. provides a dedicated opt-out form for site owners who wish to be excluded entirely.
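Site owners can test how a compliant crawler would read their rules before publishing them. The sketch below uses Python's standard urllib.robotparser on an illustrative robots.txt. It assumes the robots.txt token is "ocelli" (matching the Bot User-Agent listed at the top of this page); the /private/ path and the 10-second delay are made-up example values.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: the path and delay are example values.
ROBOTS_TXT = """\
User-agent: ocelli
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "ocelli") -> bool:
    """Return True if the given agent may fetch the path under ROBOTS_TXT."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed("/private/report.html"))  # blocked for ocelli
    print(allowed("/blog/post"))            # no matching Disallow rule
    print(parser.crawl_delay("ocelli"))     # delay in seconds, if set
```

This only shows how the standard-library parser interprets the rules; whether the crawler itself behaves the same way depends on the vendor's implementation.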

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Ocelli/1.0; +https://ocelli.ai), and a secondary string Ocelli/1.0 is used for non-browser requests. Behavioral fingerprints include a consistent Accept header of text/html,application/xhtml+xml and an X-Ocelli-Bot custom header set to true. The crawler does not execute JavaScript and typically requests pages in a single HTTP GET without cookies.
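The indicators above can be combined into a simple header check. This is a rough sketch, not a vendor-supplied detector: the function name and the regular expression are assumptions, and the pattern accepts any Ocelli/<version> token rather than only 1.0.

```python
import re

# Assumption: any "Ocelli/<version>" token counts, not just "Ocelli/1.0".
UA_PATTERN = re.compile(r"\bOcelli/\d+(\.\d+)*", re.IGNORECASE)


def looks_like_ocelli(headers: dict[str, str]) -> bool:
    """Return True if the request headers carry an ocelli indicator."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    ua_match = bool(UA_PATTERN.search(normalized.get("user-agent", "")))
    custom_header = normalized.get("x-ocelli-bot", "").strip().lower() == "true"
    return ua_match or custom_header


if __name__ == "__main__":
    sample = {"User-Agent": "Mozilla/5.0 (compatible; Ocelli/1.0; +https://ocelli.ai)"}
    print(looks_like_ocelli(sample))
```

Header values are trivially spoofed, so a match is a hint rather than proof of origin.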

📊 Data Usage

Data collected by ocelli is used exclusively for AI training and natural language processing models developed by Ocelli Inc., as stated in the company's privacy policy at ocelli.ai/privacy. The company does not sell the raw data to third parties but uses it to improve the internal language models that power its commercial text analysis API. It also provides aggregated, anonymized statistics for academic research.

⚙️ Rate Limiting Policy

Rate limiting is applied to ocelli because its sustained crawl volume can degrade server performance for other users. The vendor recommends threshold-based blocking (e.g., when the bot exceeds 100 requests per minute), which maintains site stability without permanently denying access to this legitimate bot.
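One way to implement a threshold of this kind is a sliding-window counter keyed by client. The sketch below is an illustration under stated assumptions rather than the vendor's recommended implementation: the class name, the in-memory storage, and the choice of a sliding window (instead of, say, a token bucket) are all assumptions. The defaults mirror the 100 requests per minute example.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-client timestamps of accepted requests, oldest first.
        self._hits = defaultdict(deque)

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject, e.g. with HTTP 429
        hits.append(now)
        return True
```

Because old timestamps expire, a blocked client regains access on its own once its request rate drops, which matches the goal of throttling without a permanent ban. The client key could be an IP address or a User-Agent token, depending on how the site identifies the bot.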

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.