Kozmosbot

Bot User-Agent: kozmosbot

🤖 Overview

Kozmosbot is a web crawler operated by Kozmos, a company specializing in AI‑powered web data extraction and dataset creation. First publicly documented in 2023, its primary purpose is to collect publicly accessible web content to feed into Kozmos’s data platform, which provides structured, high‑quality datasets for training large language models and other AI systems.

🌐 Technical Behavior

Kozmosbot performs both broad and targeted crawling, typically sending 10–20 requests per second per IP. It respects robots.txt and uses a configurable crawl delay. Requests originate from IP ranges registered to Amazon Web Services (AWS) and Google Cloud Platform (GCP), with ASNs such as AS14618 (AWS) and AS15169 (GCP). The bot identifies itself with the User‑Agent string “Mozilla/5.0 (compatible; Kozmosbot/1.0; +https://kozmos.com/bot)” and supports HTTPS connections. It also sends an Accept‑Language header of “en‑US,en;q=0.9” and a Referer header set to the bot’s documentation page. The crawler does not execute JavaScript and fetches only static HTML content.

📋 robots.txt Compliance

Kozmosbot fully honors Disallow directives in robots.txt files, as documented on its official page at https://kozmos.com/bot. It also respects Crawl‑delay directives. To block it entirely, add a “User‑agent: Kozmosbot” group containing “Disallow: /” to the site’s robots.txt.
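
Before deploying a blocking rule, it can help to confirm that it parses the way you expect. The sketch below uses Python's standard urllib.robotparser on an in-memory robots.txt. The Kozmosbot group comes from the text above; the wildcard group, its Crawl-delay value, and the example.com URLs are illustrative assumptions. It only shows how a standards-following parser reads the file, not how Kozmosbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The Kozmosbot group is from the text above. The "*" group and its
# Crawl-delay of 5 are hypothetical values added for illustration.
ROBOTS_TXT = """\
User-agent: Kozmosbot
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Pass the product token, not the full "Mozilla/5.0 (...)" string:
# the parser matches on the text before the first "/".
kozmos_allowed = rules.can_fetch("Kozmosbot", "https://example.com/index.html")
other_allowed = rules.can_fetch("SomeOtherBot", "https://example.com/index.html")
other_delay = rules.crawl_delay("SomeOtherBot")

print(kozmos_allowed, other_allowed, other_delay)
```

With these rules, the parser reports Kozmosbot as disallowed everywhere, while other agents fall through to the wildcard group.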

🔍 Detection Indicators

The primary detection indicator is the User‑Agent string, either “Kozmosbot/1.0” or “Mozilla/5.0 (compatible; Kozmosbot/1.0; +https://kozmos.com/bot)”. Additional fingerprints include sequential URL fetching without referer spoofing and, reportedly, an HTTP header “X‑Robots‑Tag” with the value “noindex” when the bot is not allowed; because X‑Robots‑Tag is ordinarily a response header set by the site rather than a request header sent by a crawler, treat this indicator with caution. Kozmos publishes the bot’s IP ranges on its website.
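
The two strongest indicators above, the User-Agent token and the source IP, can be combined into a simple check. The following is a minimal sketch, not an official detection method: the function names are hypothetical, and the CIDR ranges are RFC 5737 documentation placeholders that would need to be replaced with the ranges Kozmos actually publishes.

```python
import ipaddress
import re

# Matches the product token in both documented User-Agent forms.
KOZMOSBOT_UA = re.compile(r"\bKozmosbot/\d+(?:\.\d+)*", re.IGNORECASE)

# PLACEHOLDER ranges (RFC 5737 documentation networks), NOT real Kozmos
# ranges. Substitute the list published by the operator.
PLACEHOLDER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def claims_kozmosbot(user_agent: str) -> bool:
    """True if the User-Agent header claims to be Kozmosbot."""
    return bool(KOZMOSBOT_UA.search(user_agent or ""))


def ip_in_ranges(ip: str, ranges=PLACEHOLDER_RANGES) -> bool:
    """True if the client IP falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


def classify(user_agent: str, ip: str) -> str:
    """Label a request by combining the User-Agent claim and the IP check."""
    if not claims_kozmosbot(user_agent):
        return "not-kozmosbot"
    return "verified" if ip_in_ranges(ip) else "unverified"


full_ua = "Mozilla/5.0 (compatible; Kozmosbot/1.0; +https://kozmos.com/bot)"
print(classify(full_ua, "192.0.2.10"))
print(classify("Kozmosbot/1.0", "203.0.113.5"))
```

A User-Agent header is trivial to forge, so a request that claims to be Kozmosbot from an address outside the published ranges is labeled "unverified" rather than trusted.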

📊 Data Usage

Data collected by Kozmosbot is used to build structured datasets for training language models and other machine learning applications. The aggregated and normalized datasets are also made available to Kozmos customers via APIs and downloads. Kozmos claims to filter out personally identifiable information (PII) and to respect copyright, though site owners should verify for themselves whether their content is included.

⚙️ Rate Limiting Policy

Because Kozmosbot can generate significant crawling volume (up to 20 requests per second per IP), many web applications rate‑limit it to prevent resource exhaustion. A common approach is to allow a baseline of 50 requests per minute per IP and to throttle or block traffic that exceeds that threshold, which preserves fair access while protecting server stability.
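
For scale, 20 requests per second works out to 1,200 requests per minute, or 24 times the 50-per-minute baseline. One way to enforce such a baseline is a per-IP sliding window, sketched below. The class name, the in-memory storage, and the injectable clock are assumptions made for illustration; a production setup would more likely rely on the web server, a CDN, or a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_seconds` span."""

    def __init__(self, limit=50, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the baseline: throttle or block
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=50, window_seconds=60.0)
decisions = [limiter.allow("192.0.2.10") for _ in range(51)]
print(decisions.count(True), decisions.count(False))
```

Rejected requests are not recorded, so a client that keeps hammering the server is readmitted as soon as its earlier allowed requests age out of the window.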

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.