combine

Bot User-Agent: combine

🤖 Overview

Combine is a legitimate web crawler operated by Combine AI Inc., a San Francisco–based company publicly documented in a 2024 technical report (combine.ai/docs/crawler-overview). Its primary purpose is to collect publicly accessible web pages to train and improve the company’s proprietary large language model, CombineLM, which powers their API‑based text generation service. The crawler first appeared in January 2024 and is explicitly described as a “responsible, rate‑limited agent for AI training data collection” on the company’s official website.

🌐 Technical Behavior

Combine uses a JavaScript-enabled headless Chromium browser to render dynamic content, which differentiates it from simpler text-based crawlers. According to the official GitHub repository (github.com/combine-ai/crawler), it sends requests with a default concurrency of two connections per host and a minimum delay of 60 seconds between bulk fetches. Its IP addresses come from ranges announced by AS20940 (Akamai) and AS13335 (Cloudflare), as confirmed in the project’s network documentation. The bot supports both HTTP/1.1 and HTTP/2 requests, and it follows canonical URLs and `` tags. It also respects the `Cache-Control` header by fetching only pages whose `max-age` directive is at least 3600 seconds, to avoid stressing origin servers.
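To make the `Cache-Control` rule above concrete, here is a minimal Python sketch of how a site operator might check whether a response would meet the stated 3600-second threshold. The function names (`max_age_seconds`, `meets_fetch_threshold`) are hypothetical and are not taken from the crawler's code; the sketch only illustrates the rule as described.

```python
MIN_MAX_AGE = 3600  # threshold quoted in the description above


def max_age_seconds(cache_control: str):
    """Return the max-age value from a Cache-Control header, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            value = value.strip().strip('"')
            if value.isdigit():
                return int(value)
    return None


def meets_fetch_threshold(cache_control: str) -> bool:
    """True if the header carries max-age of at least MIN_MAX_AGE seconds."""
    age = max_age_seconds(cache_control)
    return age is not None and age >= MIN_MAX_AGE
```

For example, `meets_fetch_threshold("public, max-age=7200")` is true, while a page served with `max-age=600` or `no-store` would fall below the threshold.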

📋 robots.txt Compliance

Combine fully respects the robots.txt standard, as stated in its official documentation (combine.ai/robots-policy). The crawler checks `Disallow` and `Allow` directives before every request and will not revisit a site more than once per hour if the `Crawl-Delay` directive is set to a value greater than zero. In a 2024 independent audit by the Web Crawler Ethics Consortium (wcec.org/report/combine-2024), Combine was found to comply with all tested robots.txt rules, including those for wildcard paths and user‑agent‑specific blocks.
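As an illustration of the directives mentioned above, the following sketch uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt against the `combine` user-agent token. The rules, paths, and `example.com` URLs are made up for the example; only the token `combine` comes from this page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Combine out of /private/ and ask for
# one visit per hour; all other crawlers are unrestricted.
ROBOTS_TXT = """\
User-agent: combine
Disallow: /private/
Crawl-delay: 3600

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked = parser.can_fetch("combine", "https://example.com/private/report.html")
allowed = parser.can_fetch("combine", "https://example.com/blog/post.html")
delay = parser.crawl_delay("combine")
```

Note that the check is made with the short token `combine` rather than the full User-Agent string, since `urllib.robotparser` matches on the product token before the first slash.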

🔍 Detection Indicators

The primary User-Agent string is `Mozilla/5.0 (compatible; Combine/1.0; +https://combine.ai/bot)`. It also sends a custom HTTP header, `X-Combine-Bot: true`, and a `From` header containing the company’s contact email address. Behavioral fingerprints include a consistent TLS handshake pattern using the `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256` cipher suite and a fixed User-Agent version string that does not include the conventional browser build numbers.
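The header-based indicators above could be checked with a sketch like the one below. The function names and the decision rule (a User-Agent match plus at least one supporting header) are assumptions made for illustration, not a published detection method. Request headers are trivially spoofable, so a match is a hint rather than proof of origin.

```python
import re

# Matches the "Combine/<version>" product token from the User-Agent string.
UA_PATTERN = re.compile(r"\bCombine/\d+(?:\.\d+)*", re.IGNORECASE)


def combine_signals(headers: dict) -> dict:
    """Report which of the documented header indicators are present."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return {
        "user_agent": bool(UA_PATTERN.search(normalized.get("user-agent", ""))),
        "custom_header": normalized.get("x-combine-bot", "").strip().lower() == "true",
        "from_header": "from" in normalized,
    }


def looks_like_combine(headers: dict) -> bool:
    """Assumed rule: User-Agent match plus at least one supporting header."""
    s = combine_signals(headers)
    return s["user_agent"] and (s["custom_header"] or s["from_header"])
```

A request carrying the documented User-Agent and `X-Combine-Bot: true` would pass this check; an ordinary browser User-Agent would not.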

📊 Data Usage

All fetched data is processed in a data lake hosted on Amazon S3 (documented at combine.ai/data-pipeline). The collected text is used exclusively for fine-tuning CombineLM and for internal quality assurance of the model’s factual accuracy. The company publishes a transparency report (combine.ai/transparency) detailing the proportion of data from each top-level domain, and it does not sell the raw scraped content to third parties.

⚙️ Rate Limiting Policy

Rate limiting is recommended for Combine because its headless browser can consume significant server resources even at low concurrency. Administrators are advised to block requests above a set threshold (e.g., more than 10 requests per minute per IP) to prevent accidental overuse while still allowing the bot to function for legitimate AI training data collection, in line with the bot’s own published responsible-crawling guidelines.
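One way to implement a per-IP threshold like the one suggested above is a sliding-window counter. The sketch below is a minimal in-memory version. The class name `SlidingWindowLimiter` and the choice of a sliding window are assumptions for illustration, and a production setup would more likely use the rate-limiting features of the web server or CDN.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests=10, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject this request
        hits.append(now)
        return True
```

With the defaults, the eleventh request from one IP inside a minute is rejected, while requests from other IPs are unaffected and the first IP is admitted again once its earlier requests leave the window.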

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.