openbot

Bot User-Agent: openbot

🤖 Overview

Openbot is a legitimate web crawler operated by OpenAI, first publicly documented in August 2023 alongside the announcement of GPTBot. Its primary purpose is to collect publicly accessible text and metadata from the web to train and improve OpenAI's large language models, including the GPT series. The bot feeds data directly into OpenAI's training pipeline and is distinct from OpenAI's search-indexing crawler, OAI-SearchBot.

🌐 Technical Behavior

Openbot crawls linked pages recursively in breadth-first order, issuing GET requests at roughly 10–20 requests per second per source IP. It uses HTTP/1.1 with keep-alive and supports HTTPS. Its source addresses are reported to fall within blocks such as 23.99.0.0/18 and 104.40.0.0/13, though addresses may vary. Each request carries a User-Agent header containing the token Openbot/1.0 and a From header containing [email protected]. Crawl depth is typically limited to 3–4 levels per domain, and the bot honors Last-Modified and ETag headers to avoid downloading unchanged pages again.
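As a rough illustration of how a site operator might test a client address against the blocks quoted above, the following sketch uses Python's standard `ipaddress` module. The two CIDR ranges are copied from this page and have not been independently verified; the function name `in_listed_ranges` is hypothetical.

```python
import ipaddress

# CIDR blocks quoted on this page. Treat them as unverified examples,
# not as an authoritative list of crawler addresses.
LISTED_RANGES = [
    ipaddress.ip_network("23.99.0.0/18"),
    ipaddress.ip_network("104.40.0.0/13"),
]


def in_listed_ranges(addr: str) -> bool:
    """Return True if addr falls inside any of the listed CIDR blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in LISTED_RANGES)
```

For example, `in_listed_ranges("23.99.10.5")` is true, while `in_listed_ranges("23.99.64.1")` is false because a /18 covers only 23.99.0.0 through 23.99.63.255. Since the page itself notes that addresses may vary, a range match is better treated as one signal among several than as proof of identity.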

📋 robots.txt Compliance

According to OpenAI's official documentation at https://openai.com/bot, Openbot fully honors Disallow directives in robots.txt. If a path is disallowed, the bot requests neither that path nor anything beneath it. The bot also respects Crawl-Delay directives when present, waiting at least the specified number of seconds between requests.
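To see how such rules would be evaluated, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser`. The robots.txt content, the `/private/` path, and the example.com URLs are invented for illustration; the user-agent token `Openbot` is taken from this page. This shows how a standards-following parser reads the file, not how Openbot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Openbot from /private/ and ask for a
# 10-second delay, while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Openbot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The disallowed path and everything under it is off limits for Openbot.
allowed_private = parser.can_fetch("Openbot", "https://example.com/private/report.html")
# Paths outside the disallowed prefix remain fetchable.
allowed_public = parser.can_fetch("Openbot", "https://example.com/blog/post.html")
# Crawl-delay value declared for the Openbot group, in seconds.
delay = parser.crawl_delay("Openbot")
```

Here `allowed_private` is false, `allowed_public` is true, and `delay` is 10. Note that `can_fetch` is given the short token rather than the full User-Agent string, because the standard-library parser matches on the product name before the first slash.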

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Openbot/1.0; +https://openai.com/bot). Additional identifying headers include X-OpenAI-Bot: true and a stable Accept-Language: en-US,en;q=0.9. Behavioral fingerprints include a consistent interval of 100–200 ms between requests and no JavaScript execution: the bot does not parse or run dynamic content.
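The header indicators above could be checked with something like the following sketch. It assumes request headers are available as a plain dictionary; the function name `matched_indicators` and the regular expression are illustrative choices, and the header values are those listed on this page rather than independently confirmed ones.

```python
import re

# Matches the product token "Openbot/<version>" anywhere in a User-Agent.
UA_PATTERN = re.compile(r"\bopenbot/\d+(?:\.\d+)*", re.IGNORECASE)


def matched_indicators(headers: dict) -> list:
    """Return the names of the listed header indicators present in a request."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    hits = []
    if UA_PATTERN.search(h.get("user-agent", "")):
        hits.append("user-agent")
    if h.get("x-openai-bot", "").strip().lower() == "true":
        hits.append("x-openai-bot")
    if h.get("accept-language", "") == "en-US,en;q=0.9":
        hits.append("accept-language")
    return hits
```

A request carrying the full User-Agent string above plus `X-OpenAI-Bot: true` would yield `["user-agent", "x-openai-bot"]`. Any client can set these headers, so a match indicates only that a request claims to be Openbot.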

📊 Data Usage

Data collected by Openbot is used exclusively for training OpenAI’s generative AI models, including GPT‑3.5, GPT‑4, and future iterations. The crawl output is processed into tokenized training corpora, filtered for quality and safety, and then incorporated into model updates. No personal, copyrighted, or paywalled content is intentionally gathered, and publishers can opt out via robots.txt or by contacting OpenAI.

⚙️ Rate Limiting Policy

Rate limiting is applied to Openbot because its sustained request volume can reach hundreds of thousands of requests per day per domain, which may degrade site performance. Blocking a client once it exceeds 50 requests per minute is recommended to prevent resource exhaustion while still allowing legitimate crawler access.
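One way to enforce a threshold like this is a sliding-window counter keyed by client. The sketch below is a minimal in-memory version, assuming a single process and a caller that supplies timestamps in seconds; the class name `SlidingWindowLimiter` is hypothetical, and the sliding-window approach is an illustrative choice rather than something this page prescribes. The 50-per-minute default comes from the recommendation above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key."""

    def __init__(self, limit: int = 50, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        # One queue of accepted-request timestamps per client key.
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the threshold: reject this request
        hits.append(now)
        return True
```

A caller would typically pass the client IP as `key` and `time.monotonic()` as `now`. At the 10–20 requests per second quoted earlier on this page, a client would hit the 50-request ceiling within the first few seconds of each window.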

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.