OpenAI Bot — Detection, Blocking & Technical Analysis

OpenAI

Bot User-Agent: openai

🤖 Overview

OpenAI operates two legitimate web crawlers: GPTBot (announced August 7, 2023) and OAI-SearchBot (launched August 21, 2024). GPTBot collects publicly accessible web content to train and improve OpenAI’s generative AI models, including GPT-4, GPT-4 Turbo, and the upcoming GPT-5, as detailed on platform.openai.com/docs/gptbot. OAI-SearchBot indexes pages specifically to power search features within ChatGPT and other OpenAI products, enabling real-time information retrieval. Both agents are non‑malicious, rate‑limited, and designed to respect webmaster preferences.

🌐 Technical Behavior

GPTBot uses a custom HTTP client with the User-Agent string "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible with GPTBot/1.0; +https://openai.com/robot". Site operators report roughly one request every 2–10 seconds per host. OAI-SearchBot identifies as "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible with OAI-SearchBot/1.0; +https://openai.com/searchbot". Both crawlers originate from OpenAI's published IP ranges, which include 20.115.0.0/16, 23.98.0.0/16, 40.126.0.0/16, and 152.195.0.0/16 (full list at openai.com/robot). Crawl requests use standard HTTP/1.1 over HTTPS, support gzip compression, and do not execute JavaScript. The crawlers follow href links only; they ignore script-generated URLs and _escaped_fragment_ conventions. Reverse DNS lookups for IPs in these ranges consistently resolve to hostnames like prod-gptbot-*.openai.com.
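The IP-range check described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the four CIDR blocks quoted in this section are current; the helper name in_openai_ranges is hypothetical, and a real deployment would refresh the list from OpenAI's published source rather than hard-coding it.

```python
import ipaddress

# CIDR blocks quoted in this section. Treat them as illustrative only and
# refresh them from OpenAI's published list before relying on the result.
OPENAI_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "20.115.0.0/16",
        "23.98.0.0/16",
        "40.126.0.0/16",
        "152.195.0.0/16",
    )
]


def in_openai_ranges(ip: str) -> bool:
    """Return True if `ip` parses and falls inside one of the listed blocks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input (e.g. a hostname or garbage from a log line).
        return False
    return any(addr in net for net in OPENAI_RANGES)
```

For example, in_openai_ranges("20.115.3.4") is true under these assumed ranges, while in_openai_ranges("8.8.8.8") is false.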

📋 robots.txt Compliance

OpenAI explicitly states that both GPTBot and OAI-SearchBot honor robots.txt Disallow directives. Official documentation on openai.com/robot advises webmasters to use standard robots.txt rules, including Crawl-Delay and wildcards, to control crawling. OAI-SearchBot additionally supports the noindex meta tag and the X-Robots-Tag HTTP header for granular per-page opt-out. Community tests reportedly bear this out: the crawlers stop short of fetching disallowed resources.
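To see how such rules would be evaluated, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the helper name build_parser are assumptions made up for illustration; the sample blocks GPTBot entirely and restricts OAI-SearchBot to everything outside /private/.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot site-wide, keep OAI-SearchBot out of
# /private/ and ask it for 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
gptbot_ok = parser.can_fetch("GPTBot", "https://example.com/article")
search_ok = parser.can_fetch("OAI-SearchBot", "https://example.com/article")
search_private_ok = parser.can_fetch("OAI-SearchBot", "https://example.com/private/x")
search_delay = parser.crawl_delay("OAI-SearchBot")
```

With this sample, gptbot_ok is False, search_ok is True, search_private_ok is False, and search_delay is 10. The standard-library parser does plain prefix matching, so wildcard rules are left out of the sample.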

🔍 Detection Indicators

The primary detection method is a User-Agent string that contains either "GPTBot/1.0" or "OAI-SearchBot/1.0" after the Mozilla/5.0 preamble. Because User-Agent strings are trivially spoofed, a reverse DNS lookup on the connecting IP, together with a check against the OpenAI ranges, is what confirms legitimacy. Behavioral fingerprints include a steady request pace (no sudden bursts), absence of HTTP referrers, and a consistent Accept-Encoding: gzip header. Neither crawler sets custom HTTP headers beyond the standard ones. Server logs showing only one or two requests per minute from a single IP are typical.
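The User-Agent test can be sketched as a small classifier. This is an illustrative example, not an official matcher: the function name identify_openai_crawler is hypothetical, and the version number is matched loosely on the assumption that it may change over time.

```python
import re
from typing import Optional

# Product tokens named in this section; the version suffix is matched loosely.
_OPENAI_TOKEN = re.compile(r"\b(GPTBot|OAI-SearchBot)/\d+(?:\.\d+)*", re.IGNORECASE)

# Map lower-cased matches back to canonical crawler names.
_CANONICAL = {"gptbot": "GPTBot", "oai-searchbot": "OAI-SearchBot"}


def identify_openai_crawler(user_agent: str) -> Optional[str]:
    """Return the crawler name if the User-Agent claims to be one, else None."""
    # Scraped or copy-pasted strings sometimes carry U+2011 (non-breaking
    # hyphen) instead of an ASCII hyphen; normalise before matching.
    ua = user_agent.replace("\u2011", "-")
    if not ua.startswith("Mozilla/5.0"):
        return None
    match = _OPENAI_TOKEN.search(ua)
    if match is None:
        return None
    return _CANONICAL[match.group(1).lower()]
```

A match only means the client claims to be an OpenAI crawler, so the result is best treated as a first filter ahead of the IP and reverse DNS checks.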

📊 Data Usage

GPTBot‑collected data is used to train and refine OpenAI’s generative AI models, as described in the company’s privacy policy and terms of use (available at openai.com/policies). OAI‑SearchBot data feeds real‑time search indexing for ChatGPT’s browsing feature, the ChatGPT Search prototype, and other OpenAI services that require up‑to‑date web content. OpenAI does not sell the crawled data and asserts that pages blocked via robots.txt or meta tags are excluded from both training and indexing.

⚙️ Rate Limiting Policy

Rate limiting is recommended because these crawlers, while legitimate, can generate enough traffic to degrade server performance if left unchecked. A threshold-based block (e.g., 10 requests per minute per IP) preserves site stability while still allowing the crawling needed for AI training and search augmentation. OpenAI itself advises using Crawl-Delay: 10 in robots.txt to signal a preferred request spacing.
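One way to express the 10-requests-per-minute threshold is a sliding-window counter keyed by client IP. The sketch below is a minimal in-memory illustration; the class name SlidingWindowLimiter and its parameters are hypothetical, and a production setup would more likely enforce the limit at the web server or CDN layer.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(
        self,
        max_requests: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and report whether it is within limits."""
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: caller should block or delay
        hits.append(now)
        return True
```

A request handler would call limiter.allow(client_ip) and respond with HTTP 429 when it returns False. Rejected requests are not recorded, so a client that keeps retrying is not penalised beyond the window itself.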


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
