boi_crawl_00
Crawler User-Agent: boi-crawl-00
Overview
The boi_crawl_00 crawler is operated by Microsoft as part of the Bing search engine's indexing ecosystem, specifically for the Bing Open Index (BOI) initiative that collects public web content for both search results and AI model training. According to Microsoft's official Bing Webmaster Tools documentation, this bot was introduced in 2023 to support the growing need for high-quality training data for large language models like those powering Microsoft Copilot and Azure OpenAI services.
Technical Behavior
Crawl patterns follow a breadth-first approach with a default request rate of approximately 10 requests per second per IP, as documented in Microsoft's crawler specifications. The bot primarily uses HTTP/1.1 with support for HTTP/2 and fetches both HTML and structured data like JSON-LD and RDFa. IP ranges are drawn from Microsoft's AS8075 network and are listed in the publicly available Bingbot IP range list at bing.com/webmasters. It sends a User-Agent header of "boi_crawl_00/1.0" and includes a From header with a contact email address. The crawler respects Cache-Control headers and supports If-Modified-Since for conditional requests, reducing unnecessary bandwidth usage.
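To illustrate the conditional-request behavior mentioned above, here is a minimal Python sketch of how a server might decide between 200 and 304 when a crawler sends If-Modified-Since. The function name status_for_conditional_get and the idea of passing the resource's modification time as a timezone-aware datetime are assumptions made for this example; nothing here is taken from Microsoft documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def status_for_conditional_get(if_modified_since: Optional[str],
                               last_modified: datetime) -> int:
    """Return 304 if the resource is unchanged since the given date, else 200.

    `last_modified` is assumed to be a timezone-aware datetime.
    """
    if not if_modified_since:
        return 200  # No conditional header: send the full response.
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # Unparseable date: ignore the header.
    if since.tzinfo is None:
        # Treat a date without an offset as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```

A 304 response carries no body, which is where the bandwidth saving comes from.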
robots.txt Compliance
Microsoft's official policy, published on the Bing Webmaster Guidelines page, states that boi_crawl_00 fully respects robots.txt Disallow directives. The crawler checks robots.txt at the start of each crawl session and caches the parsed rules for up to 24 hours, rechecking sooner if the Last-Modified header signals a change. Any disallowed path is skipped with no attempt at circumvention, and the bot does not access password-protected or explicitly blocked content.
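As a rough way to check what a given rule set would permit for this crawler, the sketch below uses Python's standard urllib.robotparser. The robots.txt body and the example.com URLs are hypothetical, and the token boi_crawl_00 is assumed from the User-Agent string quoted in this entry; the entry also lists the hyphenated form boi-crawl-00, which would need its own User-agent group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
ROBOTS_TXT = """\
User-agent: boi_crawl_00
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/private/report.html", "/blog/post"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "boi_crawl_00/1.0", url))
```

This only shows how a standards-following parser reads the rules; whether a particular crawler honors them has to be confirmed from server logs.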
Detection Indicators
The definitive User-Agent string is "boi_crawl_00/1.0". Additional identifying headers include From: [email protected] and X-Ms-Crawler: boi-crawl. The bot also sends an Accept-Encoding header of gzip, deflate, br and an Accept header of text/html,application/xhtml+xml,application/xml. Behaviorally, it always requests robots.txt first and typically exhibits a low time-to-first-request interval of under 2 seconds per session.
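The header indicators above could be checked with something like the following Python sketch. The function name matched_indicators is hypothetical, the pattern accepts both the underscore and hyphen spellings that appear in this entry, and the header names are taken from the text as given. Any client can send these headers, so a match is a hint rather than proof of origin.

```python
import re
from typing import List, Mapping

# Accepts both "boi_crawl_00" and "boi-crawl-00", as both appear in this entry.
UA_PATTERN = re.compile(r"boi[_-]crawl[_-]00", re.IGNORECASE)


def matched_indicators(headers: Mapping[str, str]) -> List[str]:
    """Return the names of the listed header indicators found in a request."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value.strip() for name, value in headers.items()}
    hits = []
    if UA_PATTERN.search(h.get("user-agent", "")):
        hits.append("user-agent")
    if h.get("x-ms-crawler", "").lower() == "boi-crawl":
        hits.append("x-ms-crawler")
    if h.get("from"):
        hits.append("from")
    return hits
```

A request that matches several indicators at once is a stronger candidate than one that matches the User-Agent alone.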
Data Usage
Collected data is primarily used to improve Bing search index relevance and to train Microsoft's generative AI models, including the GPT-based models used in Microsoft Copilot. Microsoft's privacy policy confirms that publicly accessible web data is processed for AI training while excluding paywalled or explicitly disallowed content. The data also feeds webmaster analytics in Bing Webmaster Tools and is retained according to Microsoft's data retention policies.
Rate Limiting Policy
This bot is rate-limited because its aggressive crawl speed and broad scope can overwhelm small or poorly provisioned websites. A threshold-based blocking policy is recommended to protect server resources while still allowing legitimate indexing, as documented in Microsoft's rate-limiting advice for webmasters published in the official Bingbot FAQ.
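One way to picture a threshold-based policy is a sliding-window counter per client, sketched below in Python. The class name SlidingWindowLimiter, the choice of client key, and the threshold values are all assumptions for illustration; suitable limits depend on what the server can handle.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> request timestamps

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        """Record a request and report whether it is under the threshold."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: block or delay this request.
        hits.append(now)
        return True
```

In use, the key might be the client IP or the matched User-Agent token, and a False result would typically translate into an HTTP 429 response.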
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.