mcbot

Bot User-Agent: mcbot

🤖 Overview

mcbot is a web crawler operated by Microsoft Corporation and first observed in early 2024. It collects publicly accessible web content to train and improve the Microsoft Copilot AI assistant (formerly Bing Chat). Microsoft’s Bing Webmaster Guidelines (https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0) acknowledge the bot as part of the “Microsoft Copilot” family, distinct from the traditional Bing crawler, bingbot. Its primary purpose is to gather fresh, high-quality text and structured data for fine-tuning the generative models behind Copilot’s conversational answers, code generation, and reasoning capabilities.

🌐 Technical Behavior

mcbot issues HTTP GET requests over IPv4 and IPv6 at a crawl frequency that Microsoft describes as “aggressive but fair,” typically one request every 10–15 seconds per domain during initial site discovery. Official IP ranges are published in Microsoft’s Azure IP Ranges and Service Tags (https://www.microsoft.com/en-us/download/details.aspx?id=56519); crawl traffic originates from subnets such as 13.80.0.0/15 and 20.190.128.0/18. The bot supports HTTP/1.1 and HTTP/2, prefers HTTPS connections, and sends a User-Agent string that includes “Microsoft Copilot” in the comment field. It does not execute JavaScript or parse iframes; it fetches only static HTML and linked resources (e.g., RSS feeds, sitemaps). Microsoft’s official blog (https://blogs.bing.com) states that mcbot’s crawl depth typically reaches three levels from the entry page, with a limit of 1,000 pages per day per domain under normal conditions.
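
As a rough illustration of checking a request's source address against the subnets quoted above, the following Python sketch uses the standard ipaddress module. The two subnets are only the examples named in this section, not a complete list, and the function name in_listed_subnets is hypothetical; a real deployment would load the current ranges from Microsoft's published file.

```python
import ipaddress

# Example subnets quoted in the text above. Assumption: this is NOT the full
# list; the authoritative ranges are in Microsoft's published IP ranges file.
EXAMPLE_SUBNETS = [
    ipaddress.ip_network("13.80.0.0/15"),
    ipaddress.ip_network("20.190.128.0/18"),
]


def in_listed_subnets(remote_addr, subnets=EXAMPLE_SUBNETS):
    """Return True if remote_addr falls inside any of the given subnets."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IPv4/IPv6 literal, so it cannot match any subnet.
        return False
    # Membership is False automatically when IP versions differ (v4 vs v6).
    return any(address in network for network in subnets)
```

For example, in_listed_subnets("13.81.4.2") is true because 13.80.0.0/15 spans 13.80.0.0 through 13.81.255.255, while an address such as 8.8.8.8 is not matched.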

📋 robots.txt Compliance

Microsoft states on its Webmaster Help page that mcbot respects Disallow directives in robots.txt and follows the same Crawl-Delay rules as bingbot. Independent testing by Cloudflare (https://blog.cloudflare.com/bot-management) indicates that mcbot reads robots.txt before each crawl session and caches it for 24 hours. To control mcbot specifically, Microsoft recommends using the token “Microsoft Copilot” in robots.txt, because the generic “msnbot” or “bingbot” tokens may not apply.
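
To make the token usage concrete, here is a minimal sketch of a robots.txt group addressed to the “Microsoft Copilot” token, checked locally with Python's standard urllib.robotparser. The /private/ path, the example.com URLs, and the helper name build_parser are assumptions chosen for illustration, and the check only shows how a standards-following parser reads the file, not how mcbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the token comes from the text above; the
# /private/ path and the 5-second delay are illustrative choices.
ROBOTS_TXT = """\
User-agent: Microsoft Copilot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Allow: /
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The named group blocks /private/ for the Copilot token only.
blocked = parser.can_fetch("Microsoft Copilot", "https://example.com/private/page")
allowed = parser.can_fetch("Microsoft Copilot", "https://example.com/blog/")
delay = parser.crawl_delay("Microsoft Copilot")
```

Note that the token, not the full User-Agent string, is passed to can_fetch, since the parser matches on the product token.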

🔍 Detection Indicators

The most reliable identification string is User-Agent: Mozilla/5.0 (compatible; Microsoft Copilot; +https://copilot.microsoft.com), often accompanied by the header X-Microsoft-Crawler: True. mcbot also sets the Accept-Encoding header to gzip, deflate and includes a From header with a no-reply email address (e.g., [email protected]). Behavioral fingerprints include a strict 5-second interval between consecutive requests to the same host (compared with the 10–15 seconds cited for initial site discovery) and a fixed User-Agent string with no version variation.
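
The header indicators above could be checked with something like the following Python sketch. The function name mcbot_signals and the returned keys are hypothetical, and because request headers are trivially spoofed, a match should be treated as a hint to combine with source IP checks rather than as proof.

```python
# Marker taken from the User-Agent string quoted above, lowercased.
UA_MARKER = "microsoft copilot"


def mcbot_signals(headers):
    """Report which of the header indicators listed above are present.

    `headers` is a plain dict of request headers; names are matched
    case-insensitively, as HTTP header names are case-insensitive.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    return {
        "user_agent": UA_MARKER in normalized.get("user-agent", "").lower(),
        "crawler_header": normalized.get("x-microsoft-crawler", "").lower() == "true",
        "from_header": bool(normalized.get("from", "").strip()),
    }


example = mcbot_signals({
    "User-Agent": "Mozilla/5.0 (compatible; Microsoft Copilot; +https://copilot.microsoft.com)",
    "X-Microsoft-Crawler": "True",
    "Accept-Encoding": "gzip, deflate",
})
```

In this example the user_agent and crawler_header signals are set, while from_header is not, because no From header was supplied.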

📊 Data Usage

All data collected by mcbot is ingested into Microsoft’s proprietary Copilot AI training pipeline, as described in Microsoft’s Transparency Report (https://www.microsoft.com/en-us/ai/responsible-ai). The crawler feeds text snippets to automated classifiers that label content for relevance, safety, and factual accuracy. Microsoft asserts that the data is used solely to improve the Copilot model’s responses and is not used for Bing search indexing or advertising.

⚙️ Rate Limiting Policy

Because mcbot can generate sustained, concurrent requests across many Azure IPs, many web administrators throttle it (e.g., rejecting traffic above 20 requests per minute) to prevent server overload. The goal is to protect origin infrastructure while still allowing the legitimate AI training crawl; Microsoft itself recommends a default Crawl-Delay: 5 in robots.txt to align with the bot’s intended rate.
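
A throttling rule of the kind described above can be sketched as a sliding-window limiter. The class name SlidingWindowLimiter, the client_key argument, and the choice to pass timestamps in explicitly are all assumptions made for illustration; production setups would more likely rely on the rate-limiting features of their web server or CDN.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests=20, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_key -> accepted timestamps

    def allow(self, client_key, now):
        """Return True and record the hit if the client is under the limit.

        `now` is a timestamp in seconds (e.g. from time.monotonic()).
        """
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: caller may respond with HTTP 429
        hits.append(now)
        return True
```

Because the crawl is spread across many IPs, keying the limiter on the User-Agent token rather than the source address may be the more useful choice. A client that keeps 5-second spacing makes about 12 requests per minute and would stay under a 20-per-minute threshold.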

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.