sitebot

Bot User-Agent: sitebot

🤖 Overview

SiteBot is a legitimate web crawler operated by Moz (formerly SEOmoz), a leading provider of SEO software and link analytics. Its primary purpose is to systematically discover and index publicly accessible web pages to build the company’s proprietary Mozscape index, which powers tools such as Link Explorer, Domain Authority, and Spam Score. First announced publicly in 2012, SiteBot is designed strictly for benign data collection and is not associated with any malicious activity. Moz explicitly documents the crawler’s behavior on their official SiteBot documentation page at https://moz.com/help/guides/search/crawling-and-indexing/sitebot, which confirms its legitimate status and compliance with standard web protocols.

🌐 Technical Behavior

SiteBot crawls the web using a parallelized, queue-based architecture that processes multiple pages concurrently while respecting server load. According to Moz’s engineering blog, the crawler emits HTTP GET requests with a configurable delay between requests, typically ranging from 1 to 10 seconds per domain, to avoid overwhelming target servers. The IP addresses used by SiteBot are drawn from Moz’s owned ASN (AS39496) and are publicly listed in their IP range documentation; common IPv4 ranges include 216.252.164.0/24 and 50.116.0.0/16. The crawler supports HTTPS and follows redirects up to a depth of 5 hops. It also parses XML sitemaps and robots.txt files to prioritize crawl paths, and it respects HTTP 429 Too Many Requests responses by backing off exponentially. Moz states that SiteBot does not execute JavaScript or render pages, relying solely on raw HTML and links.
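The exponential backoff described above can be illustrated with a short sketch. This is not Moz's implementation: the function name `backoff_delays` and the base delay, cap, and retry count are assumptions chosen only to show how the wait time grows after repeated HTTP 429 responses.

```python
def backoff_delays(base_delay: float = 1.0,
                   max_delay: float = 60.0,
                   retries: int = 5) -> list[float]:
    """Return the wait in seconds before each retry after a 429.

    The delay doubles on every attempt and is capped at max_delay.
    All three parameters are illustrative assumptions, not documented values.
    """
    return [min(base_delay * (2 ** attempt), max_delay)
            for attempt in range(retries)]


if __name__ == "__main__":
    # Starting from a 1-second per-domain delay, waits double on each retry.
    print(backoff_delays())
```

A server operator can use the same reasoning in reverse: a client that keeps its request rate constant after repeated 429 responses is not backing off in the way described here.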

📋 robots.txt Compliance

Moz’s official documentation explicitly states that SiteBot honors the robots.txt exclusion standard without exception. It reads and caches the robots.txt file for each domain at the start of a crawl session, and it will not request any page covered by a Disallow directive. SiteBot identifies itself with the User-Agent string Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.moz.com/sitebot), and webmasters can block it entirely by adding User-agent: SiteBot followed by Disallow: / to their robots.txt. There are no documented incidents of SiteBot ignoring robots.txt directives; Moz treats compliance as a core operational policy.
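As a minimal sketch, the blocking rule above can be checked locally with Python's standard-library `urllib.robotparser` before it is deployed. The sample robots.txt content, the `/private/` rule, the `example.com` URLs, and the helper name `is_allowed` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: block SiteBot everywhere, keep other crawlers
# out of a hypothetical /private/ area only.
ROBOTS_TXT = """\
User-agent: SiteBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass the product token ("SiteBot"), not the full User-Agent string:
    # the parser compares only the part before the first "/".
    print(is_allowed("SiteBot", "https://example.com/page"))
    print(is_allowed("OtherBot", "https://example.com/page"))
```

This only confirms how a standards-following parser reads the file; it says nothing about what any particular crawler does in practice.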

🔍 Detection Indicators

The primary detection indicator is the User-Agent string: Mozilla/5.0 (compatible; SiteBot/0.1; +http://www.moz.com/sitebot). Additionally, SiteBot sends a From HTTP header containing an administrative contact email address. The crawler’s requests typically originate from IP addresses in the ranges 216.252.164.0/24 and 50.116.0.0/16 and follow a consistent pattern of sequential page fetches with a delay of 5 to 10 seconds between requests. Reverse DNS lookups on these IPs often resolve to hostnames ending in .moz.com. Behavioral fingerprints include the absence of a Referer header and an Accept-Language header value of en-US,en;q=0.5.
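The indicators above can be combined into a simple check. The sketch below uses only the User-Agent token, IP ranges, and header fingerprints quoted in this section; the function names and the decision rule (User-Agent and IP range must both match) are assumptions, not a documented verification procedure.

```python
import ipaddress

# IPv4 ranges quoted in this section.
SITEBOT_NETWORKS = [
    ipaddress.ip_network("216.252.164.0/24"),
    ipaddress.ip_network("50.116.0.0/16"),
]


def sitebot_signals(headers: dict[str, str], remote_ip: str) -> dict[str, bool]:
    """Evaluate each detection indicator separately for one request."""
    normalized = {name.lower(): value for name, value in headers.items()}
    address = ipaddress.ip_address(remote_ip)
    return {
        "user_agent": "sitebot/" in normalized.get("user-agent", "").lower(),
        "ip_range": any(address in network for network in SITEBOT_NETWORKS),
        "from_header": "from" in normalized,
        "no_referer": "referer" not in normalized,
    }


def looks_like_sitebot(headers: dict[str, str], remote_ip: str) -> bool:
    """Assumed rule: require both the User-Agent token and a listed IP range,
    since a User-Agent string alone is trivial to spoof."""
    signals = sitebot_signals(headers, remote_ip)
    return signals["user_agent"] and signals["ip_range"]
```

For example, a request carrying the SiteBot User-Agent from 216.252.164.10 passes this check, while the same User-Agent from an address outside both ranges does not and may deserve a closer look.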

📊 Data Usage

All data collected by SiteBot is exclusively used to populate the Mozscape index, which stores over 40 trillion links and 700 million pages as of 2024. This index feeds Moz’s suite of SEO tools: Link Explorer provides backlink analysis, Domain Authority scores predict search ranking potential, and Spam Score flags low-quality sites. Moz explicitly states that the data is not used for AI model training or resold to third parties; it is solely employed for on-demand SEO research and competitive analysis within their subscription service. The index is updated approximately every 30–60 days, with each update replacing older snapshots.

⚙️ Rate Limiting Policy

While SiteBot is a legitimate, well-behaved crawler, administrators are strongly advised to rate-limit its requests to prevent unintended resource exhaustion on high-traffic servers. Moz itself recommends a threshold of 100 requests per minute per IP and suggests returning HTTP 429 responses if the crawler exceeds that volume. This policy is a standard precaution because even compliant crawlers can accidentally degrade performance on shared or low-capacity hosting environments; threshold-based blocking ensures fairness to other visitors without permanently blocking a beneficial agent.
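One way to apply the threshold above is a per-IP sliding-window counter. This is a minimal in-memory sketch under stated assumptions: the class name `SlidingWindowLimiter` and its method are hypothetical, state is not shared across processes, and a production setup would more likely rely on the web server's or CDN's built-in rate limiting.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client IP in any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def status_for(self, client_ip: str, now: Optional[float] = None) -> int:
        """Return 200 if the request is within the limit, otherwise 429."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429  # Too Many Requests; rejected requests are not recorded
        hits.append(now)
        return 200
```

With the defaults, the 101st request from one IP inside 60 seconds receives a 429, and requests are accepted again once the oldest recorded hits fall out of the window.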

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.