snoopy

Bot User-Agent: snoopy

🤖 Overview

Snoopy is a web crawler operated by the University of Michigan’s Internet Performance and Analysis (IPA) research group within the Department of Computer Science and Engineering. Its primary mission is to conduct large-scale, longitudinal measurements of internet security, performance, and censorship for academic research, as documented at snoopy.cs.umich.edu. Unlike commercial crawlers, Snoopy is not used for AI training or search indexing, but exclusively for scholarly analysis of global internet properties.

🌐 Technical Behavior

Snoopy performs periodic scans of the entire IPv4 address space and of DNS domains using HTTP and HTTPS requests. According to the official website, the crawler runs on a distributed cluster of nodes within the University of Michigan’s autonomous system (AS3634), with IP ranges including 141.211.0.0/16 and 35.231.0.0/16. Request frequency is deliberately low, typically a few requests per second per target, and the crawler uses randomized delays to reduce impact. It supports HTTP/1.1, HTTP/2, and HTTPS with SNI, and may probe common paths such as /robots.txt, /index.html, and /.well-known/. The crawler also tests for open proxies, misconfigured TLS certificates, and firewall evasion techniques, as described in the group’s peer-reviewed publications.
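
As a rough illustration, a log-analysis script could test whether a client address falls inside the ranges quoted above. This is a minimal sketch using Python's standard `ipaddress` module. The two CIDR blocks are copied from this page and have not been independently verified, and the name `in_listed_snoopy_range` is hypothetical.

```python
import ipaddress

# CIDR blocks as listed on this page (assumption: not independently verified).
LISTED_RANGES = [
    ipaddress.ip_network("141.211.0.0/16"),
    ipaddress.ip_network("35.231.0.0/16"),
]


def in_listed_snoopy_range(client_ip: str) -> bool:
    """Return True if client_ip parses as an IP inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Not a valid IPv4/IPv6 literal, so it cannot match.
        return False
    # An IPv6 address is never "in" an IPv4 network, so this is safe for both.
    return any(addr in net for net in LISTED_RANGES)
```

A source address inside these ranges is only one signal. It is best combined with the other indicators on this page before treating traffic as Snoopy.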

📋 robots.txt Compliance

Snoopy strictly honors robots.txt directives. The official website states that it checks the file before each request and fully respects Disallow rules. Additionally, the project provides a contact email address and an opt-out form on its website for site owners who wish to block scanning entirely, even beyond robots.txt.
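
To see what a blocking rule might look like, the sketch below feeds a two-line robots.txt to Python's standard `urllib.robotparser` and checks the result. It assumes the crawler matches on the token `snoopy` (taken from the Bot User-Agent line at the top of this page); the helper name `is_allowed` and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows everything for the "snoopy" token.
ROBOTS_TXT = """\
User-agent: snoopy
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With this file, `is_allowed(ROBOTS_TXT, "snoopy", "https://example.com/index.html")` is false, while an agent with a different token is still allowed because no rule applies to it.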

🔍 Detection Indicators

The primary User-Agent string is "Mozilla/5.0 (compatible; Snoopy/1.0; +https://snoopy.cs.umich.edu/)", with version variations such as Snoopy/2.0 and Snoopy/3.0. Behavioral fingerprints include sequential scans of contiguous IP blocks, low request rates (often below 1 request per second), and a custom X-Snoopy header that is sometimes present. The crawler may also send unusual Accept-Language or Accept-Encoding values to probe server behavior. Reverse DNS lookups on source addresses often resolve to snoopy.cs.umich.edu or ipa.umich.edu.
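
Two of these indicators can be checked with a few lines of Python. The following is a sketch under stated assumptions, not a definitive detector: the regular expression is derived only from the User-Agent strings quoted above, matching subdomains of the two listed hostnames is an assumption, and the names `snoopy_version` and `ptr_matches` are hypothetical.

```python
import re
from typing import Optional

# Matches the "Snoopy/<version>" product token seen in the quoted User-Agent.
SNOOPY_UA = re.compile(r"\bSnoopy/(\d+(?:\.\d+)*)")

# Reverse-DNS names listed on this page (assumption: subdomains also count).
LISTED_PTR_SUFFIXES = ("snoopy.cs.umich.edu", "ipa.umich.edu")


def snoopy_version(user_agent: str) -> Optional[str]:
    """Return the claimed Snoopy version, or None if the token is absent."""
    match = SNOOPY_UA.search(user_agent)
    return match.group(1) if match else None


def ptr_matches(hostname: str) -> bool:
    """Return True if hostname equals, or is a subdomain of, a listed name."""
    host = hostname.rstrip(".").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in LISTED_PTR_SUFFIXES
    )
```

The hostname passed to `ptr_matches` could come from `socket.gethostbyaddr`; the lookup is left out so the sketch runs without network access. Because any client can send an arbitrary User-Agent header, a version match alone does not prove that a request came from Snoopy.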

📊 Data Usage

Collected data is used exclusively for academic research on internet security and performance—such as mapping HTTPS deployment, identifying vulnerable servers, and analyzing censorship techniques. Results are published in top-tier venues like ACM IMC, USENIX Security, and IEEE S&P. No commercial use, AI training, or advertising revenue is involved. The group also releases anonymized datasets to the research community.

⚙️ Rate Limiting Policy

Snoopy is rate-limited because its scanning, though legitimate, can still generate significant load on small or poorly configured servers. The recommended policy is threshold-based blocking (e.g., blocking an IP address that sends more than 100 requests per minute), which preserves the crawler’s research utility while protecting web applications from inadvertent denial of service.
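
One way to implement such a threshold is a per-IP sliding window. The sketch below is a minimal in-memory version assuming the 100 requests per minute example above; the class name `SlidingWindowLimiter` and its parameters are hypothetical, and the caller supplies the timestamp (for example from `time.monotonic()`) so the logic is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` and report whether it is allowed."""
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: block this request.
        hits.append(now)
        return True
```

Blocked requests are not recorded, so a client regains access as soon as its earlier requests age out of the window. A production deployment would also need to evict idle IPs and share state across servers, which this sketch does not attempt.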

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.