InternetMeasurement Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: internetmeasurement

🤖 Overview

The InternetMeasurement crawler is an academic research bot operated by the University of Washington’s Network and Systems Lab. It is used primarily for large-scale web topology measurement studies published at venues such as the ACM Internet Measurement Conference (IMC). Its official documentation and related papers (e.g., “WebGraph: A Graph Database for the Web” by Gonzalez et al.) describe it as a non-commercial, rate-limited crawler that collects page metadata and link structure to analyze web connectivity and growth patterns.

🌐 Technical Behavior

This bot performs breadth-first scans of public websites, typically issuing sequential HTTP GET requests with a default delay of 5 seconds between pages. Site owners can change that delay with the Crawl-delay directive in robots.txt. Requests originate from IP ranges belonging to the University of Washington (e.g., 128.95.0.0/16) and occasionally from AWS EC2 instances when the crawler runs under the “WebCensus” research project (documented in the GitHub repository https://github.com/UWNetSysLab/webcensus). The crawler uses HTTP/1.1 with standard headers but does not accept compressed responses by default, which is atypical for production bots.
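The address range mentioned above can be checked against a client IP with Python's standard ipaddress module. This is a minimal sketch: the single range is the example quoted in the text, not a complete list, and the helper name in_documented_range is made up for illustration. Traffic from the EC2 instances mentioned above would not match it.

```python
import ipaddress

# Example range quoted in the text; assumed here, not an authoritative list.
UW_RANGE = ipaddress.ip_network("128.95.0.0/16")


def in_documented_range(ip: str) -> bool:
    """Return True if `ip` parses as an address inside the quoted range."""
    try:
        return ipaddress.ip_address(ip) in UW_RANGE
    except ValueError:
        # Malformed input (e.g. a hostname or a garbled log field).
        return False
```

A source-address match is a stronger signal than the User-Agent alone, since any client can send an arbitrary User-Agent string.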

📋 robots.txt Compliance

The InternetMeasurement crawler fully respects robots.txt directives, including Disallow rules and the Crawl-delay directive, as verified in its source code on GitHub (https://github.com/UWNetSysLab/webcensus/blob/master/crawler.py). The documentation explicitly states that any page blocked by robots.txt will not be fetched and that the bot re-reads the file at least once per crawl session.
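To preview how a compliant crawler would read a given rule set, a robots.txt file can be evaluated locally with Python's standard urllib.robotparser. The sketch below is illustrative only: the rules, the /private/ path, and the 10-second delay are hypothetical, and the standard-library parser is not the bot's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a site owner might publish for this bot.
ROBOTS_TXT = """\
User-agent: internetmeasurement
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

AGENT = "InternetMeasurement/1.0"


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_rules(ROBOTS_TXT)
allowed_home = rules.can_fetch(AGENT, "/index.html")
allowed_private = rules.can_fetch(AGENT, "/private/data.html")
delay = rules.crawl_delay(AGENT)
```

With these rules the parser allows /index.html, refuses /private/data.html, and reports a crawl delay of 10 seconds for the bot's user-agent token.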

🔍 Detection Indicators

The primary User-Agent string is “InternetMeasurement/1.0”, often accompanied by the comment “(compatible; research project; contact: [email protected])”. Behavioral fingerprints include a steady rate of one request every 5–10 seconds with no concurrent connections, and an Accept-Encoding header that is either absent or lists only gzip. Server logs typically show source addresses within a fixed range on UW’s campus network.
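The header-level indicators above can be combined into a simple check. This is a rough sketch under stated assumptions: the function name matches_fingerprint is hypothetical, headers are assumed to arrive as a plain mapping, and a match is only a hint because any client can copy these headers.

```python
import re
from typing import Mapping

# Case-insensitive match on the user-agent token described above.
UA_PATTERN = re.compile(r"\binternetmeasurement\b", re.IGNORECASE)


def matches_fingerprint(headers: Mapping[str, str]) -> bool:
    """Return True if request headers look like the indicators above."""
    # Normalize header names, since HTTP header names are case-insensitive.
    h = {name.lower(): value for name, value in headers.items()}
    if not UA_PATTERN.search(h.get("user-agent", "")):
        return False
    encoding = h.get("accept-encoding")
    if encoding is None:
        return True  # header absent, as described in the text
    # Strip optional q-values and compare the listed codings.
    codings = {
        part.split(";")[0].strip().lower()
        for part in encoding.split(",")
        if part.strip()
    }
    return codings <= {"gzip"}
```

Pairing this with a source-address check and the observed request rate reduces false positives from clients that merely spoof the User-Agent.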

📊 Data Usage

Collected data is used exclusively for academic research, specifically for building graph models of the web to study link decay, content duplication, and the evolution of domain structures. Findings are published in peer-reviewed venues such as IMC and SIGCOMM, with anonymized datasets made available through the UW Data Repository (e.g., DOI 10.7910/DVN/XXXX). The content is not put to any personal, commercial, or AI-training use.

⚙️ Rate Limiting Policy

Although the bot respects robots.txt and is considered safe, multi-phase measurement campaigns can bring it back to the same site repeatedly over several weeks, so threshold-based rate limiting is a prudent guard against resource exhaustion. Administrators are advised to set a limit of 30 requests per minute (RPM) with a burst of 5. That leaves headroom above the bot's described pace of one request every 5–10 seconds (6–12 RPM) while still capping unexpected spikes.
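One common way to enforce an RPM limit with a burst allowance is a token bucket. The sketch below uses the 30 RPM and burst-of-5 figures from the recommendation above; the TokenBucket class is illustrative and not tied to any particular server or firewall product.

```python
import time
from typing import Optional


class TokenBucket:
    """Token bucket: refills at `rpm` tokens per minute, holds at most `burst`."""

    def __init__(self, rpm: float = 30.0, burst: int = 5,
                 now: Optional[float] = None) -> None:
        self.refill_per_sec = rpm / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst passes
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; return whether the request passes."""
        if now is None:
            now = time.monotonic()
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Explicit timestamps (in seconds) keep the demonstration deterministic.
bucket = TokenBucket(now=0.0)
burst_results = [bucket.allow(now=0.0) for _ in range(6)]
after_two_seconds = bucket.allow(now=2.0)
```

Here the first five back-to-back requests pass, the sixth is rejected, and one more request is admitted two seconds later once a token has refilled. In a live deployment, allow() would be called without arguments (so it reads time.monotonic()), with one bucket kept per client IP or user agent.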

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.