webwalk

Bot User-Agent: webwalk

🤖 Overview

WebWalk is a web crawler operated by the Common Crawl Foundation, the non-profit organization behind the openly available Common Crawl dataset. First deployed in 2015, it collects publicly accessible web pages for the Common Crawl corpus, a free archive that researchers, AI developers, and companies use to train language models and run web-scale analysis. The WebWalk project (GitHub: commoncrawl/webwalk) also uses the crawler for experimental crawl jobs and for the monthly refresh of the dataset.

🌐 Technical Behavior

WebWalk is a custom asynchronous HTTP crawler written in Python (using aiohttp). It limits itself to approximately 10 requests per second per IP, though bursts can reach 20 req/s when responses arrive with low latency. It crawls primarily over HTTP/1.1 and HTTP/2, with a default concurrency of 8 parallel connections per domain. Requests originate from two fixed subnets announced by the Common Crawl project: 54.211.0.0/16 and 52.70.0.0/15 (AWS us-east-1). Every request carries a User-Agent header and a From header containing the email address [email protected]. Crawls are scheduled through a Docker-based pipeline orchestrated by Apache Airflow, and each crawl segment is seeded from a fresh list of high-authority domains taken from the previous month's index.
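
As an illustration of how a site owner might test a source address against the two subnets listed above, here is a minimal sketch using Python's standard ipaddress module. The ranges are copied from this profile and are not independently verified; the names WEBWALK_NETWORKS and in_listed_ranges are hypothetical.

```python
import ipaddress

# Ranges as listed in this profile; treat them as unverified assumptions.
WEBWALK_NETWORKS = [
    ipaddress.ip_network("54.211.0.0/16"),
    ipaddress.ip_network("52.70.0.0/15"),
]


def in_listed_ranges(addr: str) -> bool:
    """Return True if addr parses as an IP inside one of the listed ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IPv4/IPv6 literal, so it cannot match.
        return False
    return any(ip in net for net in WEBWALK_NETWORKS)


if __name__ == "__main__":
    for candidate in ("54.211.10.5", "52.71.255.255", "52.72.0.0"):
        print(candidate, in_listed_ranges(candidate))
```

Because these ranges sit in shared AWS address space, a match only shows that the request came from that part of AWS. It is best treated as one signal among several rather than proof of identity.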

📋 robots.txt Compliance

WebWalk fully respects robots.txt directives, as mandated by the Common Crawl Foundation's code of conduct. It parses the file at the root of each domain before crawling and caches the rules for 24 hours. If a path is explicitly disallowed, the bot will not request that URL or any sub-resources under it. This behavior is documented in the official webwalk README on GitHub (github.com/commoncrawl/webwalk).
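
To show what such a rule looks like from the site owner's side, the sketch below parses a sample robots.txt with Python's standard urllib.robotparser and checks which URLs the webwalk token may fetch. The sample rules, the example.com URLs, and the helper names allowed and rules_stale are assumptions for illustration. The 24-hour figure mirrors the cache lifetime stated above. This is not the crawler's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block webwalk from /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: webwalk
Disallow: /private/

User-agent: *
Disallow:
"""

CACHE_SECONDS = 24 * 3600  # cache lifetime described in this profile

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() also records the load time


def allowed(url: str, agent: str = "webwalk") -> bool:
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


def rules_stale(now: float) -> bool:
    """Return True once the cached rules are older than CACHE_SECONDS."""
    return now - parser.mtime() > CACHE_SECONDS


if __name__ == "__main__":
    print(allowed("https://example.com/private/report.html"))
    print(allowed("https://example.com/blog/"))
```

Under these sample rules, anything beneath /private/ is refused for webwalk while the rest of the site stays open to it.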

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; WebWalk/2.0; +https://commoncrawl.org/webwalk). A secondary User-Agent, WebWalk/1.0 (commoncrawl), may appear on older crawl jobs. The bot always includes a From header: [email protected]. Reverse DNS lookups on the source IPs resolve to *.ec2-internal.compute.amazonaws.com. Traffic typically shows a high volume of HEAD requests preceding GET requests, because the bot pre-checks content type and size.
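
The header-based indicators above can be combined into a simple check. The following is a minimal sketch, assuming request headers are available as a plain dictionary; the regular expression and the function name looks_like_webwalk are hypothetical, and the sample From value is a placeholder.

```python
import re

# Matches both "WebWalk/2.0" and "WebWalk/1.0" product tokens from this profile.
WEBWALK_UA = re.compile(r"\bWebWalk/\d+\.\d+\b")


def looks_like_webwalk(headers: dict) -> bool:
    """Return True if the User-Agent names WebWalk and a From header is set."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    has_from = bool(normalized.get("from"))
    return WEBWALK_UA.search(user_agent) is not None and has_from


if __name__ == "__main__":
    sample = {
        "User-Agent": "Mozilla/5.0 (compatible; WebWalk/2.0; "
                      "+https://commoncrawl.org/webwalk)",
        "From": "crawler@example.org",  # placeholder address
    }
    print(looks_like_webwalk(sample))
```

Headers are trivially spoofed, so a match here only means the request claims to be WebWalk. Checking the source IP and reverse DNS alongside the headers gives a stronger signal.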

📊 Data Usage

All data collected by WebWalk becomes part of the Common Crawl dataset, which is released monthly under the CC-BY-SA license. The corpus is used for AI training (e.g., large language models such as GPT-3 and Llama), web search indexing research, academic studies in natural language processing, and competitive intelligence. The dataset is also the foundation for Common Crawl's WebGraph service, which provides link-graph analytics.

⚙️ Rate Limiting Policy

WebWalk is rate-limited because, despite its ethical compliance, its monthly full-internet crawl can generate tens of thousands of requests per domain in a short period, potentially degrading site performance. Threshold-based blocking is a reasonable site-owner policy: a limit such as 50 req/s per IP sits well above the 10 to 20 req/s the crawler is described as sending from a single IP, so it protects server resources while still allowing the bot's legitimate, non-malicious data collection for open-source AI and research.
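
A threshold policy of this kind can be sketched as a per-IP sliding-window counter. The example below is one possible approach, not a description of any particular product. The class name SlidingWindowLimiter and its parameters are hypothetical, and the default of 50 requests per one-second window mirrors the example threshold above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int = 50, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now`; return False if it exceeds the limit."""
        timestamps = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3, 1.05):
        print(t, limiter.allow("54.211.10.5", now=t))
```

Passing the timestamp in explicitly keeps the limiter easy to test. In a live server the caller would typically supply time.monotonic().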

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.