webcorp

Bot User-Agent: webcorp

🤖 Overview

The WebCorp crawler is operated by the Research and Development Unit for English Studies at the University of Birmingham, UK, as part of the WebCorp Linguist’s Search Engine project (webcorp.org.uk). Its primary purpose is to systematically collect publicly accessible web pages to build a large, up-to-date corpus for linguistic analysis, enabling researchers to study language usage, collocations, and grammatical patterns across the internet.

🌐 Technical Behavior

WebCorp’s crawler sends requests at a moderate, configurable rate, typically between 1 and 5 requests per second per domain, to avoid overloading servers, as documented in the project’s operational guidelines. It uses standard HTTP/1.1 and respects the robots.txt Crawl-delay directive if present. The crawler does not publish a fixed IP range, but traffic originates from the university’s address space within AS786 (JANET), with addresses in ranges such as 147.188.192.0/18 commonly observed. It fetches HTML pages, extracts plain text, and discards images, scripts, and multimedia to minimize bandwidth usage. The crawler follows links recursively but limits depth to three levels per domain so that the corpus covers a diverse set of sites.
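As a rough illustration of how the observed range could be used when reviewing logs, the sketch below checks whether a client address falls inside 147.188.192.0/18. The function name and constant are hypothetical, and the range is only the one quoted above, not an official allowlist, so a match is a hint rather than proof of identity.

```python
import ipaddress

# Range quoted in the text above; an observation, not a published allowlist.
OBSERVED_WEBCORP_NETWORK = ipaddress.ip_network("147.188.192.0/18")


def in_observed_range(ip: str) -> bool:
    """Return True if `ip` parses and lies inside the observed network."""
    try:
        return ipaddress.ip_address(ip) in OBSERVED_WEBCORP_NETWORK
    except ValueError:
        # Malformed address strings are treated as non-matching.
        return False
```

The /18 prefix covers 147.188.192.0 through 147.188.255.255, so an address such as 147.188.10.1 is outside it even though it shares the first two octets.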

📋 robots.txt Compliance

According to the official WebCorp documentation (webcorp.org.uk/faq/crawler), the crawler fully honors robots.txt Disallow directives and also respects X-Robots-Tag HTTP headers. Tests conducted by the university show that WebCorp will not crawl paths explicitly blocked, and it checks for updates to the robots.txt file at regular intervals. This compliance is vital for maintaining ethical research practices and avoiding unnecessary server load.
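To see how a compliant crawler would interpret a set of rules, the following sketch parses a sample robots.txt with Python's standard `urllib.robotparser`. The rules, the `example.com` URLs, and the `webcorp` user-agent token are assumptions for illustration; this shows how the standard-library parser reads the file, not how WebCorp itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one directory for webcorp, allow everything else.
EXAMPLE_ROBOTS_TXT = """\
User-agent: webcorp
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/private/page.html",
                "https://example.com/articles/"):
        print(url, parser.can_fetch("webcorp", url))
    print("crawl delay:", parser.crawl_delay("webcorp"))
```

Note that Crawl-delay is a widely used extension rather than part of the core Robots Exclusion Protocol, so support varies between crawlers.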

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; WebCorp/1.0; +http://webcorp.org.uk/bot.html), as listed on the project’s bot identification page. The crawler also includes an identifying X-Bot: WebCorp header in all requests. Log entries typically show sequential requests from a single IP, without the randomized delays seen in malicious scrapers, and the Referer header often points to webcorp.org.uk.
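The indicators above could be combined into a simple header check, sketched below. The function name and the regular expression are hypothetical choices; the header names and values are the ones quoted in this section. Because any client can send these headers, a match identifies a request that claims to be WebCorp, nothing more.

```python
import re

# Matches the product token from the documented User-Agent, e.g. "WebCorp/1.0".
WEBCORP_UA = re.compile(r"\bWebCorp/\d", re.IGNORECASE)


def looks_like_webcorp(headers: dict) -> bool:
    """Return True if request headers carry either WebCorp indicator."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    x_bot = normalized.get("x-bot", "")
    return bool(WEBCORP_UA.search(user_agent)) or x_bot.strip().lower() == "webcorp"
```

In practice this check would usually be paired with a source-address check before a request is trusted as genuine crawler traffic.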

📊 Data Usage

Collected text is parsed and indexed into the WebCorp search interface, where linguists can perform concordance queries, n-gram analysis, and collocation searches. The data is used exclusively for academic linguistic research and is not sold or used for AI model training. Raw page archives are retained for a limited period (up to 12 months) as per the project’s privacy policy (webcorp.org.uk/privacy).

⚙️ Rate Limiting Policy

Rate limiting is applied to WebCorp because its sustained, scripted crawl patterns can consume significant bandwidth and degrade performance on smaller sites. Threshold-based blocking (for example, allowing 10 requests per minute per IP) ensures fair resource allocation while still letting the crawler complete its research data collection responsibly.
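One way to implement a threshold such as 10 requests per minute per IP is a sliding-window counter, sketched below. The class name, defaults, and injectable clock are assumptions made for illustration; production setups more often enforce this at the web server or CDN layer.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client IP."""

    def __init__(self, limit=10, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        timestamps = self.hits[ip]
        # Discard timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # over the threshold: reject, e.g. with HTTP 429
        timestamps.append(now)
        return True
```

Only accepted requests are recorded, so a client that keeps retrying while blocked does not extend its own lockout.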

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.