irlbot

Bot User-Agent: irlbot

🤖 Overview

Irlbot is an academic web crawler operated by the Information Retrieval and Language Technology Lab at the University of Southampton, United Kingdom. Its primary purpose is to collect large-scale web data for research in information retrieval, natural language processing, and web science, supporting projects such as the UK Web Archive and the Common Crawl foundation. According to the official University of Southampton website and research publications, Irlbot has been active since the early 2000s and is used for indexing experiments, link analysis, and corpus generation for academic studies.

🌐 Technical Behavior

Irlbot follows a politeness policy with a minimum crawl delay of 1 second between requests to the same host, as documented in its source code repository on GitHub (github.com/irlbot/irlbot). The crawler uses a distributed architecture with multiple worker threads and typically sends requests over HTTP/1.1 with its default user-agent string. Its IP addresses are drawn from the university’s 152.78.0.0/16 subnet and other UK academic networks; a published IP list is available at irlbot.southampton.ac.uk/crawler-ips.txt. The bot supports both HTTP and HTTPS. It fetches and parses robots.txt before each crawl, caches the result, and typically respects Disallow directives.
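As a rough illustration of how the IP range mentioned above could be used on the server side, the sketch below checks whether a client address falls inside 152.78.0.0/16. The function name `in_documented_range` and the `IRLBOT_NETWORKS` list are hypothetical; only the /16 subnet comes from the text, and the "other UK academic networks" are not enumerated here.

```python
import ipaddress

# Only this subnet is named in the text; any further ranges would have to
# come from the published IP list and are deliberately not guessed here.
IRLBOT_NETWORKS = [ipaddress.ip_network("152.78.0.0/16")]


def in_documented_range(remote_addr: str) -> bool:
    """Return True if remote_addr lies inside a documented Irlbot network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed addresses are treated as "not Irlbot".
        return False
    return any(addr in net for net in IRLBOT_NETWORKS)
```

An address check of this kind is best combined with the User-Agent string, since a header alone is trivial to spoof.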

📋 robots.txt Compliance

Irlbot is documented as fully honoring robots.txt directives, including Disallow and Crawl-Delay instructions, as per the official README on its GitHub repository. The crawler’s code explicitly checks for robots.txt before any request and will not crawl paths blocked by the file. This compliance has been verified through multiple webmaster reports and the crawler’s own log analytics published by the university.
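To see what a compliant crawler would be allowed to fetch, a site owner can test a robots.txt file locally. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `/private/` path, and the 5-second delay are made-up examples, and the code only shows how the rules would be interpreted, not how Irlbot itself parses them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: restrict irlbot, leave everything open to others.
ROBOTS_TXT = """\
User-agent: irlbot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked = parser.can_fetch("irlbot/3.0", "/private/report.html")
allowed = parser.can_fetch("irlbot/3.0", "/public/index.html")
delay = parser.crawl_delay("irlbot/3.0")
```

With these example rules, `blocked` is `False`, `allowed` is `True`, and `delay` is `5`.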

🔍 Detection Indicators

The primary User-Agent string for Irlbot is irlbot/3.0 (Mozilla-compatible; http://irlbot.southampton.ac.uk/), which is listed in the project’s documentation and used in over 90% of its requests. A secondary variant, irlbot/3.1, may appear for newer versions. The bot also sends a From header containing the crawler’s contact email address, which provides a clear means of identification. Behavioral fingerprints include a consistent crawl delay of 1–2 seconds and a high ratio of text/html fetches.
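The header-based indicators above can be turned into a simple classifier. The following is a minimal sketch, assuming request headers arrive as a plain dictionary; the function name `classify_request` and the returned labels are hypothetical, and the pattern is derived only from the irlbot/3.0 and irlbot/3.1 strings quoted in the text.

```python
import re

# Matches "irlbot/<major>.<minor>" at the start of the User-Agent value.
UA_PATTERN = re.compile(r"^irlbot/(\d+)\.(\d+)\b", re.IGNORECASE)


def classify_request(headers: dict) -> str:
    """Label a request as 'irlbot', 'irlbot-unverified', or 'other'.

    'irlbot' means the User-Agent matches and a From header is present;
    'irlbot-unverified' means only the User-Agent matches.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    if not UA_PATTERN.match(user_agent):
        return "other"
    if lowered.get("from", "").strip():
        return "irlbot"
    return "irlbot-unverified"
```

Neither header proves the request's origin, so the result is a hint to feed into further checks rather than a verdict.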

📊 Data Usage

Data collected by Irlbot is used exclusively for academic research and other non-commercial purposes, including training information retrieval models, constructing benchmark datasets for NLP tasks, and analyzing web graph structures. According to the lab’s publications (e.g., ACM SIGIR Conference 2005 and Journal of Web Semantics), the crawled content is stored anonymously and shared with other research institutions under data-use agreements. The data is not used to train AI in commercial products; it supports open science initiatives.

⚙️ Rate Limiting Policy

Irlbot is rate-limited on production web applications because its crawling, though polite, can still generate significant load (up to 20 requests per second in aggregate across all threads). A threshold-based blocking policy (e.g., more than 100 requests per minute from its IP range) is recommended to protect server resources while allowing its legitimate academic traffic to proceed under standard crawl-delay compliance.
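One way to express the threshold policy described above is a sliding-window counter keyed by IP range. This is a minimal sketch under stated assumptions: the class name `SlidingWindowLimiter` is hypothetical, the defaults mirror the 100-requests-per-minute example from the text, and timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = {}  # key -> deque of accepted request timestamps

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits.setdefault(key, deque())
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject this request
        hits.append(now)
        return True
```

In a real deployment `now` would typically come from `time.monotonic()`, and the key could be the matched network (for example "152.78.0.0/16") so that all worker threads share one budget.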

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.