lachesis

Bot User-Agent: lachesis

🤖 Overview

Lachesis is a high-performance web crawler developed by the University of Cambridge Computer Laboratory as part of the Lachesis research project, first publicly documented in a 2016 paper titled "Lachesis: A Scalable Web Crawler for the Web of Data". The crawler is designed to systematically collect web pages for academic studies in web science, including link analysis, content evolution, and web graph characterization. Unlike commercial search engine bots, Lachesis operates under strict research ethics guidelines and publishes its crawl data for non-commercial use.

🌐 Technical Behavior

Lachesis employs a breadth-first crawling strategy with configurable politeness delays, typically set to a minimum of 5 seconds between requests to the same domain. The crawler supports both HTTP/1.1 and HTTP/2 and uses a distributed agent architecture that can scale to thousands of concurrent requests across multiple machines. Its IP addresses originate from the University of Cambridge's address space (e.g., 128.232.0.0/16) and a small pool of academic proxies. Lachesis respects the Robots Exclusion Protocol: it parses robots.txt before every crawl session and caches the result for up to 24 hours. The crawler also handles 404 and 503 responses gracefully and backs off exponentially on server errors.
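The pacing behavior described above can be sketched as a small per-host scheduler. This is a minimal illustration, not Lachesis's actual code: the class name `PolitenessScheduler`, the one-hour backoff cap, and the choice to treat every 5xx status as a server error are assumptions. Only the 5-second minimum delay comes from the description above.

```python
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 5.0  # minimum per-domain delay described above


class PolitenessScheduler:
    """Hypothetical helper: tracks when each host may next be requested."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, max_backoff=3600.0):
        self.min_delay = min_delay
        self.max_backoff = max_backoff  # assumed cap, not from the source
        self._next_allowed = {}  # host -> earliest next request time
        self._failures = {}      # host -> consecutive server errors

    def wait_time(self, url, now):
        """Seconds to wait before `url` may be fetched at time `now`."""
        host = urlsplit(url).hostname
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record(self, url, status, now):
        """Update the host's schedule after a response with `status`."""
        host = urlsplit(url).hostname
        if 500 <= status < 600:
            # Server error: double the delay for each consecutive failure.
            failures = self._failures.get(host, 0) + 1
            delay = min(self.max_backoff, self.min_delay * 2 ** failures)
        else:
            # Any other response (including 404) resets the backoff.
            failures = 0
            delay = self.min_delay
        self._failures[host] = failures
        self._next_allowed[host] = now + delay
        return delay
```

Under these assumptions, a 200 response schedules the next request to that host 5 seconds later, while two consecutive 503 responses push it out by 10 and then 20 seconds.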

📋 robots.txt Compliance

According to the Lachesis project documentation on its GitHub repository (github.com/cambridge/lachesis-crawler), the crawler strictly adheres to all Disallow directives in robots.txt. It does not discard rules when the file contains syntax errors; instead, it uses a lenient parser to interpret common variations. Empirical studies have shown that Lachesis fully honors per-path and per-agent restrictions, making it one of the most compliant research crawlers.
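Site operators who want to restrict the crawler can check their rules before deploying them. The sketch below uses Python's standard `urllib.robotparser`, which is not the crawler's own lenient parser and may interpret edge cases differently. The example robots.txt content, the `/private/` path, and the helper name `allowed` are illustrative assumptions; the `lachesis` token is the bot User-Agent listed at the top of this page.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: keep lachesis out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: lachesis
Disallow: /private/

User-agent: *
Disallow:
"""

LACHESIS_UA = (
    "Lachesis/1.0 (University of Cambridge; "
    "+https://lachesis.cl.cam.ac.uk/bot.html)"
)


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(LACHESIS_UA, "https://example.com/private/report.html"))
    print(allowed(LACHESIS_UA, "https://example.com/public/index.html"))
```

With this file, the per-agent rule applies only to Lachesis; other agents fall through to the catch-all entry, which allows every path.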

🔍 Detection Indicators

The primary User-Agent string is Lachesis/1.0 (University of Cambridge; +https://lachesis.cl.cam.ac.uk/bot.html). Variations include Lachesis/1.1 and Lachesis/2.0 for different crawl phases; every variation contains the project URL. The bot also sends a From header with the contact email [email protected]. Behavioral fingerprints include a consistent minimum delay of 5 seconds between requests to the same domain and no JavaScript execution or content evaluation.
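A log filter for the User-Agent strings above might look like the following sketch. It assumes that the 1.1 and 2.0 variations use the same template as the primary string with only the version number changed, which the source does not confirm; the function name `lachesis_version` is hypothetical.

```python
import re

# Assumed template: only the version differs between variations.
UA_PATTERN = re.compile(
    r"^Lachesis/(?P<version>\d+\.\d+) "
    r"\(University of Cambridge; "
    r"\+https://lachesis\.cl\.cam\.ac\.uk/bot\.html\)$"
)


def lachesis_version(user_agent):
    """Return the version string if the header matches, else None."""
    match = UA_PATTERN.match(user_agent.strip())
    return match.group("version") if match else None
```

Because any client can send an arbitrary User-Agent header, a match is best treated as a hint and combined with a source IP check.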

📊 Data Usage

Collected data is used exclusively for academic research, including the construction of web graphs, temporal analysis of web content changes, and training of web science algorithms. The Lachesis team publishes aggregated crawl statistics and sample datasets on the project website, but raw page content is not redistributed to avoid copyright issues. No commercial AI training or indexing is performed.

⚙️ Rate Limiting Policy

Although Lachesis is a legitimate research crawler, its scale (up to thousands of requests per minute across a fleet) can overwhelm under-provisioned servers. To protect site stability, administrators should rate-limit requests from the Cambridge IP ranges to a threshold their servers can sustain (e.g., 10 requests per second per instance) and verify that the crawler keeps to its own delay settings.
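One way to apply such a limit is a token bucket keyed on the source network. The sketch below is illustrative only: the names `is_cambridge_ip` and `TokenBucket` are hypothetical, the burst capacity of 10 is an assumption, and 128.232.0.0/16 is the example range given earlier, which does not cover the academic proxies also mentioned.

```python
import ipaddress

# Example range from the Technical Behavior section; proxies are not covered.
CAMBRIDGE_NET = ipaddress.ip_network("128.232.0.0/16")


def is_cambridge_ip(addr):
    """True if `addr` falls inside the example Cambridge range."""
    return ipaddress.ip_address(addr) in CAMBRIDGE_NET


class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate=10.0, capacity=10.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = None  # time of the last call to allow()

    def allow(self, now):
        """Return True if a request arriving at time `now` is permitted."""
        if self.updated is not None:
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request handler could call `is_cambridge_ip` on the client address and, for matching clients, consult a shared `TokenBucket` before serving the response, returning HTTP 429 when `allow` is false.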

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.