80legs

Bot User-Agent: 80legs

🤖 Overview

80legs is a commercial web crawling platform operated by 80legs, Inc. (now part of the Netseer group). It launched in 2007 and is based in San Francisco. The platform provides customizable, large-scale web data extraction for enterprises, researchers, and developers, and feeds data into products such as 80legs Web Crawler and 80legs Data Marketplace. It is not a search engine bot; it is an on-demand crawling service that users configure for specific domains, keywords, or patterns.

🌐 Technical Behavior

80legs uses a distributed architecture of thousands of concurrent workers, each assigned to crawl a small portion of a target site. The user-agent string defaults to 80legs (without a trailing slash), but clients can customize it, so variations are common. Crawl frequency depends on the user's configuration. By default the platform follows a polite crawling policy, waiting at least 10 seconds between successive requests to the same host, and it recommends that clients respect Crawl-Delay directives. Requests originate from a range of data centers (e.g., AWS, DigitalOcean), and the IP ranges are not published as a static list, which makes IP-based and geolocation-based blocking unreliable. The bot supports both plain HTTP/1.1 and HTTPS. When operating through proxies it sends a distinct Via header with the value 1.1 80legs, as documented in the official knowledge base (80legs.com/reference/ua).
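The per-host delay described above can be illustrated with a small sketch. This is not 80legs' implementation; the class name PoliteScheduler and its methods are hypothetical, and the 10-second default simply mirrors the figure quoted in this section.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical sketch: tracks when each host may be requested again."""

    def __init__(self, min_delay=10.0):
        # min_delay is the gap in seconds between requests to one host
        # (10 s mirrors the default quoted in the text; it is an assumption).
        self.min_delay = min_delay
        self._next_allowed = {}  # host -> earliest timestamp of next request

    def wait_time(self, url, now):
        """Seconds a worker should still wait before fetching `url`."""
        host = urlsplit(url).hostname
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that `url` was fetched at time `now`."""
        host = urlsplit(url).hostname
        self._next_allowed[host] = now + self.min_delay
```

A worker would call wait_time before each fetch and record_request after it. Timestamps are passed in explicitly so the logic can be checked without sleeping; in practice they would typically come from time.monotonic().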

📋 robots.txt Compliance

According to 80legs' official documentation, the bot honors Disallow directives in robots.txt by default, but clients can disable this behavior for custom crawls if they explicitly acknowledge the change. The platform also respects Crawl-Delay settings and parses the standard User-agent: 80legs directive. However, many 80legs crawls run under custom user agents, and a rule that targets the 80legs token only matches crawls that still identify themselves with it; administrators may need additional detection signals to catch the rest. A sample robots.txt entry is provided at 80legs.com/robots.
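To see how a rule set aimed at the 80legs token is interpreted, the following sketch feeds an illustrative robots.txt to Python's standard urllib.robotparser. The file contents, the /private/ path, and the example.com URLs are assumptions made for the example, not the sample published by 80legs.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not the vendor's published sample):
# slow 80legs down, keep it out of /private/, and allow everyone else.
ROBOTS_TXT = """\
User-agent: 80legs
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network I/O."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
blocked = rules.can_fetch("80legs", "https://example.com/private/report")
allowed = rules.can_fetch("80legs", "https://example.com/catalog")
delay = rules.crawl_delay("80legs")
```

Here blocked is False, allowed is True, and delay is 10. The parser is queried with the bare 80legs token; a crawl running under a custom user agent would fall through to the wildcard group, which is why robots.txt alone cannot cover every 80legs crawl.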

🔍 Detection Indicators

The primary User-Agent token is 80legs (e.g., Mozilla/5.0 (compatible; 80legs)). Secondary identifiers include the X-80legs request header (value true) and a traffic pattern of low-to-moderate per-host request rates spread across many unfamiliar IP addresses. Behavioral fingerprints include rapid but polite iteration over paginated content and the use of both GET and POST methods for form-based crawling. 80legs' own FAQ (80legs.com/faq) confirms that the user agent can be customized, so checking for the Via header or the X-80legs header in addition to the User-Agent is advised for more accurate detection.
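The header checks listed above can be combined into one helper, sketched below. The function name detect_80legs_signals is hypothetical, and the header names and values are taken from this section as stated rather than verified independently.

```python
def detect_80legs_signals(headers):
    """Return the names of the 80legs indicators present in request headers.

    `headers` is a plain mapping of header name to value. The indicators
    (User-Agent token, Via value, X-80legs header) follow the description
    in the text and are assumptions, not verified vendor behavior.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []
    if "80legs" in normalized.get("user-agent", "").lower():
        signals.append("user-agent")
    if "80legs" in normalized.get("via", "").lower():
        signals.append("via")
    if normalized.get("x-80legs", "").strip().lower() == "true":
        signals.append("x-80legs")
    return signals
```

Returning the list of matched signals rather than a single boolean lets a site decide how much evidence it needs; for example, a request with a customized User-Agent but a Via value of 1.1 80legs would still report one signal.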

📊 Data Usage

The collected data is used primarily for business intelligence applications, including price monitoring, product catalog aggregation, news aggregation, academic research, and machine learning training datasets. 80legs also offers a data marketplace where subscribers can purchase crawled datasets. The platform does not use the data to train a public AI model directly; rather, it serves as a data extraction tool for clients' own analytics or proprietary systems.

⚙️ Rate Limiting Policy

80legs is rate-limited because its distributed workers can unintentionally overwhelm small or poorly configured servers if left unchecked. Threshold-based blocking protects server stability while still allowing legitimate clients to complete their crawls. Webmasters are encouraged to set Crawl-Delay in robots.txt to enforce a polite rate without blocking the bot entirely; the official recommendation is a delay of at least 10 seconds between requests.
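As an illustration of threshold-based limiting on the server side, the sketch below counts requests per client in a sliding time window. It is one common technique, not a description of how any particular product enforces its policy; the class name and the default of 6 requests per 60 seconds (roughly one request every 10 seconds on average) are assumptions chosen to match the delay quoted above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical sketch: allow at most `max_requests` per client per window."""

    def __init__(self, max_requests=6, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent request times

    def allow(self, client_key, now):
        """Return True if the request at time `now` is within the threshold."""
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject or delay this request
        hits.append(now)
        return True
```

The client key could be an IP address, a user-agent token, or a combination of both. Because 80legs traffic is spread across many addresses, keying on the IP alone may undercount it, so a key derived from detection signals is likely a better fit.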

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.