WebsiteExtractor Bot — Detection, Blocking & Technical Analysis

WebsiteExtractor

Bot User-Agent: websiteextractor

🤖 Overview

WebsiteExtractor is a web crawler operated by WebExtract Inc., first announced in June 2016, designed to collect publicly available web content for AI training and structured data extraction. The bot feeds data into the proprietary WebIntel knowledge base, which is used to train large language models and power commercial search and summarization products. According to the official documentation at https://websitextractor.com/robot, the crawler has indexed over 10 billion pages as of 2024.

🌐 Technical Behavior

The crawler uses a custom asynchronous HTTP client supporting HTTP/2 and TLS 1.3. It initiates sessions from a pool of 50 distinct IP addresses within the 192.0.2.0/24 range, rotating every 10 minutes. It issues up to 10 requests per second per IP (600 pages per minute) and only processes URLs that return HTTP status 200. The bot follows internal pagination links and sitemap references, prioritizing new or updated content over revisiting stale pages. It respects HTTP caching directives and uses conditional GET requests with ETag and Last-Modified validators to reduce bandwidth consumption.
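Conditional GET only saves bandwidth if the origin server answers it correctly. The following is a minimal Python sketch of how a server might decide between 200 and 304 for such a request. The function name `conditional_status` and its parameters are hypothetical and are not taken from the crawler's documentation; the precedence of If-None-Match over If-Modified-Since follows standard HTTP semantics.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    request_headers: dict of header name -> value (hypothetical input shape)
    etag:            current entity tag of the resource, including quotes
    last_modified:   timezone-aware datetime of the last change
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = {_strip_weak(t) for t in if_none_match.split(",")}
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # an unparseable date is ignored
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200
```

A server that returns 304 with an empty body in these cases lets the crawler reuse its cached copy instead of downloading the page again.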

📋 robots.txt Compliance

Based on the policy published at https://websitextractor.com/robots-txt-policy, WebsiteExtractor fully honors robots.txt Disallow directives. It checks for a robots.txt file at each domain root before crawling and caches the parsed rules for 24 hours. However, the bot does not respect the Crawl-Delay directive, instead implementing its own adaptive rate limiting to manage crawl budgets.
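Because the bot is reported to honor Disallow rules, a robots.txt entry is the simplest way to restrict it. The sketch below uses Python's standard `urllib.robotparser` to check how a rule set would be interpreted for the `websiteextractor` token. The robots.txt content, the `/private/` path, and the `is_allowed` helper are illustrative assumptions, and the standard-library parser may not match the crawler's own parser in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption for illustration): block the bot from
# /private/ while leaving every other agent unrestricted.
ROBOTS_TXT = """\
User-agent: websiteextractor
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/private/report.html", "/blog/post-1"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "websiteextractor", url))
```

Since the bot reportedly ignores Crawl-Delay, adding that directive to the file is unlikely to slow it down; request throttling has to be enforced on the server side.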

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; WebsiteExtractor/2.0; +https://websitextractor.com). Additional identifying characteristics include a custom HTTP header X-WebExtract-Token set to a UUID unique per session. The IP ranges are publicly listed at https://websitextractor.com/ips.txt for easy identification and whitelisting by server administrators.
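The indicators above can be combined into a simple server-side check. The following Python sketch assumes the published range is 192.0.2.0/24 (the range named in the Technical Behavior section; in practice the list at the ips.txt URL should be loaded instead) and that request headers arrive as a plain dictionary. The function names and the three labels are hypothetical.

```python
import ipaddress
import re
import uuid

# Matches the product token in the documented User-Agent string.
UA_PATTERN = re.compile(r"\bWebsiteExtractor/\d+(?:\.\d+)*", re.IGNORECASE)

# Assumption: hard-coded here for illustration only.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def has_valid_token(headers):
    """True if X-WebExtract-Token is present and parses as a UUID."""
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("x-webextract-token")
    if not value:
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def classify(user_agent, remote_ip, headers):
    """Label a request as 'verified', 'unverified', or 'other'."""
    if not UA_PATTERN.search(user_agent or ""):
        return "other"
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return "unverified"
    in_range = any(ip in net for net in PUBLISHED_RANGES)
    if in_range and has_valid_token(headers):
        return "verified"
    # The User-Agent claims to be the bot, but the IP or token disagrees.
    return "unverified"
```

A User-Agent string is trivial to spoof, so the sketch treats it only as a claim and relies on the source IP and token to confirm it.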

📊 Data Usage

Collected data is processed through a pipeline that extracts text, images, and metadata for training AI models and building the WebIntel knowledge graph. The data powers summarization, entity extraction, semantic search, and competitive analysis tools. According to the company's privacy policy, public data may be retained for up to 12 months and is not used to collect personally identifiable information.

⚙️ Rate Limiting Policy

Although WebsiteExtractor is a legitimate agent, its high request rate and broad crawl coverage can degrade server performance. A rate limit of 100 requests per 60 seconds per IP is recommended to prevent resource exhaustion. This threshold is well below the crawler's stated rate of 10 requests per second per IP, so it throttles the bot while still allowing it to continue collecting data.
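One way to enforce a per-IP threshold like this is a sliding-window counter. The Python sketch below is an in-memory illustration under the stated limit of 100 requests per 60 seconds; the class name `SlidingWindowLimiter` and its interface are hypothetical, and a production deployment would more likely use the rate limiting built into the web server or a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each IP."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now=None):
        """Return True and record the request if ip is under the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    allowed = sum(limiter.allow("192.0.2.17", now=0.0) for _ in range(150))
    print(allowed)  # requests beyond the limit are rejected
```

Rejected requests would typically receive HTTP 429 with a Retry-After header so that a well-behaved client can back off.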

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.