deeptrawl
Bot User-Agent: deeptrawl
Overview
Deeptrawl is a web crawler operated by Deeptrawl Inc., a data services company that provides high-quality web extraction for AI model training and fine-tuning. Launched in March 2023, the bot systematically collects publicly accessible content to build curated datasets for large language models (LLMs). Official documentation at docs.deeptrawl.com confirms its User-Agent and robots.txt policy.
Technical Behavior
The crawler runs on a multi-threaded, distributed architecture and rotates through IPv4 addresses hosted on Amazon Web Services (AWS) EC2 and Google Cloud Platform (GCP). It averages 2 requests per second per IP but can burst to 10 requests per second when fetching multiple pages from a single domain. It supports HTTP/1.1 and HTTP/2, sends standard Accept headers, and includes a custom X-Deeptrawl-Client: 1 header. The bot follows standard HTTP redirects and respects both meta robots tags and canonical URLs.
robots.txt Compliance
Deeptrawl fully honors the Robots Exclusion Standard. It fetches robots.txt from each domain root before crawling and caches the file for up to 24 hours. Since 2024, it also supports the Crawl-delay directive, a nonstandard extension that lets webmasters set a minimum interval between requests.
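To see how such rules would be evaluated, the following minimal sketch parses a sample robots.txt with Python's standard urllib.robotparser. The file contents, the /private/ path, the 10-second delay, and the example.com URLs are illustrative assumptions, as is the idea that Deeptrawl matches on the product token "Deeptrawl"; none of this is taken from Deeptrawl's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the path and the delay value are assumptions
# chosen for illustration, not rules published by Deeptrawl.
ROBOTS_TXT = """\
User-agent: Deeptrawl
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

# User-Agent string as quoted in this entry.
DEEPTRAWL_UA = "Deeptrawl/1.0 (compatible; Deeptrawl Inc.; +https://deeptrawl.com/bot)"


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_rules(ROBOTS_TXT)

# Is the crawler allowed to fetch these (hypothetical) URLs?
blocked = rules.can_fetch(DEEPTRAWL_UA, "https://example.com/private/report.html")
allowed = rules.can_fetch(DEEPTRAWL_UA, "https://example.com/blog/post.html")

# Minimum interval, in seconds, requested for this crawler.
delay = rules.crawl_delay(DEEPTRAWL_UA)

print(blocked, allowed, delay)
```

This only shows how a compliant parser reads the rules. Whether the crawler actually obeys them can only be confirmed from server logs.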
Detection Indicators
The primary User-Agent is Deeptrawl/1.0 (compatible; Deeptrawl Inc.; +https://deeptrawl.com/bot). A secondary string, Deeptrawl-Research/1.0, is used for academic partnerships. Additional headers include From: [email protected] and the custom X-Deeptrawl-Client. Reverse DNS lookups of client IPs often resolve to AWS EC2 hostnames (*.compute.amazonaws.com or *.compute-1.amazonaws.com) or GCP hostnames (*.bc.googleusercontent.com).
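As a rough illustration of header-based detection, the sketch below classifies a request from its User-Agent and the X-Deeptrawl-Client header described above. The function name classify_request, the regular expression, and the returned labels are hypothetical choices for this example, not part of any Deeptrawl or Boteraser API.

```python
import re
from typing import Mapping

# Pattern derived from the two User-Agent strings quoted in this entry.
# The regex itself is an illustrative assumption.
UA_PATTERN = re.compile(r"^Deeptrawl(-Research)?/\d+(\.\d+)*\b", re.IGNORECASE)


def classify_request(headers: Mapping[str, str]) -> str:
    """Return 'deeptrawl', 'deeptrawl-research', or 'unknown'."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    match = UA_PATTERN.match(normalized.get("user-agent", ""))
    if match:
        return "deeptrawl-research" if match.group(1) else "deeptrawl"

    # Fall back to the custom header when the User-Agent does not match.
    if "x-deeptrawl-client" in normalized:
        return "deeptrawl"
    return "unknown"


primary = classify_request(
    {"User-Agent": "Deeptrawl/1.0 (compatible; Deeptrawl Inc.; +https://deeptrawl.com/bot)"}
)
research = classify_request({"user-agent": "Deeptrawl-Research/1.0"})
print(primary, research)
```

Any client can forge these headers, so a match is best treated as a hint and combined with network-level checks before acting on it.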
Data Usage
Collected data is processed into structured datasets sold to research institutions and commercial AI developers. Deeptrawl Inc. also uses the data internally for proprietary model training. Per their privacy policy, they strip personally identifiable information (PII) and sensitive content before distribution.
Rate Limiting Policy
Because Deeptrawl crawls persistently and at high volume, rate limiting is recommended to prevent server overload. A suggested policy allows up to 100 requests per minute per IP from known ranges and applies a 24-hour block to any IP that exceeds 500 requests per minute. This keeps the site stable while still permitting legitimate collection.
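One way such a policy could be expressed is a per-IP sliding-window counter, sketched below. The class name DeeptrawlRateLimiter and its check method are hypothetical. The policy above does not say what happens between 100 and 500 requests per minute, so this sketch assumes those requests are throttled (for example with an HTTP 429 response) and still counted toward the block threshold.

```python
from collections import defaultdict, deque

ALLOW_PER_MINUTE = 100         # allowance from the policy above
BLOCK_THRESHOLD = 500          # block threshold from the policy above
WINDOW_SECONDS = 60.0
BLOCK_SECONDS = 24 * 60 * 60   # 24-hour block


class DeeptrawlRateLimiter:
    """Per-IP sliding-window limiter (illustrative sketch, in-memory only)."""

    def __init__(self) -> None:
        self._hits = defaultdict(deque)   # ip -> timestamps of recent attempts
        self._blocked_until = {}          # ip -> time at which the block expires

    def check(self, ip: str, now: float) -> str:
        """Return 'allow', 'throttle', or 'block' for one request attempt."""
        if self._blocked_until.get(ip, 0.0) > now:
            return "block"

        hits = self._hits[ip]
        # Drop attempts that have fallen out of the 60-second window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        hits.append(now)

        if len(hits) > BLOCK_THRESHOLD:
            self._blocked_until[ip] = now + BLOCK_SECONDS
            hits.clear()
            return "block"
        if len(hits) > ALLOW_PER_MINUTE:
            # Assumption: over-allowance requests are throttled, not blocked.
            return "throttle"
        return "allow"


limiter = DeeptrawlRateLimiter()
# Simulate 501 attempts from one address within about five seconds.
results = [limiter.check("203.0.113.7", i * 0.01) for i in range(501)]
print(results[0], results[100], results[500])
```

The caller supplies the timestamp, which keeps the sketch easy to test. In a real deployment it would typically come from time.monotonic(), and the counters would live in a shared store instead of process memory.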
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.