monkeycrawl
Crawler User-Agent: monkeycrawl
🤖 Overview
monkeycrawl is a legitimate web crawler operated by MonkeyLearn, a text analytics and machine learning platform headquartered in San Francisco. First documented in early 2020, the bot is designed to collect publicly accessible web content to train and improve MonkeyLearn’s proprietary NLP models, which power services such as sentiment analysis, keyword extraction, and intent classification. According to the official MonkeyLearn documentation (monkeylearn.com/crawler), the crawler’s primary purpose is to gather representative text samples for model refinement, not to index the web for search.
🌐 Technical Behavior
The crawler employs a distributed architecture, issuing requests from a pool of IP addresses that belong primarily to AWS (ec2-*.compute.amazonaws.com) and Google Cloud Platform (compute.googleapis.com). Its default request frequency is approximately one request every two seconds per domain, as stated in the monkeycrawl GitHub repository (github.com/monkeylearn/monkeycrawl). The bot speaks HTTP/1.1, follows redirects, and sends a custom User-Agent header identifying itself. Crawl patterns are breadth-first, with a maximum depth of 5 levels per host. It does not fetch binary assets (images, PDFs) unless explicitly required by the AI training pipeline, and it respects the Cache-Control header for rate limiting.
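To make the described crawl pattern concrete, here is a minimal Python sketch of a breadth-first traversal with a depth cap and a fixed per-request delay. It is not MonkeyLearn's code: the in-memory `links` graph, the function names, and the choice to count the start page as depth 0 are all assumptions made for illustration.

```python
from collections import deque


def bfs_crawl_order(links, start, max_depth=5):
    """Return (url, depth) pairs in breadth-first order, stopping at max_depth.

    `links` is a hypothetical in-memory map of page -> outgoing links;
    a real crawler would discover links by fetching and parsing pages.
    """
    seen = {start}
    queue = deque([(start, 0)])  # assumption: the start page is depth 0
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


def request_offsets(order, delay_s=2.0):
    """Pair each URL with the second offset at which it would be requested."""
    return [(i * delay_s, url) for i, (url, _depth) in enumerate(order)]
```

With a two-second delay, a host with a few hundred reachable pages would see the crawl spread over several minutes rather than arriving as a burst.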
📋 robots.txt Compliance
MonkeyLearn explicitly states in its official user-agent documentation that monkeycrawl honors the Disallow directives in a site’s robots.txt file. The crawler parses the file at the start of each session and will not crawl paths marked as off-limits. If a domain does not host a robots.txt file, the bot proceeds with its default crawl policy. The monkeycrawl GitHub README makes the same claim, noting that “we strictly follow the Robots Exclusion Standard.”
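A site owner can check how a compliant parser would read a given rule set before deploying it. The sketch below uses Python's standard `urllib.robotparser` for that; the robots.txt content and the `/private/` path are made-up examples, and the result shows how the standard-library parser interprets the rules, not a guarantee of how monkeycrawl itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep monkeycrawl out of /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: monkeycrawl
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Pass the short product token (`monkeycrawl`) rather than the full User-Agent string, since the parser matches on the token before the first slash.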
🔍 Detection Indicators
The primary detection indicator is the User-Agent string: Mozilla/5.0 (compatible; monkeycrawl/1.0; +https://monkeylearn.com/crawler/). Additional fingerprints include a custom header X-MonkeyCrawl: 1.0 and a deliberate 2-second delay between requests. The crawler also sends a From header containing [email protected]. No known CVE entries relate to this bot; it is considered safe and transparent.
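The indicators above can be turned into a simple request classifier. The following Python sketch checks the User-Agent token and the X-MonkeyCrawl header; the function name and the regular expression are illustrative assumptions, and because both signals are self-reported headers that any client can forge, a match should be treated as a hint and not as proof of origin.

```python
import re

# Matches the product token, e.g. "monkeycrawl/1.0", anywhere in the User-Agent.
UA_PATTERN = re.compile(r"\bmonkeycrawl/\d+(?:\.\d+)*", re.IGNORECASE)


def looks_like_monkeycrawl(headers):
    """Return True if request headers carry either monkeycrawl indicator.

    `headers` is a plain dict of header name -> value; names are compared
    case-insensitively, as HTTP header names are.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    ua_match = bool(UA_PATTERN.search(lowered.get("user-agent", "")))
    header_match = "x-monkeycrawl" in lowered
    return ua_match or header_match
```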
📊 Data Usage
Collected text data is used exclusively to train MonkeyLearn’s AI models for natural language understanding. The company’s privacy policy (monkeylearn.com/privacy) specifies that raw crawled content is not stored beyond the model training cycle and is anonymized. The data feeds into classifiers for sentiment, topic, and entity recognition, improving the accuracy of their commercial APIs.
⚙️ Rate Limiting Policy
Because monkeycrawl can generate sustained traffic from distributed IPs, rate-limiting it is recommended to prevent accidental overloading of server resources. MonkeyLearn’s own support articles document a guideline threshold of 10 requests per minute. Blocking is applied only after repeated violations of this limit, because the bot acknowledges 429 Too Many Requests responses and backs off.
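One way to enforce the 10-requests-per-minute guideline is a sliding-window counter that answers 429 once the window is full. The Python sketch below is a minimal in-memory version under stated assumptions: the class name is hypothetical, timestamps are passed in explicitly to keep the logic testable, and the limiter is keyed by crawler name instead of IP address because the traffic arrives from many IPs.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each key."""

    def __init__(self, limit=10, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def status_for(self, key, now):
        """Return the HTTP status to send: 200 if accepted, 429 if over the limit."""
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429
        hits.append(now)
        return 200
```

In a live server, `now` would typically come from `time.monotonic()`, and a production setup would more likely rely on the rate-limiting features of the web server or CDN than on application code.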
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.