cherrypicker
Bot User-Agent: cherrypicker
🤖 Overview
CherryPicker is a web crawler operated by Cherry (cherry.ai), a data infrastructure company founded in 2022 that focuses on AI training data collection and knowledge graph construction. Launched publicly in August 2023, the bot systematically indexes publicly accessible web content to feed Cherry's proprietary large language model training pipelines, knowledge graph products, and enterprise data-as-a-service offerings. CherryPicker is explicitly described as a legitimate, non-malicious crawler on the official Cherry website at cherry.ai/crawler.
🌐 Technical Behavior
CherryPicker performs distributed crawling using virtual machines hosted on major cloud providers including AWS, Google Cloud, and Azure. It makes HTTP GET requests at an average rate of 1–2 requests per second per domain, with a default Crawl-Delay of 10 seconds as specified in its robots.txt configuration. The bot uses both HTTP/1.1 and HTTP/2 protocols and sends a User-Agent header of CherryPicker/1.0 along with a From header containing [email protected]. IP address ranges are published in a DNS TXT record at _cherrypicker._ip._cherry.ai, as documented in Cherry's crawler FAQ. The bot respects noindex meta tags and nofollow link attributes, and it does not fetch external resources such as images, CSS, or JavaScript. Crawl frequency is adaptive: CherryPicker reduces its rate if it receives HTTP 429 (Too Many Requests) responses.
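One way to use the published IP ranges is to check whether a request's source address falls inside them. The sketch below assumes the ranges have already been retrieved and are available as CIDR strings. The format of the TXT record is not described here, so the `PUBLISHED_RANGES` values are placeholder documentation ranges and the function name is hypothetical.

```python
import ipaddress

# Hypothetical CIDR ranges for illustration only. The real list would come
# from the operator's published DNS TXT record; these are RFC 5737
# documentation ranges, not actual crawler addresses.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/25"]


def ip_in_published_ranges(ip, ranges=PUBLISHED_RANGES):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    networks = (ipaddress.ip_network(r) for r in ranges)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to pass in.
    return any(address in net for net in networks)
```

A request that claims the CherryPicker User-Agent but originates outside the published ranges is likely an impersonator, since the User-Agent header alone is trivially spoofed.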
📋 robots.txt Compliance
CherryPicker fully adheres to robots.txt Disallow directives. Cherry's official policy, published on their website and in a corresponding GitHub repository (github.com/cherry-ai/crawler-policy), states that the bot checks the robots.txt file before every crawl session and will not access URLs blocked by the site owner. Additionally, Cherry provides an opt-out web form at cherry.ai/opt-out for site owners who wish to exclude their entire domain from crawling. Webmasters have reported that CherryPicker reliably respects X-Robots-Tag: none HTTP headers as well.
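As a rough illustration of what a compliant crawler would conclude from a given robots.txt, the sketch below parses a sample file with Python's standard `urllib.robotparser`. The sample rules, the `CherryPicker` user-agent token in the file, and the example.com URLs are assumptions made for the example; they are not taken from Cherry's documentation.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt (an assumption for this example): keep CherryPicker
# out of /private/ and ask for a 10-second delay; allow everyone else.
ROBOTS_TXT = """\
User-agent: CherryPicker
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# The parser compares only the product token before the "/" in the User-Agent.
blocked = parser.can_fetch("CherryPicker/1.0", "https://example.com/private/report")
allowed = parser.can_fetch("CherryPicker/1.0", "https://example.com/blog/")
delay = parser.crawl_delay("CherryPicker/1.0")
```

Testing rules this way before deploying them helps catch typos in the user-agent token or path prefixes, which would otherwise silently leave the content crawlable.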
🔍 Detection Indicators
The primary identifying User-Agent string is CherryPicker/1.0; occasionally the bot uses the shorter CherryPicker without the version suffix. Behavioral fingerprints include a consistent request interval of approximately 1 second, low variance in request timing, and a distinct Accept header of text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8. The bot also sends a Connection: keep-alive header and an Accept-Language: en-US,en;q=0.9 header. It does not execute JavaScript or render pages, which distinguishes it from headless browsers. In server logs, its remote IP addresses resolve to cloud provider hostnames such as ec2-*.compute.amazonaws.com.
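Two of these indicators, the User-Agent string and the regular request timing, lend themselves to a simple log check. The following is a minimal sketch, not a production detector: the function names and the thresholds (`tolerance`, `max_stdev`) are illustrative assumptions, and timestamps are taken to be seconds as floats.

```python
import re
from statistics import mean, pstdev

# Matches "CherryPicker" with an optional version suffix such as "/1.0".
UA_PATTERN = re.compile(r"^CherryPicker(/\d+(\.\d+)*)?$")


def matches_user_agent(user_agent):
    """Return True if the header is exactly one of the described UA strings."""
    return UA_PATTERN.match(user_agent.strip()) is not None


def has_regular_timing(timestamps, target=1.0, tolerance=0.25, max_stdev=0.1):
    """Return True if gaps between requests average near `target` seconds
    and vary little. Thresholds are assumed values for illustration."""
    if len(timestamps) < 3:
        return False  # too few requests to judge variance
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return abs(mean(gaps) - target) <= tolerance and pstdev(gaps) <= max_stdev
```

Either signal alone is weak: the User-Agent can be spoofed, and other clients may also poll at steady intervals. Combining them with the header and hostname indicators gives a more reliable picture.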
📊 Data Usage
Data collected by CherryPicker is used to train Cherry's large language models, which power their AI summarization, question-answering, and knowledge base products. The company also aggregates crawled content into structured knowledge graphs used for enterprise data analytics and research. According to their privacy policy, Cherry stores crawled data temporarily, anonymizes personally identifiable information, and does not sell raw user data. Some content is used for benchmarking AI model performance against public datasets.
⚙️ Rate Limiting Policy
Rate limiting is recommended because CherryPicker can briefly spike to higher request rates during initial domain discovery phases, potentially causing load issues on shared hosting environments. A threshold-based throttling policy, for example returning HTTP 429 when requests from a single IP exceed 20 per second, is appropriate to protect application resources while still allowing the legitimate crawler to operate. Because CherryPicker reduces its crawl rate when it receives HTTP 429 responses, such a policy slows the bot without excluding it. Cherry itself advises site owners to use rate limits rather than blocking, as stated in their crawler documentation.
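The threshold policy described above can be sketched as a per-IP sliding-window counter. This is a minimal in-memory illustration, assuming a single process and caller-supplied timestamps in seconds; the class and method names are hypothetical, and a real deployment would more likely rely on the rate limiting built into its web server or CDN.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow up to max_requests per IP within a rolling time window."""

    def __init__(self, max_requests=20, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def status_for(self, ip, now):
        """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429  # rejected requests are not counted against the window
        hits.append(now)
        return 200
```

With the defaults, the first 20 requests from an IP inside any one-second window receive 200 and further requests receive 429 until older entries expire. Not counting rejected requests means a client that backs off regains access as soon as the window clears.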
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.