jennybot

Bot User-Agent: jennybot

🤖 Overview

jennybot is a web crawler operated by Jenny AI Inc., a San Francisco-based company specializing in AI-powered web agents. First publicly documented in early 2024, the bot is designed to collect publicly accessible web content for training proprietary large language models used in Jenny’s conversational AI platform. According to the official Jenny AI documentation at https://jenny.ai/bots, the crawler indexes text, images, and structured data from websites to improve the factual accuracy and breadth of its AI responses.

🌐 Technical Behavior

jennybot employs a distributed crawling architecture using IP addresses primarily from Amazon Web Services (AWS EC2) and Google Cloud Platform, with ranges published in the project’s GitHub repository at https://github.com/jenny-ai/crawler-ips. The bot sends a steady stream of HTTP/1.1 GET requests with a default interval of 200 milliseconds between requests, which works out to approximately 5 requests per second per IP. It respects Cache-Control headers and uses ETag and If-Modified-Since to avoid re-downloading unchanged resources. The crawler honors rel="nofollow" link attributes and noindex directives (it does not follow or index the affected content) and limits its crawl depth to 10 levels by default. It does not execute JavaScript, relying solely on raw HTML parsing to extract data.
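To make the revalidation and request-rate figures concrete, here is a minimal Python sketch. It is not jennybot's code: the function names are hypothetical, and the only assumptions are the standard HTTP convention that a stored ETag is sent back in an If-None-Match header and the 200 ms interval quoted above.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> Dict[str, str]:
    """Build revalidation headers from validators saved on a previous fetch.

    Hypothetical helper: a server that sees matching validators can answer
    304 Not Modified with no body, so the resource is not downloaded again.
    """
    headers: Dict[str, str] = {}
    if etag:
        # An ETag from the earlier response is echoed back in If-None-Match.
        headers["If-None-Match"] = etag
    if last_modified:
        # The earlier Last-Modified value is echoed back in If-Modified-Since.
        headers["If-Modified-Since"] = last_modified
    return headers


def requests_per_second(interval_ms: float) -> float:
    """Convert a fixed gap between requests into a per-second rate."""
    return 1000.0 / interval_ms


print(requests_per_second(200))  # 5.0
print(conditional_headers('"abc123"', "Wed, 01 May 2024 10:00:00 GMT"))
```

A fixed 200 ms gap therefore corresponds to 5 requests per second from each crawling IP, which is the figure used in the rate-limiting guidance further down.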

📋 robots.txt Compliance

Based on the official robots.txt policy published at https://jenny.ai/robots.txt, jennybot fully honors Disallow directives and respects Crawl-Delay instructions. The bot checks robots.txt at the start of each crawl session and caches the file for up to 24 hours, as verified by independent testing reported in the Web Crawler Compliance Database (2024). There is no evidence of any violation of robots.txt rules; the operator explicitly states compliance in their documentation.
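Site operators who want to check how a rule set would apply to jennybot can do so with Python's standard urllib.robotparser module. The sketch below is illustrative only: the robots.txt content, the /private/ path, and the example.com URLs are made up for the example and are not taken from Jenny AI's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a site might publish for this crawler.
ROBOTS_TXT = """\
User-agent: jennybot
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler would skip the disallowed path and fetch the other one.
print(parser.can_fetch("jennybot", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("jennybot", "https://example.com/blog/"))  # True
print(parser.crawl_delay("jennybot"))  # 10
```

This only shows what the rules say. Whether a given crawler actually obeys them has to be confirmed from server logs.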

🔍 Detection Indicators

The primary User-Agent string is "Mozilla/5.0 (compatible; JennyBot/1.0; +https://jenny.ai/bot)". Additionally, the bot includes a custom HTTP header X-Jenny-Crawler: 1 and a From header containing a contact email address (shown in obfuscated form in the source as "[email protected]"). Behavioral fingerprints include a consistent request pattern with a fixed 200 ms gap and, by default, no support for gzip encoding, as noted in server logs from major websites. The bot’s IP ranges are listed in a JSON file in the Jenny AI GitHub repository at https://github.com/jenny-ai/crawler-ips.
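The header-based indicators above can be turned into a simple log or middleware check. The following is a minimal sketch, assuming request headers are available as a string-to-string mapping; the function name looks_like_jennybot is hypothetical.

```python
import re
from typing import Mapping

# Case-insensitive match on the bot token from the documented User-Agent.
UA_PATTERN = re.compile(r"\bjennybot\b", re.IGNORECASE)


def looks_like_jennybot(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers carry jennybot's documented markers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if UA_PATTERN.search(normalized.get("user-agent", "")):
        return True
    # Fall back to the custom header described above.
    return normalized.get("x-jenny-crawler", "").strip() == "1"


sample = {"User-Agent": "Mozilla/5.0 (compatible; JennyBot/1.0; +https://jenny.ai/bot)"}
print(looks_like_jennybot(sample))  # True
```

Headers are trivially spoofed, so a match here is a hint rather than proof. Confirming the source IP against the published ranges is the stronger check.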

📊 Data Usage

Collected data is used exclusively for training Jenny AI’s foundation models, including its conversational agent and code generation tools. The company states in its privacy policy (https://jenny.ai/privacy) that raw crawl data is not sold or shared with third parties; it is processed, anonymized, and incorporated into training corpora. Jenny AI also uses the data to improve search relevance in its own internal search engine, Jenny Search, as described in the product blog at https://blog.jenny.ai/crawling-insights.

⚙️ Rate Limiting Policy

Rate limiting is recommended for jennybot because its default crawl speed, while moderate, can still generate significant traffic on high-volume pages, and the bot does not automatically reduce its rate based on server response times. Web administrators are advised to set a Crawl-Delay of 10 seconds in robots.txt, or to throttle any single IP address that exceeds 20 requests per second, to ensure fair resource allocation without blocking legitimate indexing.
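As an illustration of the IP-based option, here is a minimal sliding-window limiter in Python. It is a sketch under stated assumptions, not a production design: the class name SlidingWindowLimiter is hypothetical, state is kept in memory for a single process, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 1.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of recently accepted requests, keyed by client IP.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds); return False to throttle it."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=20, window_seconds=1.0)

# A client sending one request every 200 ms (5 per second) is never throttled.
steady = [limiter.allow("198.51.100.10", i * 0.2) for i in range(50)]
print(all(steady))  # True
```

With a threshold of 20 requests per second, a single IP crawling at the documented 5 requests per second passes through untouched, and only bursts above the threshold are rejected. In a real deployment the same policy is usually configured in the web server or CDN rather than in application code.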


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.