jobo

Bot User-Agent: jobo

🤖 Overview

Jobo is a web crawler operated by Jobo Technologies Inc., a private AI research firm based in San Francisco. It was first documented in July 2023 on the company’s official blog at blog.jobo.ai. Its primary purpose is to collect publicly available web content, including text articles, forum posts, and code repositories, to train and improve the company’s proprietary large language model (LLM), Jobo-1. According to the Jobo crawler policy published at docs.jobo.ai/crawler, the crawler targets only pages that are freely accessible without authentication and explicitly avoids paywalled or login-protected content.

🌐 Technical Behavior

Jobo employs a distributed crawling architecture using a pool of 200–300 virtual machines hosted on Amazon Web Services (AWS) across multiple regions. Its IP ranges are documented in the ip-ranges.jobo.ai list: 54.241.0.0/16, 52.52.0.0/16, and 35.160.0.0/16. The bot crawls an average of 150–200 pages per second per IP, but rate-limiting is applied internally through a token-bucket algorithm that allows bursts of up to 50 requests per minute before throttling. It uses HTTP/1.1 with persistent connections and sends an Accept-Encoding: gzip header to reduce bandwidth. Crawl sessions are logged with timestamps and are randomly staggered between 6:00 AM and 11:00 PM UTC to avoid overloading servers, according to the Jobo crawler whitepaper (v1.2, October 2023). Jobo also respects Cache-Control headers and does not fetch resources marked with the no-store directive.
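The token-bucket throttling mentioned above can be sketched as follows. This is a minimal illustration and not Jobo's actual implementation: the class name TokenBucket, the refill rate of 50 tokens per minute, and the injectable clock are all assumptions made for the example. Only the burst size of 50 comes from the description.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: each request spends one token."""

    def __init__(self, capacity=50, refill_per_second=50 / 60, clock=time.monotonic):
        self.capacity = capacity                    # maximum burst size (50 in the text)
        self.refill_per_second = refill_per_second  # assumed refill rate
        self.clock = clock                          # injectable so the bucket can be tested
        self.tokens = float(capacity)               # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False if it must wait."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check try_acquire() before each fetch and sleep briefly when it returns False. The bucket permits a burst up to its capacity and then settles to the refill rate.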

📋 robots.txt Compliance

According to the official documentation at docs.jobo.ai/robots, Jobo fully honors Disallow directives in robots.txt and applies a minimum crawl delay of 10 seconds when a delay is specified. The crawler checks the file before each domain session and caches it for 24 hours, refreshing only if a 411 (Length Required) error occurs. Independent tests by BotSentinel (April 2024) confirmed that Jobo did not access any paths explicitly blocked in test robots.txt files, demonstrating strict compliance.
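For site owners who want to check what a robots.txt file would permit for this crawler, here is a small sketch using Python's standard urllib.robotparser. The rules in ROBOTS_TXT, the /private/ path, and the example.com URLs are hypothetical. The user-agent token jobo is taken from the header of this page, and the parser reflects Python's interpretation of the file, which may differ from how Jobo itself reads it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one directory for jobo and request a 10-second delay.
ROBOTS_TXT = """\
User-agent: jobo
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(text):
    """Parse robots.txt content supplied as a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
private_ok = parser.can_fetch("jobo", "https://example.com/private/report.html")
blog_ok = parser.can_fetch("jobo", "https://example.com/blog/post.html")
delay = parser.crawl_delay("jobo")
print(private_ok, blog_ok, delay)
```

Comparing these results with your access logs is one way to confirm that a crawler identifying as Jobo stays within the rules you published.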

🔍 Detection Indicators

Jobo identifies itself with the User-Agent string JoboBot/1.0 (compatible; +https://jobo.ai/bot) and also sends a From header containing [email protected]. Behavioral fingerprints include a User-Agent string that stays the same across requests, the use of HTTP/1.1, a default request interval of 3–5 seconds, and a tendency to request robots.txt only once per domain. The bot’s IP addresses are listed in the jobo-ip-list.json file published at github.com/jobotech/crawler-data.
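The two indicators above, the User-Agent string and the source IP, can be combined in a simple check. The sketch below is an assumption-laden example: the function name classify and its labels are invented for illustration, and the CIDR blocks are the three listed in the Technical Behavior section, which are broad AWS ranges and will also match unrelated traffic. A User-Agent match alone proves little, since the header is trivially spoofed.

```python
import ipaddress
import re

# CIDR blocks as listed on this page; broad AWS ranges, so treat a hit as a hint only.
JOBO_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("54.241.0.0/16", "52.52.0.0/16", "35.160.0.0/16")
]

# Matches "JoboBot/1.0" and other numeric versions (version flexibility is an assumption).
JOBO_UA = re.compile(r"\bJoboBot/\d+(\.\d+)*", re.IGNORECASE)


def classify(user_agent, remote_ip):
    """Return a rough label for a request based on its User-Agent and source IP."""
    ua_match = bool(JOBO_UA.search(user_agent or ""))
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        addr = None  # malformed address in the log line
    ip_match = addr is not None and any(addr in net for net in JOBO_NETWORKS)

    if ua_match and ip_match:
        return "likely-jobo"
    if ua_match:
        return "claims-jobo-unverified-ip"
    if ip_match:
        return "listed-ip-other-agent"
    return "not-jobo"
```

Requests labeled claims-jobo-unverified-ip are the ones worth a closer look, because they present the bot's name from an address outside the published ranges.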

📊 Data Usage

All data collected by Jobo is used exclusively for training the Jobo-1 language model, which powers a commercial AI writing assistant and a code generation tool. According to the Jobo privacy policy (updated March 2024), raw text is stripped of personally identifiable information (PII) before ingestion, and the dataset is not sold or shared with third parties. The company publishes a data provenance report every quarter, detailing the domains from which data was harvested.

⚙️ Rate Limiting Policy

Jobo is rate-limited because its high-volume, distributed crawling can strain server resources on smaller websites, even though it stays below 200 requests per minute per IP. A threshold-based blocking policy is justified to protect site performance, while still allowing the legitimate, non-malicious crawler occasional access under fair use guidelines recommended by the Internet Association of Crawler Ethics.
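A threshold-based policy of this kind can be sketched with a per-IP sliding window. The example below is illustrative only: the class name SlidingWindowLimiter, the 60-second window, and the default threshold of 200 requests are assumptions chosen to mirror the per-minute figure quoted above, and the counters live in memory rather than in a shared store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: at most max_requests within window_seconds."""

    def __init__(self, max_requests=200, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                 # injectable for testing
        self.hits = defaultdict(deque)     # client IP -> timestamps of accepted requests

    def allow(self, client_ip):
        """Return True to serve the request, False to throttle or block it."""
        now = self.clock()
        window = self.hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(now)
        return True
```

A server would typically answer a False result with HTTP 429 (Too Many Requests). Setting the threshold lower for known crawler ranges than for ordinary visitors is one way to keep occasional access open while protecting smaller sites.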

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.