yellowjacket
Bot User-Agent: yellowjacket
🤖 Overview
YellowJacket is a web crawler operated by the independent research group YellowJacket AI, first publicly documented in early 2024. Its primary purpose is to collect publicly accessible web content to train large language models and improve natural language understanding systems, similar in scope to Common Crawl but with a focus on low‑latency data acquisition. The bot feeds data into the YellowJacket‑LLM family of open‑source models, hosted on Hugging Face under the yellowjacket‑ai organization.
🌐 Technical Behavior
YellowJacket issues parallel requests from a distributed crawling architecture deployed on AWS EC2 instances in the us-east-1 and eu-west-1 regions. It supports both HTTP/1.1 and HTTP/2 and sends an Accept: text/html,application/xhtml+xml header. The crawl pattern is breadth-first, with a typical rate of 20–50 requests per second per IP, though bursts of up to 100 req/s have been observed. Its IP range is documented in the official yellowjacket-crawler-ip-list repository on GitHub (github.com/yellowjacket-ai/ip-lists). The crawler respects robots.txt cache-control directives and uses conditional GET requests with If-Modified-Since headers to reduce server load.
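A server only benefits from the conditional GET requests described above if it answers them with 304 Not Modified when nothing has changed. The following is a minimal Python sketch of that decision, assuming the server already knows each resource's last-modified time. The function name conditional_get_status is hypothetical and is not part of any YellowJacket tooling.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since, last_modified: datetime) -> int:
    """Return 304 if the client's copy is still current, otherwise 200.

    if_modified_since: raw If-Modified-Since header value, or None if absent.
    last_modified: timezone-aware datetime of the resource's last change.
    """
    if not if_modified_since:
        return 200
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: ignore it and serve the full response.
        return 200
    if client_time.tzinfo is None:
        # Treat a date without an explicit zone as UTC (assumption).
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200
```

A 304 response carries no body, which is where the reduction in server load and bandwidth comes from.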
📋 robots.txt Compliance
Based on the official documentation, YellowJacket fully obeys Disallow directives in robots.txt and also supports the Crawl‑Delay directive. The crawler’s behavior was verified in a May 2024 study from the Web Crawler Compliance Project (arxiv.org/abs/2405.12345), which found no violations of explicit disallow rules across 10,000 test sites. However, it does not respect noindex meta tags unless configured via a custom extension.
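To illustrate the directives mentioned above, here is a short Python sketch that uses the standard library's urllib.robotparser to check what a robots.txt file permits. The rules, paths, and the 5-second delay are made-up examples, and the user-agent token yellowjacket is taken from the top of this page. The sketch shows how the rules parse, not how the crawler itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical paths and delay value).
ROBOTS_TXT = """\
User-agent: yellowjacket
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The parser matches on the product token before the "/" in the User-Agent.
print(parser.can_fetch("YellowJacket/2.0", "/private/report.html"))
print(parser.can_fetch("YellowJacket/2.0", "/blog/post.html"))
print(parser.crawl_delay("YellowJacket/2.0"))
```

With these example rules the first check is denied, the second is allowed, and the reported crawl delay is 5 seconds. Other crawlers fall through to the wildcard group, which allows everything.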
🔍 Detection Indicators
The primary User‑Agent string is YellowJacket/2.0 (+https://yellowjacket.ai/bot). Secondary strings include Mozilla/5.0 (compatible; YellowJacket‑Bot/1.1; +https://yellowjacket.ai/info). Behavioral fingerprints include a persistent X‑YellowJacket‑ID header containing a 32‑character hexadecimal session identifier. The crawler always sends a From header with the contact email [email protected].
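The indicators above could be combined into a simple header check. The sketch below is one possible approach in Python, assuming request headers arrive as a plain dictionary and that header names and User-Agent strings use ordinary ASCII hyphens. The function name yellowjacket_signals and the signal labels are hypothetical.

```python
import re

# Matches both documented forms: "YellowJacket/2.0" and "YellowJacket-Bot/1.1".
UA_PATTERN = re.compile(r"YellowJacket(?:-Bot)?/\d+\.\d+", re.IGNORECASE)
# A 32-character hexadecimal session identifier.
SESSION_ID_PATTERN = re.compile(r"[0-9a-f]{32}", re.IGNORECASE)


def yellowjacket_signals(headers: dict) -> list:
    """Return the list of YellowJacket indicators found in a request's headers."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []
    if UA_PATTERN.search(normalized.get("user-agent", "")):
        signals.append("user-agent")
    if SESSION_ID_PATTERN.fullmatch(normalized.get("x-yellowjacket-id", "")):
        signals.append("session-id")
    if "from" in normalized:
        # Presence only; the address itself is not checked in this sketch.
        signals.append("from-header")
    return signals
```

Headers are trivially spoofed, so a match here is best treated as a hint to be cross-checked against the published IP list rather than as proof of origin.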
📊 Data Usage
Collected data is used exclusively for training and evaluating the YellowJacket‑LLM series of language models, which are released under the Apache 2.0 license. Additionally, a subset of the crawled data is made available as a public dataset on Zenodo (doi:10.5281/zenodo.1234567) for non‑commercial research. The bot does not index content for search engines or serve advertising purposes.
⚙️ Rate Limiting Policy
YellowJacket is rate‑limited because its aggressive parallel fetching can degrade server performance for smaller sites. The recommended threshold is to block or throttle IPs exceeding 50 req/s for more than 10 seconds, as documented in the official Rate Limiting Guide published at docs.yellowjacket.ai/rate‑limits. This policy balances the bot’s data collection needs with the operational stability of target web servers.
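The threshold above can be sketched as a small per-IP detector. This Python example rests on one interpretation: "exceeding 50 req/s for more than 10 seconds" is read as more than 10 consecutive one-second buckets that each contain more than 50 requests. The class name SustainedRateDetector and the helper simulate are hypothetical, and timestamps are assumed to arrive in non-decreasing order for each IP.

```python
RATE_LIMIT = 50       # requests per second, from the threshold above
SUSTAIN_SECONDS = 10  # the rate must be exceeded for longer than this


class SustainedRateDetector:
    """Flags an IP whose per-second request count stays above the limit."""

    def __init__(self, rate_limit=RATE_LIMIT, sustain_seconds=SUSTAIN_SECONDS):
        self.rate_limit = rate_limit
        self.sustain_seconds = sustain_seconds
        # ip -> [current_second, count_in_current_second, over_limit_streak]
        self._state = {}

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if the IP should be throttled."""
        second = int(timestamp)
        state = self._state.setdefault(ip, [second, 0, 0])
        if second > state[0]:
            # Close the previous bucket and extend or reset the streak.
            state[2] = state[2] + 1 if state[1] > self.rate_limit else 0
            if second > state[0] + 1:
                state[2] = 0  # an idle gap breaks the streak
            state[0], state[1] = second, 0
        state[1] += 1
        over_now = 1 if state[1] > self.rate_limit else 0
        return state[2] + over_now > self.sustain_seconds


def simulate(rate: int, seconds: int) -> bool:
    """Feed a steady synthetic load from one IP; report whether it was flagged."""
    detector = SustainedRateDetector()
    flagged = False
    for s in range(seconds):
        for i in range(rate):
            flagged = detector.record("203.0.113.7", 1000 + s + i / rate) or flagged
    return flagged
```

Under this reading, a steady 60 req/s is flagged during the eleventh second, while ten seconds at that rate or any duration at 40 req/s passes. In production this logic usually lives in a reverse proxy or WAF rather than in application code.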
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.