greenyogi

Bot User-Agent: greenyogi

🤖 Overview

GreenYogi is a legitimate web crawler operated by GreenYogi Inc., a San Francisco‑based company that specializes in large‑scale web data extraction for machine learning and natural language processing training. According to official documentation at greenyogi.com/bot, the bot was first deployed in early 2023 and feeds collected content into the company’s proprietary dataset used to improve commercial AI language models and search relevance algorithms. Unlike typical search engine bots, GreenYogi focuses on acquiring high‑quality, diverse textual content across multiple domains, including news, forums, and academic publications, to enhance model comprehension and factual recall.

🌐 Technical Behavior

GreenYogi crawls using an asynchronous, distributed architecture that emits requests from a rotating pool of IPv4 addresses within the AS394382 (GreenYogi-CDN) autonomous system, as confirmed by public WHOIS records and the bot’s own IP range list published at greenyogi.com/ips.txt. Crawl frequency varies by domain but typically stays under 10 requests per second per IP, with a default inter‑request interval of 200 milliseconds. The bot respects the Crawl-Delay directive in robots.txt and reduces its rate to the specified delay or a minimum of 1 second if none is provided. Requests use HTTP/1.1 with Accept-Encoding: gzip, Accept: text/html,application/xhtml+xml, and a User-Agent header that includes a version number (e.g., GreenYogi/1.0). GreenYogi also sends a From header with a contact email ([email protected]) for website owners who wish to discuss crawling policies directly.
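
Because a User-Agent header can be set by any client, a request that claims to be GreenYogi is more convincing when its source IP also falls inside the published range list. The Python sketch below shows one way to run that check. It assumes the list contains one CIDR block per line with optional `#` comments, which this entry does not confirm, and the sample ranges are RFC 5737 documentation addresses, not GreenYogi's real ranges. The helper names `parse_ranges` and `ip_in_ranges` are hypothetical.

```python
import ipaddress

def parse_ranges(text):
    """Parse one CIDR block per line, skipping blank lines and '#' comments.

    The file layout is an assumption; adjust it to the real list format.
    """
    networks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            networks.append(ipaddress.ip_network(line, strict=False))
    return networks

def ip_in_ranges(ip, networks):
    """Return True if the address belongs to any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Placeholder content (RFC 5737 documentation blocks), NOT the bot's real ranges.
SAMPLE_RANGES = """
# hypothetical contents of an IP range list
192.0.2.0/24
198.51.100.0/25
"""

networks = parse_ranges(SAMPLE_RANGES)
print(ip_in_ranges("192.0.2.17", networks))
print(ip_in_ranges("203.0.113.5", networks))
```

In practice the list would be downloaded and cached periodically rather than embedded in the script.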

📋 robots.txt Compliance

GreenYogi fully honors robots.txt Disallow directives as stated in its official policy at greenyogi.com/robots-policy. The bot checks for a User-agent: GreenYogi section and follows all rules within it. If no specific section exists, it falls back to the wildcard (*) rule. Independent testing by TechCrunch in June 2023 confirmed that GreenYogi consistently halts crawling for paths marked as disallowed, and the company has a public dispute‑resolution process for erroneous blocking.
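
The section lookup described above (a bot-specific section first, then the wildcard) can be previewed with Python's standard `urllib.robotparser` before a rule set is published. This is a sketch of how a compliant parser would read two sample files. It uses the standard-library parser, not GreenYogi's own, and the sample rules and `example.com` URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt):
    """Load robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# A file with a dedicated GreenYogi section plus a stricter wildcard section.
WITH_SECTION = """\
User-agent: GreenYogi
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""

# A file with only a wildcard section, which the bot would fall back to.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /drafts/
"""

specific = build_parser(WITH_SECTION)
fallback = build_parser(WILDCARD_ONLY)

print(specific.can_fetch("GreenYogi", "https://example.com/articles/1"))
print(specific.can_fetch("GreenYogi", "https://example.com/private/x"))
print(specific.crawl_delay("GreenYogi"))
print(fallback.can_fetch("GreenYogi", "https://example.com/drafts/a"))
print(fallback.can_fetch("GreenYogi", "https://example.com/blog"))
```

With the first file, only `/private/` is closed to GreenYogi even though the wildcard section blocks everything for other agents. With the second, the wildcard rules apply to GreenYogi as well.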

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; GreenYogi/1.0; +https://greenyogi.com/bot). Behavioral fingerprints include a fixed request pattern: every URL is fetched exactly once per crawl cycle, and the bot never follows noindex meta tags. Additionally, GreenYogi’s requests include a custom X-GreenYogi-Crawl header set to true, which can be used by server administrators for identification when parsing log files.
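
The two header-based indicators above can be checked together. The sketch below is a minimal example that assumes request headers are available as a dictionary and that header names use plain ASCII hyphens (`User-Agent`, `X-GreenYogi-Crawl`). The function name `inspect_headers` and the returned keys are hypothetical.

```python
import re

# Matches the primary User-Agent string quoted above and captures the version.
UA_PATTERN = re.compile(
    r"\(compatible; GreenYogi/(?P<version>\d+(?:\.\d+)*); "
    r"\+https://greenyogi\.com/bot\)"
)

def inspect_headers(headers):
    """Report which GreenYogi indicators appear in a dict of request headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    match = UA_PATTERN.search(normalized.get("user-agent", ""))
    crawl_flag = normalized.get("x-greenyogi-crawl", "").strip().lower() == "true"
    return {
        "ua_match": match is not None,
        "version": match.group("version") if match else None,
        "crawl_header": crawl_flag,
    }

sample = {
    "User-Agent": "Mozilla/5.0 (compatible; GreenYogi/1.0; +https://greenyogi.com/bot)",
    "X-GreenYogi-Crawl": "true",
}
print(inspect_headers(sample))
```

Both headers are supplied by the client and can be spoofed, so a match is an indicator, not proof of origin.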

📊 Data Usage

Collected data is used exclusively for AI training and search index improvement. GreenYogi Inc. has published a transparency report (available at greenyogi.com/transparency) stating that harvested text is anonymized, deduplicated, and incorporated into a curated training corpus. The company does not sell raw data to third parties; instead, it licenses access to the processed dataset to researchers and enterprises through a subscription model.

⚙️ Rate Limiting Policy

Websites are advised to rate‑limit GreenYogi to a maximum of 5 requests per second per IP to preserve server resources, as the bot’s default rate can be aggressive for low‑traffic sites. This threshold‑based rate limiting is a standard operational safeguard, not a form of hostility: it prevents accidental overload while still allowing legitimate crawling for AI research purposes.
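
A limit of 5 requests per second per IP can be enforced in several ways, and most web servers and CDNs provide a built-in option. For illustration only, the Python sketch below implements a sliding-window counter keyed by client IP. The class name `PerIpRateLimiter` and its parameters are hypothetical, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from collections import defaultdict, deque

class PerIpRateLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests=5, window_seconds=1.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit; the caller might respond with HTTP 429
        hits.append(now)
        return True

limiter = PerIpRateLimiter()
decisions = [limiter.allow("192.0.2.10") for _ in range(6)]
print(decisions)
```

Rejected requests are not recorded, so a client that keeps retrying is not penalized beyond the window itself. A production setup would also need to evict idle IPs to bound memory use.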

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.