doy

Bot User-Agent: doy

🤖 Overview

doy is a web crawler operated by the AI research organization Doy Labs, first documented in a 2023 blog post on their official site (doy.ai). The bot is designed to collect publicly accessible web content to train and improve the Doy series of large language models, specifically the Doy-3 and Doy-4 models. Its primary purpose is to index textual data from news, blog posts, and forums to enhance natural language understanding and generation capabilities.

🌐 Technical Behavior

doy issues HTTP/1.1 and HTTP/2 GET requests with a configurable crawl delay, defaulting to 1 request per second per IP address. The bot uses a rotating pool of IPv4 addresses from ASN 398356 (Doy Labs) and ASN 202482 (a cloud provider partner), with ranges such as 45.79.0.0/16 and 104.237.128.0/18. It sends an Accept-Encoding: gzip header and a custom From header containing a contact email ([email protected]). Crawl depth is typically limited to 5 levels, and the bot follows rel="nofollow" and rel="noopener" link attributes. It also parses the X-Robots-Tag response header for indexing directives, and supports both robots.txt and Sitemap discovery via the Sitemaps protocol.
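To check whether a request's source address falls inside the ranges quoted above, a minimal Python sketch follows. It assumes the two CIDR blocks in this listing are representative rather than complete; the names DOY_RANGES and in_listed_ranges are hypothetical and chosen only for illustration.

```python
import ipaddress

# CIDR blocks quoted in this listing. Assumption: they are examples,
# not the operator's full or current address pool.
DOY_RANGES = [
    ipaddress.ip_network("45.79.0.0/16"),
    ipaddress.ip_network("104.237.128.0/18"),
]


def in_listed_ranges(ip: str) -> bool:
    """Return True if ip parses as an address inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input is treated as "not in range".
        return False
    return any(addr in net for net in DOY_RANGES)
```

A range match alone does not prove a request came from doy, since the listing describes at least one of the ASNs as belonging to a cloud provider partner, so other tenants may share the same address space.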

📋 robots.txt Compliance

According to the Doy Labs documentation at docs.doy.ai/crawler, doy fully honors Disallow directives in robots.txt, including wildcard patterns and user-agent specific blocks. The bot caches robots.txt for up to 24 hours and re-fetches it upon a 4xx or 5xx response. Community forum reports indicate that it also respects Crawl-delay directives set in robots.txt, in line with the organization's stated policy of "graceful crawling."
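As a rough way to preview how a robots.txt file would apply to this bot, the sketch below parses an example file with Python's standard urllib.robotparser. The robots.txt content, the /private/ path, and the 10-second delay are hypothetical, and the user-agent token doy is assumed from the "Bot User-Agent" field of this listing. Note that urllib.robotparser does simple prefix matching and does not evaluate wildcard patterns, so it only approximates the behavior described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block doy from /private/ and ask for a
# 10-second crawl delay, while leaving other agents unrestricted.
ROBOTS_TXT = """\
User-agent: doy
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked = parser.can_fetch("doy", "https://example.com/private/page")
allowed = parser.can_fetch("doy", "https://example.com/blog/post")
delay = parser.crawl_delay("doy")
```

Here blocked is expected to be False, allowed True, and delay 10. Whether the live crawler behaves the same way can only be confirmed from server logs.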

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Doy/1.0; +https://doy.ai/bot). A secondary string, DoyBot/2.0, is used when crawling HTTPS sites. Behavioral fingerprints include requests arriving from the IP ranges noted above, a request interval of exactly 1 second when not rate-limited, and a consistent Accept: text/html,application/xhtml+xml header. The bot also sets a doy_session cookie to track visit duration within a single crawl session.
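The two User-Agent strings above can be matched with a single regular expression. The following Python sketch assumes the product tokens keep the form Doy/<version> and DoyBot/<version>; the names DOY_UA_PATTERN and looks_like_doy are hypothetical.

```python
import re

# Matches the product tokens "Doy/<version>" and "DoyBot/<version>".
# Assumption: future versions keep the same token names and casing.
DOY_UA_PATTERN = re.compile(r"\bDoy(?:Bot)?/\d+(?:\.\d+)*")


def looks_like_doy(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a Doy product token."""
    return DOY_UA_PATTERN.search(user_agent or "") is not None
```

User-Agent headers are trivially spoofed, so a match is best treated as one signal to combine with the IP ranges and request-interval fingerprint described above.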

📊 Data Usage

Collected data, primarily raw HTML, plain text, and metadata (title, description, keywords), is stored in the Doy Labs training corpus, called DoyWeb, which is used for training and fine-tuning the Doy language models and for internal research on document understanding and information retrieval. No personal or proprietary data is retained; the bot filters out login pages, payment forms, and explicitly blocked sections.

⚙️ Rate Limiting Policy

Rate limiting doy is recommended because its default crawl rate of 1 request per second can overwhelm smaller servers when multiple concurrent crawls are launched. The policy recommends threshold-based blocking once a single IP exceeds 100 requests per minute, as documented in the Doy Labs rate-limit guide at github.com/doylabs/crawler-policy, to protect server resources while allowing legitimate data collection.
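The per-IP threshold described above can be sketched as a sliding-window counter. This is a minimal in-memory illustration, not the policy's reference implementation: the class name SlidingWindowLimiter is hypothetical, timestamps are passed in explicitly so the logic is easy to test, and a production setup would more likely enforce the limit at the web server or CDN.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within the last window_seconds."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds); return False if over the limit."""
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In use, a request handler would call limiter.allow(client_ip, time.monotonic()) and respond with HTTP 429 when it returns False.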

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.