cyberdog
Bot User-Agent: cyberdog
🤖 Overview
Cyberdog is a web crawler operated by Cyberdog Inc., a private artificial intelligence research company headquartered in San Francisco. The crawler was first publicly documented in July 2023. Its primary purpose is to collect publicly accessible web content for training large language models such as the proprietary CyberDog-2 model. According to the official documentation at cyberdog.ai/crawler, the bot indexes text, images, and associated metadata for use in AI training pipelines and model evaluation.
🌐 Technical Behavior
Cyberdog performs an initial deep crawl followed by daily re-crawls. It uses both HTTP/1.1 and HTTP/2, with a default request rate of approximately 5 requests per second per originating IP address. The crawler uses conditional requests based on If-Modified-Since and ETag values to avoid re-downloading unchanged content. Its IP ranges are allocated from ASN 396982 (Cyberdog, US) and are listed as including IPv4 blocks such as 192.0.2.0/24 and 203.0.113.0/24. Both of these blocks are reserved for documentation under RFC 5737, so they should be treated as placeholders rather than real crawler addresses. Requests originate from multiple data centers across North America and Europe. The crawler sets a custom X-Crawler-ID header carrying a unique identifier for each crawl session. Default crawl depth is limited to 10 hops, and the crawler honors nofollow and noindex directives in HTML meta tags as well as robots.txt Crawl-Delay instructions.
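To illustrate how a site operator might compare a request's source address with the IPv4 blocks listed above, here is a minimal sketch using Python's standard `ipaddress` module. The function name `in_listed_blocks` is hypothetical, and the block list simply repeats the values from this entry.

```python
import ipaddress

# Blocks as listed in this entry; confirm against the operator's published
# ranges before relying on them.
LISTED_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def in_listed_blocks(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the listed IPv4 blocks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed address strings never match.
        return False
    return any(addr in block for block in LISTED_BLOCKS)
```

A source-address match is only one signal; combining it with the User-Agent string gives a more reliable identification than either check alone.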
📋 robots.txt Compliance
According to Cyberdog's publicly posted policy at cyberdog.ai/robots-txt, the bot fully honors robots.txt directives, including Disallow rules and Crawl-Delay instructions. A 2024 study by the University of Cambridge found that Cyberdog demonstrated a 99.2% compliance rate with Disallow rules within the first 24 hours of a robots.txt update, ranking it among the most compliant AI crawlers.
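As a sketch of what such directives could look like, the example below defines a hypothetical robots.txt that targets the `cyberdog` user-agent token and evaluates it with Python's standard `urllib.robotparser`. The paths, the delay value, and the example.com URLs are assumptions for illustration and are not taken from Cyberdog's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; the paths and delay value
# are illustrative only.
ROBOTS_TXT = """\
User-agent: cyberdog
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Evaluate the rules the way a compliant crawler identifying as "cyberdog" would.
allowed_public = parser.can_fetch("cyberdog", "https://example.com/articles/1")
allowed_private = parser.can_fetch("cyberdog", "https://example.com/private/report")
delay = parser.crawl_delay("cyberdog")
```

Under these rules, `/articles/1` is fetchable, anything under `/private/` is not, and the requested delay between requests is 10 seconds.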
🔍 Detection Indicators
The primary User-Agent string is "Mozilla/5.0 (compatible; Cyberdog/1.0; +https://cyberdog.ai/crawler)". Additional variants include "Cyberdog/2.0" for re-crawls and "Cyberdog-Mobile/1.0" for mobile-specific content. Behavioral fingerprints include request bursts of exactly 5 requests followed by a 1-second pause, as documented in the official GitHub repository at github.com/cyberdog/crawler. The bot is also described as setting the X-Robots-Tag header to "noarchive" when encountering a noarchive directive. Because X-Robots-Tag is a response header sent by the web server rather than by a crawler, this most likely means that the bot honors noarchive directives delivered through that header.
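The User-Agent variants listed above can be matched with a single regular expression. The following is a minimal sketch; the function name `classify_user_agent` and the labels it returns are hypothetical, and the pattern assumes that version numbers consist of digits and dots.

```python
import re
from typing import Optional

# Matches "Cyberdog/<version>" and "Cyberdog-Mobile/<version>" anywhere in
# the User-Agent string, case-insensitively.
CYBERDOG_UA = re.compile(
    r"\bCyberdog(?P<mobile>-Mobile)?/(?P<version>\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return 'mobile', 'recrawl', or 'primary' for a Cyberdog UA, else None."""
    match = CYBERDOG_UA.search(user_agent)
    if match is None:
        return None
    if match.group("mobile"):
        return "mobile"
    # Per this entry, the 2.0 token is used for re-crawls.
    if match.group("version").split(".")[0] == "2":
        return "recrawl"
    return "primary"
```

User-Agent strings are trivially spoofed, so a match here indicates a claimed identity rather than a verified one.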
📊 Data Usage
Collected data is used to train Cyberdog's proprietary LLM, CyberDog-2, including fine-tuning and model evaluation, and to improve the company's search and retrieval systems. According to the company's privacy policy, all collected content is stored encrypted at rest in AWS S3 buckets with a retention period of up to 12 months, after which it is securely deleted.
⚙️ Rate Limiting Policy
While Cyberdog is a legitimate crawler, its daily re-crawls can generate significant server load. Rate limiting is recommended to protect resource availability. A typical threshold is 1,000 requests per hour per IP address, after which temporary blocks are applied. At the default rate of approximately 5 requests per second (18,000 per hour), a single IP would reach this threshold in about 200 seconds, so the limit mainly throttles sustained crawling while still letting the crawler fetch fresh content.
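One way to enforce a per-IP hourly threshold like this is a sliding-window counter. The sketch below is an in-memory illustration under that assumption; the class name `SlidingWindowLimiter` is hypothetical, and a production deployment would more likely rely on the rate-limiting features of its web server or CDN.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client IP."""

    def __init__(self, limit: int = 1000, window: float = 3600.0) -> None:
        self.limit = limit
        self.window = window
        # Maps each IP to the timestamps of its accepted requests, oldest first.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record the request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Calling `allow(ip)` once per incoming request returns `False` when that IP has already had `limit` requests accepted within the last `window` seconds, at which point the server could respond with HTTP 429 or apply a temporary block.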
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.