pockey
Bot User-Agent: pockey
Overview
Pockey is a legitimate web crawler operated by Pockey AI, a company that specializes in aggregating publicly available web content for training large language models and generative AI systems. The bot was first documented in early 2024 and is designed to perform systematic, high-volume crawling while adhering to standard web etiquette. Its primary product is a curated dataset pipeline used to improve Pockey AI's proprietary models and research initiatives.
Technical Behavior
Pockey employs a distributed crawling architecture, with IP ranges published on its official website at pockey.ai/ips. The bot issues requests at a default rate of one request per second per IP and can ramp up to five requests per second when a site explicitly allowlists it. It supports HTTP/1.1 and HTTP/2, sends conditional request headers (If-None-Match, carrying a previously received ETag, and If-Modified-Since) to minimize bandwidth, and follows redirects up to three hops. The crawler primarily targets text/html and application/json content types, and it sends an Accept-Language header that prioritizes English-language pages unless configured otherwise. All requests originate from hostnames under the crawler.pockey.ai domain, as confirmed by reverse DNS lookups.
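The reverse DNS claim above can be checked with a forward-confirmed reverse DNS lookup: resolve the client IP to a hostname, confirm the hostname sits under the expected domain, then resolve that hostname back and confirm it returns the same IP. The following is a minimal Python sketch of that check, not an official verification tool. The function name and the injectable lookup parameters are hypothetical, the domain suffix is taken from the description above, and the default forward lookup handles IPv4 only.

```python
import socket

# Assumption: crawler hostnames sit under this domain, per the description above.
CRAWLER_SUFFIX = ".crawler.pockey.ai"


def is_verified_pockey_ip(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes forward-confirmed reverse DNS for the crawler domain.

    `reverse_lookup` and `forward_lookup` are hypothetical hooks so the check
    can be exercised without real DNS; by default the socket module is used.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses).
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        # The leading dot in the suffix rejects look-alikes such as
        # "evilcrawler.pockey.ai" or "crawler.pockey.ai.example.com".
        if not hostname.endswith(CRAWLER_SUFFIX):
            return False
        # Forward confirmation: the hostname must map back to the same IP.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

Checking the PTR record alone is not enough, because whoever controls an IP block can set an arbitrary reverse name; the forward lookup is what ties the hostname to the domain owner.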
robots.txt Compliance
According to Pockey AI's official policy, the crawler fully respects Disallow directives and obeys Crawl-Delay rules when present in robots.txt. Site owners can block Pockey entirely by adding User-agent: PockeyBot followed by Disallow: /. There are no documented violations of robots.txt in any security advisories or community reports, and the company explicitly states that data collection stops immediately when a disallowed path is encountered.
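As an illustration of the blocking rule described above, the sketch below feeds a sample robots.txt to Python's standard `urllib.robotparser` and asks whether a given User-Agent may fetch a URL. The sample file contents, the `example.com` URL, the `OtherBot` agent, and the helper name `is_allowed` are assumptions made for the example. It shows how a compliant parser would read the rule, not how Pockey itself evaluates it.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: block PockeyBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: PockeyBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    pockey_ua = "PockeyBot/1.0 (compatible; Pockey AI; +https://pockey.ai/bot)"
    url = "https://example.com/articles/1"
    print("PockeyBot allowed:", is_allowed(ROBOTS_TXT, pockey_ua, url))
    print("OtherBot allowed:", is_allowed(ROBOTS_TXT, "OtherBot/2.0", url))
```

Keep in mind that robots.txt is advisory. The rule only has an effect if the crawler chooses to honor it, so it is best paired with server-side checks.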
Detection Indicators
The primary User-Agent string is PockeyBot/1.0 (compatible; Pockey AI; +https://pockey.ai/bot). Additional identifying fingerprints include the custom HTTP header X-Pockey-Request: true and a tendency to crawl URLs in lexicographic order based on sitemap submissions. The bot also sends a From header containing a contact email address ([email protected]). No other known User-Agent variants have been observed in the wild.
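The header-based indicators above could be matched in server-side code roughly as follows. This is a sketch under stated assumptions: the function name `pockey_indicators` and the version-tolerant regular expression are hypothetical, and the From header is deliberately left out because the contact address is not legible in the source. Headers are trivially spoofable, so a match should be treated as a hint and not as proof of origin.

```python
import re

# Assumption: future versions keep the "PockeyBot/<version>" product token.
UA_PATTERN = re.compile(r"\bPockeyBot/\d+(?:\.\d+)*", re.IGNORECASE)


def pockey_indicators(headers):
    """Return the names of the documented indicators present in `headers`.

    `headers` is a plain dict of request header names to values; header
    names are compared case-insensitively.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    matched = []
    if UA_PATTERN.search(normalized.get("user-agent", "")):
        matched.append("user-agent")
    if normalized.get("x-pockey-request", "").strip().lower() == "true":
        matched.append("x-pockey-request")
    return matched
```

For example, calling `pockey_indicators` with a request that carries both the documented User-Agent and `X-Pockey-Request: true` returns both indicator names, while an ordinary browser request returns an empty list.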
Data Usage
Collected data is used exclusively for training Pockey AI's language models and internal AI systems. The company asserts that no personally identifiable information is retained beyond what is already publicly available, and all scraped content undergoes anonymization and deduplication before entering training pipelines. Pockey does not sell collected data to third parties; instead, it aggregates content for model improvement and occasional open-source research datasets, as noted in their privacy policy at pockey.ai/privacy.
Rate Limiting Policy
Site operators are advised to rate-limit PockeyBot to a maximum of 10 requests per minute per IP. The bot's default crawl speed is described as conservative and designed to avoid server strain, but at one request per second per IP (60 requests per minute) it exceeds this limit, so the cap throttles the crawler below its default pace. The rationale for threshold-based limiting is to prevent accidental overload on shared hosting environments while still permitting the bot to index content effectively for AI training purposes, a balance recommended by Pockey AI's own operational guidelines.
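One way to enforce a limit of 10 requests per minute per IP is a sliding-window counter keyed by client address. The sketch below is a minimal in-memory illustration, assuming a single-process server. The class name, the defaults, and the injectable clock are hypothetical, and a production deployment would more likely rely on the rate-limiting features of its web server or CDN.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests=10, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        """Record the request and return True, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call `limiter.allow(client_ip)` and answer with HTTP 429 when it returns False. Rejected requests are not recorded, so a client that keeps retrying does not extend its own lockout.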
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.