pockey

Bot User-Agent: pockey

🤖 Overview

Pockey is a legitimate web crawler operated by Pockey AI, a company that specializes in aggregating publicly available web content for training large language models and generative AI systems. The bot was first documented in early 2024 and is designed to perform systematic, high-volume crawling while adhering to standard web etiquette. Its primary product is a curated dataset pipeline used to improve Pockey AI's proprietary models and research initiatives.

🌐 Technical Behavior

Pockey employs a distributed crawling architecture, with its IP ranges published on its official website at pockey.ai/ips. The bot issues requests at a default rate of one request per second per IP, though it can ramp up to five requests per second under explicit allowlisting. It uses HTTP/1.1 and HTTP/2, sends conditional request headers (If-None-Match carrying a previously received ETag, and If-Modified-Since) to minimize bandwidth, and follows redirects up to three hops. The crawler primarily targets text/html and application/json content types, and it sends an Accept-Language header that prioritizes English-language pages unless configured otherwise. All requests originate from hostnames under the crawler.pockey.ai domain, as confirmed by reverse DNS lookups.
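
The reverse DNS claim can be checked from the server side. The following is a minimal Python sketch of a forward-confirmed reverse DNS check, assuming the crawler.pockey.ai suffix stated above is accurate. The function name and the injectable lookup parameters are illustrative, not part of any Pockey tooling.

```python
import socket

# Assumption: this suffix is taken from the text above; confirm it against the
# operator's own documentation before relying on it.
POCKEY_SUFFIX = ".crawler.pockey.ai"


def is_verified_pockey_ip(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a hostname under POCKEY_SUFFIX
    and that hostname resolves back to the same ip."""
    # The lookups are injectable so the logic can be exercised without network access.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (
        lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    )
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.endswith(POCKEY_SUFFIX):
        return False
    try:
        # Forward confirmation: the claimed hostname must map back to the IP.
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward step matters because anyone who controls the reverse zone for their own IP space can set a PTR record to an arbitrary name. Only the forward lookup ties the hostname back to the domain owner.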

📋 robots.txt Compliance

According to Pockey AI's official policy, the crawler fully respects Disallow directives and obeys Crawl-delay rules when present in robots.txt. Site owners can block Pockey entirely by adding User-agent: PockeyBot followed by Disallow: /. There are no documented violations of robots.txt in any security advisories or community reports, and the company explicitly states that data collection stops immediately when a disallowed path is encountered.
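
As an illustration, the two robots.txt bodies below show a full block and a throttled, partially restricted configuration, checked with Python's standard urllib.robotparser. This is a sketch of how a compliant parser would read the rules, not a statement of how PockeyBot itself parses them. The /private/ path, the 10-second delay, and the load_rules helper are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse a robots.txt body from a string and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Blocks PockeyBot from the whole site, as described above.
BLOCK_ENTIRELY = """\
User-agent: PockeyBot
Disallow: /
"""

# Hypothetical alternative: slow the bot down and fence off one directory.
THROTTLE_AND_RESTRICT = """\
User-agent: PockeyBot
Crawl-delay: 10
Disallow: /private/
"""

blocked = load_rules(BLOCK_ENTIRELY)
throttled = load_rules(THROTTLE_AND_RESTRICT)
```

With these rules, blocked.can_fetch("PockeyBot", url) is False for any URL on the site, while throttled.crawl_delay("PockeyBot") returns 10 and only paths under /private/ are refused.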

🔍 Detection Indicators

The primary User-Agent string is PockeyBot/1.0 (compatible; Pockey AI; +https://pockey.ai/bot). Additional identifying fingerprints include the custom HTTP header X-Pockey-Request: true and a tendency to crawl URLs in lexicographic order based on sitemap submissions. The bot also sends a From header containing a contact email address ([email protected]). No other known User-Agent variants have been observed in the wild.
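
A rough Python sketch of matching these header-level indicators against an incoming request follows. The function name is hypothetical, and the pattern deliberately leaves the version number open, which is an assumption since only 1.0 is documented. Both headers are trivially spoofable, so a match is a hint rather than proof of origin.

```python
import re

# Based on the User-Agent string quoted above; accepting versions other than
# 1.0 is an assumption made for forward compatibility.
POCKEY_UA = re.compile(r"\bPockeyBot/\d+(\.\d+)*\b", re.IGNORECASE)


def pockey_signals(headers):
    """Return the documented Pockey indicators found in a header mapping."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []
    if POCKEY_UA.search(normalized.get("user-agent", "")):
        signals.append("user-agent")
    if normalized.get("x-pockey-request", "").strip().lower() == "true":
        signals.append("x-pockey-request")
    return signals
```

A request carrying both the documented User-Agent and X-Pockey-Request: true yields both signal names, and an ordinary browser request yields an empty list.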

📊 Data Usage

Collected data is used exclusively for training Pockey AI's language models and internal AI systems. The company asserts that no personally identifiable information is retained beyond what is already publicly available, and all scraped content undergoes anonymization and deduplication before entering training pipelines. Pockey does not sell collected data to third parties; instead, it aggregates content for model improvement and occasional open-source research datasets, as noted in their privacy policy at pockey.ai/privacy.

⚙️ Rate Limiting Policy

Site operators are advised to rate-limit PockeyBot to a maximum of 10 requests per minute per IP. The bot's default crawl speed is described as conservative and designed to avoid server strain, although the stated default of one request per second per IP (60 per minute) exceeds this cap, so the limit throttles the crawler below its normal pace. The rationale for threshold-based blocking is to prevent accidental overload on shared hosting environments while still permitting the bot to index content for AI training purposes, a balance recommended by Pockey AI's own operational guidelines.
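
One way to enforce such a cap at the application layer is a per-IP sliding window. The Python sketch below assumes the threshold of 10 requests per 60 seconds from the text. The class name and the optional now parameter are illustrative, and a production setup would more likely use the rate-limiting features of the web server or CDN.

```python
from collections import defaultdict, deque
import time


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per window per IP."""

    def __init__(self, max_requests=10, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Record and permit the request if the IP is under its limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit; the caller might answer with HTTP 429
        hits.append(now)
        return True
```

Only accepted requests are recorded, so a client that keeps hammering while blocked does not extend its own lockout. Whether that is the desired behavior is a policy choice.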

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.