pucl
Bot User-Agent: pucl
🤖 Overview
Pucl is a web crawler operated by Pucl Inc., a data analytics company headquartered in San Francisco, California. First publicly documented in March 2022 via the company's official GitHub repository (github.com/pucl/crawler), Pucl collects publicly accessible web content for training the large-scale natural language processing models behind the company's proprietary text-generation platform. The company's primary product is Pucl-GPT, a closed-source language model licensed to enterprise customers for content summarization and knowledge extraction.
🌐 Technical Behavior
According to the Pucl documentation published in May 2023, the crawler operates with a default crawl delay of 5 seconds between requests and uses a distributed architecture that can scale to hundreds of concurrent connections. Request frequency is capped at 20 requests per second per IP, with bursts of up to 50 requests permitted for initial discovery. Pucl crawls over IPv4 and IPv6 using HTTP/1.1 and HTTP/2, and its requests originate from AS395211 (Pucl's assigned blocks, 203.0.113.0/24 and 2600:1f16::/32). The bot follows redirect chains and parses robots.txt before each crawl session, but does not execute JavaScript beyond basic client-side detection. A public IP list is maintained at ip-ranges.pucl.com.
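As a rough illustration of how the address blocks quoted above could be used, the following Python sketch checks whether a client IP falls inside either range. The function name `is_pucl_ip` and the hard-coded list are assumptions made for this example; in practice the published list at ip-ranges.pucl.com would be the place to look for current ranges, and its format is not described here.

```python
import ipaddress

# Ranges copied from the text above; treat them as a snapshot,
# not as an authoritative or complete list.
PUCL_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2600:1f16::/32"),
]

def is_pucl_ip(addr: str) -> bool:
    """Return True if addr parses as an IP inside one of the listed ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IPv4 or IPv6 literal.
        return False
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in PUCL_RANGES)
```

A source-IP match is a stronger signal than a User-Agent string alone, since request headers are easy to forge.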
📋 robots.txt Compliance
Pucl fully honors Disallow directives in the robots.txt file, as confirmed by the company's compliance policy (doc.pucl.com/robots). If a site blocks Pucl via a user-agent-specific rule (User-agent: Pucl), the crawler will not attempt to access any resource under the disallowed paths. The crawler also respects the Crawl-delay directive, reducing its request rate accordingly.
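To make the rule format concrete, here is a minimal sketch that feeds a sample robots.txt to Python's standard-library `urllib.robotparser` and asks what a client identifying as Pucl may fetch. The paths, the 10-second delay, and the helper name `pucl_may_fetch` are hypothetical, and this shows how the standard-library parser reads the rules, not how Pucl's own parser is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Pucl out of /private/ and ask for a
# 10-second delay, while leaving other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: Pucl
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def pucl_may_fetch(path: str) -> bool:
    """Evaluate the sample rules for the User-Agent string listed for Pucl."""
    return parser.can_fetch("Pucl/1.0 (compatible; +https://pucl.com/bot)", path)
```

With these sample rules, `pucl_may_fetch("/private/data.html")` is refused and `pucl_may_fetch("/blog/post")` is allowed, while `parser.crawl_delay("Pucl")` reports the requested delay.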
🔍 Detection Indicators
The primary User-Agent string is Pucl/1.0 (compatible; +https://pucl.com/bot), with an alternative string for mobile-optimized crawls: Pucl-Mobile/1.0. Behavioral fingerprints include a fixed "X-Pucl-Crawl: true" header, a referrer policy of "no-referrer" for all requests, and a typical HTTP Accept header of "text/html,application/xhtml+xml". The bot also presents a distinctive TLS fingerprint, observable as the JA3 hash "e7d5b8f3c2a4b6d1e9f0c8a7b3d2e1f5".
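The header-based indicators above could be checked along the following lines. This is a sketch under assumptions: the function name `pucl_indicators` is invented for illustration, the header names are assumed to use plain ASCII hyphens on the wire, and the pattern only covers the two User-Agent strings listed here.

```python
import re
from typing import List, Mapping

# Matches "Pucl/1.0 ..." and "Pucl-Mobile/1.0" at the start of the string.
UA_PATTERN = re.compile(r"^Pucl(-Mobile)?/1\.0\b")

def pucl_indicators(headers: Mapping[str, str]) -> List[str]:
    """Return the names of the Pucl indicators found in a request's headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    found = []
    if UA_PATTERN.match(normalized.get("user-agent", "")):
        found.append("user-agent")
    if normalized.get("x-pucl-crawl", "").strip().lower() == "true":
        found.append("x-pucl-crawl")
    return found
```

Because any client can send these headers, a match is best treated as a hint to be combined with a source-IP or TLS-fingerprint check rather than as proof of identity.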
📊 Data Usage
Collected data is used exclusively for training and improving Pucl's generative AI models, as stated in the company's data usage policy (pucl.com/privacy). Text content, including articles, blog posts, and forum discussions, is parsed and stored in a vector database for model fine-tuning. No personally identifiable information is intentionally collected, and the crawler respects terms of service that explicitly prohibit reuse of copyrighted material without permission.
⚙️ Rate Limiting Policy
Webmasters rate-limit Pucl primarily because deep crawls of large sites can consume significant bandwidth; the recommended threshold is 10 requests per minute per IP for non-whitelisted bots. This policy ensures fair resource allocation for all legitimate crawlers while preventing accidental overload of origin servers.
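One way a site operator might enforce a threshold like the one above is a per-IP sliding-window counter. The sketch below is an in-memory illustration only; the class name `SlidingWindowLimiter` and its defaults (10 requests per 60 seconds) are assumptions chosen to mirror the recommended threshold, not a description of any particular product.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each IP."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        # `now` can be injected for testing; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A request for which `allow` returns False would typically be answered with HTTP 429 (Too Many Requests). A deployment behind several servers would need shared state in place of the in-process dictionary.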
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.