polybot
Bot User-Agent: polybot
🤖 Overview
Polybot is a web crawler operated by Poly AI Inc., a company specializing in enterprise conversational AI and knowledge retrieval systems. According to Poly’s official documentation (poly.ai/crawler), the bot is designed to index publicly accessible web content—such as product pages, support articles, and technical documentation—to train and improve the company’s proprietary natural language understanding models used in customer‑service chatbots and internal knowledge‑base tools. The crawler was first publicly documented in April 2023 and is fully distinct from other commercial bots like GPTBot or ClaudeBot.
🌐 Technical Behavior
Polybot performs HTTP/1.1 and HTTP/2 requests from IP ranges registered to Amazon Web Services (ASN 16509), with a default crawl frequency of 5 requests per second per domain, as confirmed by Poly’s technical white paper on crawling ethics (version 1.2). The bot uses a breadth-first traversal algorithm, visiting pages closest to its entry point first and following links up to 4 levels deep, and it respects the Cache-Control header when it is set to no-cache. Requests are sent with a default Accept-Encoding: gzip, deflate, br header and a From header containing a contact email address. Polybot also supports the If-Modified-Since header to reduce bandwidth usage on unchanged resources. All traffic is sent over HTTPS only, and the bot does not execute JavaScript or parse rendered content.
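The breadth-first traversal with a depth limit described above can be illustrated with a minimal sketch. This is not Polybot's actual implementation: the function name `breadth_first_order`, the in-memory `links` graph, and the example paths are all hypothetical, and no network requests are made.

```python
from collections import deque
from typing import Dict, List


def breadth_first_order(
    links: Dict[str, List[str]], start: str, max_depth: int = 4
) -> List[str]:
    """Return pages in breadth-first order, following links at most max_depth levels.

    `links` maps each page to the pages it links to (a hypothetical site graph).
    """
    visited = {start}
    order: List[str] = []
    queue = deque([(start, 0)])  # (page, link depth from the start page)
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site graph: shallow pages are visited before deeper ones.
site = {
    "/": ["/products", "/support"],
    "/products": ["/products/a"],
    "/support": ["/support/faq"],
}
print(breadth_first_order(site, "/"))
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is why a breadth-first crawler tends to reach top-level sections early.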
📋 robots.txt Compliance
Polybot fully honors robots.txt directives, as documented on its official crawler policy page (poly.ai/crawler/robots). It fetches the robots.txt file for each domain before crawling and caches the parsed rules for 24 hours. The bot respects both path-level Disallow rules and user-agent-specific directives, and it stops crawling immediately if the robots.txt URL returns a 404 or 410. This is stricter than RFC 9309, under which a 4xx response for robots.txt permits a crawler to access any resource. Allow directives, such as one for /public paths, are also supported.
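For site owners who want to see how such rules are interpreted, the following sketch uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt. The rule set, the `example.com` URLs, and the helper name `build_parser` are assumptions for illustration; the result shows how Python's parser reads the rules, which may differ in edge cases from how Polybot itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: polybot may fetch /public but nothing else;
# all other crawlers are unrestricted.
ROBOTS_TXT = """\
User-agent: polybot
Allow: /public
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("Polybot/1.0", "https://example.com/public/docs"))
print(parser.can_fetch("Polybot/1.0", "https://example.com/private/report"))
print(parser.can_fetch("SomeOtherBot/1.0", "https://example.com/private/report"))
```

The Allow line is placed before the broader Disallow because Python's parser applies the first matching rule in file order; parsers that follow RFC 9309 use the longest matching path instead, so this ordering is safe for both.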
🔍 Detection Indicators
The primary User‑Agent string is Polybot/1.0, with variants like Polybot-Enterprise/2.0 for premium tiers. Secondary identifying headers include X‑Crawler‑ID: polybot and a unique From header showing the contact email. Behavioral fingerprints include a consistent request interval of 200 milliseconds between pages and a preference for text/html over application/pdf content types. Polybot also sets a custom Accept‑Language header of en‑US,en;q=0.9.
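The header-based indicators above could be checked on the server side roughly as follows. This is a sketch under stated assumptions: the function name `looks_like_polybot` and the regular expression are hypothetical, and the pattern only covers the User-Agent forms named in the text (Polybot/1.0 and hyphenated variants such as Polybot-Enterprise/2.0).

```python
import re
from typing import Mapping

# Matches "Polybot/1.0" and hyphenated variants such as "Polybot-Enterprise/2.0".
# The exact pattern is an assumption based on the strings listed above.
POLYBOT_UA = re.compile(r"\bPolybot(?:-[A-Za-z]+)?/\d+(?:\.\d+)*", re.IGNORECASE)


def looks_like_polybot(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers carry a Polybot indicator."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    if POLYBOT_UA.search(normalized.get("user-agent", "")):
        return True
    # Secondary indicator: the X-Crawler-ID header.
    return normalized.get("x-crawler-id", "").strip().lower() == "polybot"


print(looks_like_polybot({"User-Agent": "Polybot-Enterprise/2.0"}))
print(looks_like_polybot({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}))
```

Request headers are trivially spoofed, so a match is best treated as a hint and combined with other signals, such as the source IP range or the request interval.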
📊 Data Usage
Collected data is used exclusively for training Poly’s AI models, specifically for improving question‑answering accuracy, summarization, and retrieval‑augmented generation (RAG) pipelines in enterprise settings. The company publishes a transparency report (poly.ai/transparency) detailing that no personally identifiable information (PII) is intentionally stored and that all raw data is anonymized before training. Polybot does not sell or share the indexed content with third parties.
⚙️ Rate Limiting Policy
Polybot's request rate is limited to prevent server overload, since excessive parallel requests can degrade site performance. Administrators who need tighter control are advised to throttle the bot to a maximum of 10 requests per second per IP. Threshold-based throttling of this kind preserves system stability while still allowing the bot to gather training data at a responsible pace.
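One way to enforce a threshold such as 10 requests per second per IP is a sliding-window counter. The sketch below is a hypothetical illustration, not a Boteraser or Poly feature: the class name `PerIpRateLimiter`, its parameters, and the injectable `clock` are all assumptions, and a production setup would more likely apply the limit at the web server or proxy layer.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per window_seconds per IP."""

    def __init__(
        self,
        max_requests: int = 10,
        window_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject (e.g. respond with HTTP 429)
        hits.append(now)
        return True


limiter = PerIpRateLimiter()
results = [limiter.allow("203.0.113.7") for _ in range(12)]
print(results.count(True), "allowed,", results.count(False), "rejected")
```

Rejected requests are not recorded in the window, so a client that keeps retrying is not penalized beyond the limit itself. The sketch never evicts idle IPs, which a long-running service would need to handle.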
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.