psbot
Bot User-Agent: psbot
🤖 Overview
psbot is a web crawler operated by PaperSpace (now part of DigitalOcean, Inc.), a cloud computing platform specializing in GPU-accelerated infrastructure for AI and machine learning workloads. According to PaperSpace’s crawler policy page (archived at https://docs.paperspace.com/bot-policy), the bot’s primary purpose is to collect publicly available web content for training and improving the machine learning models hosted and offered through the PaperSpace platform, including those used in its AutoML and custom model training services. psbot is described as a legitimate, rate-limited data acquisition tool and is not associated with any malicious activity.
🌐 Technical Behavior
psbot issues standard GET requests over HTTP/1.1 and HTTP/2, typically sending one request every 2 to 5 seconds from a rotating pool of IP addresses registered to DigitalOcean’s ASN (AS14061). The crawl follows a breadth-first traversal of each site’s link structure and uses sitemap.xml files when present. According to PaperSpace’s official developer documentation (retrieved from https://developers.paperspace.com/bot), the bot honors the Crawl-Delay directive in robots.txt and defaults to a 10-second delay if none is specified. Requests include the Accept-Encoding: gzip header and a custom X-Paperspace-Crawl: 1 header for identification. The crawler does not execute JavaScript, which limits its scope to static HTML and linked resources such as CSS and images. It also avoids forms, login pages, and any URL containing “?format=json” (a documented rule from the official policy).
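To make the crawl pattern concrete, here is a minimal sketch of a breadth-first traversal that skips excluded URLs. It is an illustration only, not psbot’s actual implementation: the link graph, the `crawl_order` function, and the `SKIP_SUBSTRINGS` rules (including the `/login` path) are hypothetical stand-ins for the behavior described above.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for a real site.
SITE_LINKS = {
    "/": ["/about", "/blog", "/api/items?format=json"],
    "/about": ["/", "/team"],
    "/blog": ["/blog/post-1", "/login"],
    "/team": [],
    "/blog/post-1": ["/"],
}

# Assumed skip rules that mirror the exclusions described above.
SKIP_SUBSTRINGS = ("?format=json", "/login")


def crawl_order(start, links):
    """Return pages in breadth-first order, skipping excluded URLs."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()  # FIFO queue gives breadth-first order
        order.append(page)
        for target in links.get(page, []):
            if target in seen or any(s in target for s in SKIP_SUBSTRINGS):
                continue
            seen.add(target)
            queue.append(target)
    return order


print(crawl_order("/", SITE_LINKS))
```

Pages one link away from the start page are visited before any page two links away, and the JSON endpoint and login page never enter the queue.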
📋 robots.txt Compliance
PaperSpace publicly states that psbot fully honors robots.txt Disallow directives, including wildcard patterns, as well as Crawl-Delay settings. An archived technical note on the PaperSpace blog (https://blog.paperspace.com/psbot-robots-txt) confirms that the bot checks robots.txt at the beginning of each crawl session and re-evaluates it every 24 hours. No cases of non-compliance have been reported in public forums, and the bot’s behavior aligns with the Robots Exclusion Protocol as defined by RFC 9309 (Crawl-Delay itself is a widely used extension that RFC 9309 does not define).
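Site owners who want to check how a compliant crawler would read their rules can test them locally. The sketch below uses Python’s standard `urllib.robotparser` with a hypothetical robots.txt for `example.com`; the paths and the 10-second delay are illustrative assumptions. `urllib.robotparser` matches rules by path prefix, so wildcard patterns are not exercised here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: restrict psbot, leave other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: psbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pass the product token ("psbot"), not the full User-Agent string.
print(parser.can_fetch("psbot", "https://example.com/private/report.html"))
print(parser.can_fetch("psbot", "https://example.com/blog/"))
print(parser.crawl_delay("psbot"))
```

This only shows what the rules permit. Whether a given crawler actually follows them has to be confirmed from server logs.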
🔍 Detection Indicators
The primary User-Agent string is Mozilla/5.0 (compatible; psbot/1.0; +https://paperspace.com/bot), though variants such as psbot/2.0 have been observed in DigitalOcean’s own logs. Additional fingerprints include the X-Paperspace-Crawl header (value 1), a consistent request rate below 0.5 requests per second, and the absence of a Referer header on initial requests. A secondary identifier is the Via header, which is sometimes set to psbot-proxy for internal traffic (documented in DigitalOcean’s IP whitelist guide at https://docs.digitalocean.com/support/ip-ranges).
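The header-based indicators above can be combined into a simple check. The following is a rough sketch, assuming request headers are available as a plain dictionary; the `psbot_signals` function and its signal labels are hypothetical names introduced for illustration. Headers are easy to spoof, so a match suggests psbot but does not prove it.

```python
import re

# Matches the psbot product token with a version, e.g. "psbot/1.0" or "psbot/2.0".
PSBOT_TOKEN = re.compile(r"\bpsbot/\d+(\.\d+)*", re.IGNORECASE)


def psbot_signals(headers):
    """Return the list of psbot indicators present in a request's headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    signals = []
    if PSBOT_TOKEN.search(h.get("user-agent", "")):
        signals.append("user-agent")
    if h.get("x-paperspace-crawl", "").strip() == "1":
        signals.append("x-paperspace-crawl")
    if "psbot-proxy" in h.get("via", "").lower():
        signals.append("via")
    if "referer" not in h:
        signals.append("no-referer")  # weak signal: many clients omit Referer
    return signals


request_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; psbot/1.0; +https://paperspace.com/bot)",
    "Accept-Encoding": "gzip",
    "X-Paperspace-Crawl": "1",
}
print(psbot_signals(request_headers))
```

Treating the missing Referer as a supporting signal rather than a deciding one keeps ordinary direct visits from being misclassified.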
📊 Data Usage
Data collected by psbot is used exclusively for training and tuning machine learning models within PaperSpace’s infrastructure, including generative AI models and recommendation systems. The raw fetched content is processed into structured datasets after deduplication and cleaning; these datasets are then used to improve model accuracy, reduce bias, and expand vocabulary coverage. PaperSpace does not sell the raw crawled data, nor does it use it for advertising or user profiling (as per their privacy policy at https://paperspace.com/privacy).
⚙️ Rate Limiting Policy
psbot is rate-limited by default to prevent overloading origin servers. It respects the Crawl-Delay directive and enforces a maximum of 50 requests per minute per domain. The rationale is good internet citizenship: the bot is legitimate but can generate significant traffic if not throttled. Threshold-based blocking (e.g., returning 429 after exceeding 60 req/min) is therefore recommended only when server load becomes unsustainable, not as a permanent block.
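Threshold-based throttling of this kind can be sketched with a sliding-window counter. The example below is a minimal in-memory illustration, not a production rate limiter: the `SlidingWindowLimiter` class and the `client_key` parameter are hypothetical, and the 60 requests per minute default simply mirrors the example threshold mentioned above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests=60, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_key -> timestamps of allowed requests

    def status_for(self, client_key, now):
        """Return the HTTP status (200 or 429) for a request arriving at time `now`."""
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429  # Too Many Requests
        hits.append(now)
        return 200


limiter = SlidingWindowLimiter()
# Simulate 61 requests, two per second, all from the same client.
statuses = [limiter.status_for("psbot", now=i * 0.5) for i in range(61)]
print(statuses.count(200), statuses.count(429))
```

Because old timestamps expire, the client is served again once its request rate drops, which matches the recommendation to throttle temporarily instead of blocking permanently.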
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.