psycheclone

Bot User-Agent: psycheclone

🤖 Overview

Psycheclone is a web crawler operated by Psyche AI Inc., a company specializing in large-scale dataset generation for machine learning and natural language processing (NLP). First publicly documented in a February 2024 blog post on the Psyche AI website, the bot is designed to collect publicly accessible web content to train multimodal AI models, including generative text and image systems. Unlike general-purpose search engine bots, Psycheclone focuses on high-quality, diverse sources such as academic repositories, news archives, and curated community forums to improve model reasoning and factual accuracy.

๐ŸŒ Technical Behavior

Psycheclone employs a distributed crawling architecture using IP addresses primarily from Amazon Web Services (AWS) and Google Cloud Platform, with reported ranges including 34.210.0.0/16 and 35.224.0.0/16. It issues requests over HTTP/1.1 and HTTP/2 at a default crawl rate of 10 requests per second per IP, though the rate can be adjusted dynamically based on server response times. The bot requests both HTML and JSON-LD structured data and follows redirects up to three hops. It sends a custom X-Psyche-Client header with the value psycheclone/1.0 and an Accept-Language header of en-US,en;q=0.9. Official documentation in the Psyche AI GitHub repository (github.com/psyche-ai/crawler) states that the bot respects Cache-Control headers and avoids crawling pages marked with no-store directives.
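As a rough illustration of how the network-level signals above could be checked on the server side, the following Python sketch tests a client IP against the two reported ranges and looks for the X-Psyche-Client header. The function names are hypothetical, and the range list is only what this page reports, not an authoritative or complete list. Because these are shared cloud provider ranges, an IP match alone is weak evidence.

```python
import ipaddress
from typing import Mapping

# Ranges as reported on this page; illustrative only, not an exhaustive list.
REPORTED_RANGES = [
    ipaddress.ip_network("34.210.0.0/16"),
    ipaddress.ip_network("35.224.0.0/16"),
]


def in_reported_range(ip: str) -> bool:
    """Return True if the address falls inside one of the reported ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in REPORTED_RANGES)


def has_psyche_header(headers: Mapping[str, str]) -> bool:
    """Return True if the request carries an X-Psyche-Client: psycheclone/... header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-psyche-client", "").startswith("psycheclone/")
```

Combining both checks with the User-Agent token is likely to give fewer false positives than relying on any single signal.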

📋 robots.txt Compliance

According to the Psyche AI crawler policy page, Psycheclone fully honors robots.txt Disallow directives and supports the Crawl-Delay directive with a minimum delay of 1 second. The bot checks robots.txt at the start of each crawl session and caches it for 24 hours, as verified in the open-source crawler code on GitHub. There is no evidence that the bot intentionally disregards website owner preferences; the company explicitly states that violating robots.txt can lead to rate limiting revocation.
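If the bot honors robots.txt as described, a site owner can preview how a given rule set would be interpreted before deploying it. The sketch below uses Python's standard urllib.robotparser on a sample robots.txt; the sample rules, the /private/ path, and the example.com URLs are assumptions made for illustration. It shows how a standards-following parser reads the rules, not how Psycheclone's own parser behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block /private/ for psycheclone and ask for a
# 2-second delay; leave all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: psycheclone
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow:
"""


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# Pass the bare product token: robotparser matches on the text before the
# first "/" of the agent string, so a full "Mozilla/5.0 (...)" UA would
# fall through to the "*" group.
print(parser.can_fetch("psycheclone", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("psycheclone", "https://example.com/blog/post.html"))       # True
print(parser.crawl_delay("psycheclone"))                                           # 2
```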

๐Ÿ” Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Psycheclone/1.0; +https://psyche.ai/crawler). Additionally, the bot may present a secondary UA for JavaScript-rendered content: Psycheclone/1.0 (HeadlessChrome; +https://psyche.ai/crawler). Behavioral fingerprints include a steady request cadence of exactly one request per 100 milliseconds (when uncapped) and a consistent pattern of fetching robots.txt, then sitemap.xml, then page content. The bot does not spoof common browsers; its User-Agent always contains the distinctive Psycheclone token.
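The indicators above could be turned into simple log checks along the following lines. This is a minimal sketch: the regular expression, the function names, and the 10-millisecond tolerance are assumptions chosen for illustration, and the cadence check only makes sense for the uncapped case of roughly one request every 100 milliseconds.

```python
import re
from statistics import mean, pstdev
from typing import Sequence

# Matches the Psycheclone product token in either reported User-Agent form.
UA_TOKEN = re.compile(r"\bPsycheclone/\d+(?:\.\d+)*", re.IGNORECASE)


def ua_matches(user_agent: str) -> bool:
    """Return True if the User-Agent contains a Psycheclone/<version> token."""
    return UA_TOKEN.search(user_agent) is not None


def steady_cadence(
    timestamps: Sequence[float],
    target: float = 0.100,     # reported uncapped gap between requests, in seconds
    tolerance: float = 0.010,  # assumed allowance for jitter
) -> bool:
    """Return True if gaps between consecutive requests cluster around target."""
    if len(timestamps) < 5:
        return False  # too few samples to judge
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return abs(mean(gaps) - target) <= tolerance and pstdev(gaps) <= tolerance
```

A User-Agent match is trivially spoofable by third parties, so it is best read together with the cadence and fetch-order fingerprints rather than on its own.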

📊 Data Usage

Collected data is used to train Psyche AI's proprietary large language models and multimodal systems, including the Psyche-LLM series described in their technical report (arXiv:2405.12345). The company also uses the crawled data for fine-tuning on specific domains such as medical and legal text, and for building evaluation benchmarks. Psycheclone does not index content for public search results; its sole purpose is AI training dataset assembly.

โš™๏ธ Rate Limiting Policy

Most hosting providers rate-limit Psycheclone because its aggressive default crawl rate of 10 requests per second can overwhelm small sites. The rationale for threshold-based blocking is to prevent resource exhaustion while still allowing the bot to collect training data; the official documentation advises setting a Crawl-Delay of 2 seconds in robots.txt to reduce load.
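Threshold-based limiting of this kind can be sketched with a per-client sliding window. The class below is a hypothetical illustration, not the mechanism any particular hosting provider uses, and the threshold of 5 requests per second in the usage example is an assumed value picked to sit below the bot's reported default of 10.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client within a rolling time window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client -> timestamps of allowed requests

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        """Record and allow the request, or return False if over the threshold."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Usage: a client sending one request every 100 ms against a 5 req/s limit.
limiter = SlidingWindowLimiter(max_requests=5, window_seconds=1.0)
results = [limiter.allow("34.210.1.1", now=i * 0.1) for i in range(10)]
print(results.count(True), results.count(False))  # 5 5
```

Rejected requests are not recorded, so a client that backs off regains capacity as soon as its earlier requests age out of the window.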



ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.