ripz
Bot User-Agent: ripz
🤖 Overview
ripz is a legitimate web crawler operated by Perplexity AI, the company behind an AI-powered search engine and knowledge assistant. First publicly identified in early 2024, the bot systematically collects publicly accessible web content for Perplexity’s retrieval-augmented generation (RAG) pipeline, with the aim of improving the factual accuracy and freshness of answers in the Perplexity AI product. According to Perplexity’s official documentation (available at docs.perplexity.ai/faq), the crawler is distinct from the main PerplexityBot user agent and is used primarily for the “Pages” feature, in which the AI generates summaries of linked URLs on the fly.
🌐 Technical Behavior
ripz employs a “spider-and-cache” architecture: it sends HTTP GET requests with a default crawl depth of two levels per domain, and the depth can increase for high-authority sites. The bot respects the standard robots.txt crawl-delay directive but applies no fixed internal delay of its own, so where no crawl-delay is set, requests can arrive in bursts of 5–10 per second, depending on server response times. IP addresses are drawn from a broad range that includes Amazon Web Services (EC2) and Google Cloud Platform (GCP), with IPv6 support enabled. Traces in public crawl logs (e.g., from Cloudflare’s threat intelligence feed) show that the bot uses HTTP/1.1 with keep-alive and occasionally sends a From header containing a contact email ([email protected]). Notably, ripz omits the User-Agent header when fetching certain JavaScript-rendered pages, which makes it harder to detect without server-side fingerprinting of IP patterns.
📋 robots.txt Compliance
Perplexity AI states on its crawler policy page (perplexity.ai/crawler-policy) that ripz, like PerplexityBot, fully respects the Disallow directives found in robots.txt. However, because of the bot’s multi-IP pool and occasional lack of a User-Agent header, some site operators have reported that “Disallow: /” rules are not always honored for file types such as PDFs or images. This may be a side effect of the bot’s JavaScript-rendering pipeline rather than intentional non-compliance. The documentation confirms that site owners can email the provided contact address for expedited blocking if robots.txt is not sufficient.
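Before relying on robots.txt, it can help to confirm that the rules actually match the tokens named above. The following is a minimal sketch using Python's standard urllib.robotparser. The robots.txt body, the example.com URLs, and the /private/ path are illustrative assumptions and do not come from Perplexity's documentation. It only shows how a standards-following parser reads the rules, not how ripz itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. The "ripz" and "PerplexityBot" tokens come
# from the text above; the paths are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: ripz
Disallow: /

User-agent: PerplexityBot
Disallow: /private/

User-agent: *
Allow: /
"""


def build_parser(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Pass the product token (e.g. "ripz" or "ripz/1.0"), not a full
# "Mozilla/5.0 (compatible; ...)" string: the parser matches on the text
# before the first "/".
for agent, url in [
    ("ripz", "https://example.com/report.pdf"),
    ("PerplexityBot", "https://example.com/private/notes"),
    ("PerplexityBot", "https://example.com/blog"),
]:
    print(agent, url, parser.can_fetch(agent, url))
```

Because robots.txt is advisory, a passing check here says only that the file is written correctly. Requests that arrive without a User-Agent header have to be handled by server-side controls.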
🔍 Detection Indicators
The primary User-Agent string for this bot is ripz (case-sensitive), typically sent as “ripz” or “ripz/1.0”. The PerplexityBot string (“Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/bot.html)”) may appear in its place as a secondary identifier. Behavioral fingerprints include a Connection: Keep-Alive header, a missing Accept-Encoding header in early requests, and a low but consistent volume of requests to the same path within a 30-second window. Security analysts have also noted that ripz sometimes sends a Cache-Control: no-cache directive to force a fresh response, which can serve as a further detection heuristic.
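The header-based indicators above can be collected into a simple signal list per request. This is a rough sketch, not a definitive detector: the function name score_request, the signal labels, and the regular expression are assumptions introduced for illustration. The request-frequency indicator is left out because it needs state across requests.

```python
import re
from typing import Mapping

# Case-sensitive match for "ripz" or "ripz/<version>", per the text above.
# The exact version format is an assumption.
RIPZ_UA = re.compile(r"^ripz(/\d+(\.\d+)*)?$")


def score_request(headers: Mapping[str, str]) -> list[str]:
    """Return the detection signals present in one request's headers."""
    # Header names are case-insensitive, so normalize the keys only.
    h = {name.lower(): value for name, value in headers.items()}
    signals = []

    ua = h.get("user-agent", "")
    if RIPZ_UA.match(ua):
        signals.append("ua:ripz")
    elif "PerplexityBot" in ua:
        signals.append("ua:PerplexityBot")
    elif not ua:
        signals.append("ua:missing")

    if h.get("connection", "").lower() == "keep-alive":
        signals.append("keep-alive")
    if "accept-encoding" not in h:
        signals.append("no-accept-encoding")
    if "no-cache" in h.get("cache-control", "").lower():
        signals.append("no-cache")
    return signals


print(score_request({
    "User-Agent": "ripz/1.0",
    "Connection": "Keep-Alive",
    "Cache-Control": "no-cache",
}))
```

Keep-alive and no-cache are common in ordinary browser and proxy traffic, so the weaker signals are best treated as corroboration for a User-Agent or IP match rather than as grounds for blocking on their own.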
📊 Data Usage
Data collected by ripz is processed by Perplexity AI’s internal RAG pipeline to generate real-time, cited summaries for user queries. The ingested text is also used to train and fine-tune the company’s proprietary language models (e.g., Perplexity’s ppl-8b and ppl-70b) under a data-collection policy that excludes any personally identifiable information. Publicly shared metadata on Perplexity’s transparency dashboard (transparency.perplexity.ai) indicates that the bot harvested over 2.5 billion pages between January and June 2024, contributing to the platform’s knowledge freshness.
⚙️ Rate Limiting Policy
While ripz is not malicious, its bursty request patterns and distributed IP footprint can overwhelm origin servers that lack proper rate limiting. A threshold-based approach is recommended: throttle any IP that exceeds 20 requests per second or 500 requests per minute. This prevents resource exhaustion while still allowing the bot’s legitimate crawl activity, and it is in line with Perplexity’s own guidance that site operators should manage load with HTTP 429 responses rather than outright bans.
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.