penthesila
Bot User-Agent: penthesila
🤖 Overview
Penthesila is a web crawler operated by Penthesila Inc., a data analytics company specializing in AI training dataset construction. Its primary purpose is to collect publicly available web content to feed into proprietary large language models and natural language processing pipelines. The crawler was first mentioned in a community forum post in early 2024 and is associated with the Penthesila Data Platform, though official documentation remains sparse.
🌐 Technical Behavior
Penthesila crawls websites over HTTP/1.1 and HTTP/2, sending roughly one request every 2–5 seconds per domain. It rotates through a pool of residential and cloud IP addresses, primarily in the 185.0.0.0/8 and 45.0.0.0/8 ranges. It keeps connections open when Connection: keep-alive is in effect, requests text/html and application/pdf content types, and occasionally sends Accept-Encoding: gzip. The bot follows internal redirects but does not execute JavaScript, so it only sees static HTML and linked documents.
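As a rough illustration of how the reported IP ranges could be checked in a log-analysis script, here is a minimal sketch using Python's standard `ipaddress` module. The function name `in_reported_ranges` is hypothetical. Each /8 range covers about 16.7 million addresses shared by many unrelated hosts, so a match should be treated as a weak hint only, never as grounds for blocking on its own.

```python
import ipaddress

# Ranges quoted in the text above. Each /8 is very broad, so a match
# is only a weak signal and not proof of Penthesila traffic.
REPORTED_RANGES = [
    ipaddress.ip_network("185.0.0.0/8"),
    ipaddress.ip_network("45.0.0.0/8"),
]


def in_reported_ranges(ip: str) -> bool:
    """Return True if `ip` falls inside one of the reported /8 ranges."""
    addr = ipaddress.ip_address(ip)
    # An IPv6 address is never "in" an IPv4 network, so this is simply False.
    return any(addr in net for net in REPORTED_RANGES)
```

In practice this check is probably most useful combined with the User-Agent and header indicators, since the ranges alone would match a large amount of legitimate traffic.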
📋 robots.txt Compliance
Based on observed behavior reported on webmaster forums and a single GitHub issue (https://github.com/penthesila/crawler/issues/4), Penthesila honors Disallow directives in robots.txt approximately 85% of the time. There have been isolated reports of the bot ignoring Crawl-delay instructions, but no systematic non-compliance has been documented in security advisories.
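For site owners who want to opt out, the following sketch builds a robots.txt that disallows the crawler and checks it with Python's standard `urllib.robotparser`. The `Penthesila` product token is taken from the User-Agent strings reported for this bot; the rule set, the `example.com` URLs, and the helper name `is_allowed` are assumptions for illustration. Given the partial compliance reported above, robots.txt should be seen as a request rather than an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Penthesila everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: Penthesila
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`.

    Pass the bot's product token (e.g. "Penthesila"), not the full
    "Mozilla/5.0 (compatible; ...)" string: the stdlib parser only looks
    at the text before the first "/" when matching user agents.
    """
    return parser.can_fetch(user_agent, url)
```

This only verifies that the rules say what was intended; whether the crawler obeys them still has to be confirmed from server logs.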
🔍 Detection Indicators
The primary User-Agent string is Mozilla/5.0 (compatible; Penthesila/1.0; +https://penthesila.com/bot). Alternative strings include Penthesila/2.0 and PenthesilaCrawler/1.1. A distinctive HTTP header X-Crawler-Type: penthesila is sometimes present. Behavioral fingerprinting shows frequent requests to /robots.txt before any page crawl.
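The indicators above can be combined into a simple header check. This is a minimal sketch, not a definitive detector: the function name `looks_like_penthesila` and the regular expression are assumptions built from the User-Agent strings and the X-Crawler-Type header listed in this section. Both values are trivially spoofable, so a match identifies self-declared traffic only.

```python
import re
from typing import Mapping

# Matches the reported forms: Penthesila/1.0, Penthesila/2.0,
# and PenthesilaCrawler/1.1 (case-insensitive, any version number).
UA_PATTERN = re.compile(r"\bPenthesila(?:Crawler)?/\d+(?:\.\d+)*", re.IGNORECASE)


def looks_like_penthesila(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers carry a reported Penthesila marker."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if UA_PATTERN.search(normalized.get("user-agent", "")):
        return True
    # The X-Crawler-Type header is reported as only sometimes present.
    return normalized.get("x-crawler-type", "").strip().lower() == "penthesila"
```

For example, `looks_like_penthesila({"User-Agent": "Mozilla/5.0 (compatible; Penthesila/1.0; +https://penthesila.com/bot)"})` evaluates to `True`, while an ordinary browser User-Agent without the extra header does not match.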
📊 Data Usage
Collected data is used to train Penthesila’s proprietary language models, improve text summarization algorithms, and build domain-specific knowledge bases for enterprise clients. The company states on its website (https://penthesila.com/privacy) that no personally identifiable information is intentionally harvested, but raw page content is stored indefinitely for model retraining.
⚙️ Rate Limiting Policy
Penthesila is rate-limited because even its moderate crawl speed can cause load spikes on smaller shared hosting servers. Community security guides recommend a threshold of 100 requests per hour per IP, with a 24-hour block once an IP exceeds 500 requests. This policy balances the crawler's data collection needs against protecting website performance.
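The thresholds above could be applied with a small in-memory limiter like the sketch below. The text does not say over what period the 500 requests are counted, so this example assumes the same one-hour sliding window as the 100-request threshold. The class name `PenthesilaRateLimiter` and the `"allow"` / `"throttle"` / `"block"` return values are hypothetical, and a real deployment would more likely enforce this at the web server or firewall.

```python
import time
from collections import defaultdict, deque
from typing import Optional

WINDOW_SECONDS = 3600        # one-hour sliding window
THROTTLE_THRESHOLD = 100     # requests per hour per IP (from the text)
BLOCK_THRESHOLD = 500        # assumed to be counted in the same window
BLOCK_SECONDS = 24 * 3600    # 24-hour block (from the text)


class PenthesilaRateLimiter:
    """Per-IP sliding-window limiter: throttle above 100/h, block above 500."""

    def __init__(self) -> None:
        self._hits = defaultdict(deque)   # ip -> request timestamps in window
        self._blocked_until = {}          # ip -> time at which the block ends

    def check(self, ip: str, now: Optional[float] = None) -> str:
        """Record one request from `ip` and return the action to take."""
        now = time.time() if now is None else now
        if self._blocked_until.get(ip, 0.0) > now:
            return "block"
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the one-hour window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        hits.append(now)
        if len(hits) > BLOCK_THRESHOLD:
            self._blocked_until[ip] = now + BLOCK_SECONDS
            hits.clear()
            return "block"
        if len(hits) > THROTTLE_THRESHOLD:
            return "throttle"   # e.g. respond with HTTP 429
        return "allow"
```

Throttled requests still count toward the block threshold here, which is one possible reading of the policy; counting only served requests would make the 24-hour block unreachable.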
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.