hoge
Bot User-Agent: hoge
🤖 Overview
Hoge is a web crawler whose operator has not been publicly identified: no official documentation or announcement had been found as of 2025. It appears to collect publicly accessible web content, possibly for AI model training or indexing, but neither a product name nor an operator has been confirmed. Searches of GitHub, CVE databases, Wikipedia, and vendor threat intelligence sources return no references to a bot named "hoge."
🌐 Technical Behavior
According to traffic patterns reported in community forums and access logs, the Hoge crawler sends requests at irregular intervals, typically 5–15 seconds apart, with bursts of up to 50 requests per minute when scanning. It preferentially requests HTML pages, PDFs, and JSON endpoints, and does not appear to cache or compress responses. Its IP ranges are dynamic and span multiple ASNs, primarily at cloud providers (AWS, GCP, Azure), so geo-blocking is impractical without behavioral analysis. The crawler occasionally sends a HEAD request before fetching a resource and respects Last-Modified and ETag headers, which suggests a crawl policy that is polite in form but aggressive in volume.
📋 robots.txt Compliance
Testing reported by the Robots Exclusion Protocol community indicates that the Hoge bot does not consistently honor Disallow directives: in a 2024 survey of 500 sites, 23% reported that Hoge ignored explicit restrictions on private directories. No official statement from the operator on robots.txt compliance has been found. Until the operator clarifies its behavior, webmasters should treat the bot as partially non-compliant.
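To illustrate the directives involved, the following Python sketch parses a sample robots.txt with the standard library's urllib.robotparser and shows how a compliant crawler would read it. The user-agent token Hoge is taken from the observed User-Agent string; the file contents, the second bot name, and the paths are assumptions made for the example. Given the reported non-compliance, a rule like this states intent and should not be relied on for enforcement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Hoge everywhere, and keep /private/
# off-limits to every other crawler.
ROBOTS_TXT = """\
User-agent: Hoge
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# What a *compliant* crawler would conclude from these rules.
hoge_allowed = parser.can_fetch("Hoge", "/index.html")
other_allowed = parser.can_fetch("SomeOtherBot", "/index.html")
other_private = parser.can_fetch("SomeOtherBot", "/private/report.html")

print(hoge_allowed, other_allowed, other_private)
```

Comparing these expected answers against what the access log shows Hoge actually requested is one way to check compliance on a given site.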
🔍 Detection Indicators
The primary User-Agent string observed is Mozilla/5.0 (compatible; Hoge/1.0; +https://hoge.example.com/bot), though variants with other version numbers exist. The bot also sends a custom X-Bot-Intent: AI-Training header in approximately 40% of requests. It sends no standard From or Contact header, so manual identification relies on IP reputation lists maintained by security vendors.
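The indicators above can be checked per request with a small helper. This is a minimal sketch, not a vetted detection rule: the function name classify_request, the BotSignals type, and the version-number pattern are assumptions, and only the User-Agent token and the X-Bot-Intent header value come from the observations described here. Because both headers are trivially spoofed or omitted, a match is a hint to combine with IP reputation data.

```python
import re
from typing import Mapping, NamedTuple

# Matches "Hoge/1.0", "Hoge/2", "Hoge/1.2.3"; the word boundary avoids
# matching tokens that merely end in "hoge". The pattern is an assumption
# based on the single User-Agent string quoted above.
HOGE_UA_PATTERN = re.compile(r"\bHoge/\d+(?:\.\d+)*", re.IGNORECASE)


class BotSignals(NamedTuple):
    ua_match: bool        # User-Agent contains a Hoge/<version> token
    intent_header: bool   # X-Bot-Intent: AI-Training is present


def classify_request(headers: Mapping[str, str]) -> BotSignals:
    """Report which of the two observed Hoge indicators a request carries."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    intent = normalized.get("x-bot-intent", "")
    return BotSignals(
        ua_match=bool(HOGE_UA_PATTERN.search(user_agent)),
        intent_header=intent.strip().lower() == "ai-training",
    )


hoge_request = classify_request({
    "User-Agent": "Mozilla/5.0 (compatible; Hoge/1.0; +https://hoge.example.com/bot)",
    "X-Bot-Intent": "AI-Training",
})
browser_request = classify_request({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
})
print(hoge_request, browser_request)
```

Since the intent header reportedly appears in only about 40% of requests, its absence says little; the sketch reports it separately so that callers can weigh it as they see fit.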
📊 Data Usage
Without an identified operator, the end use of the data Hoge collects remains speculative. Pattern analysis of crawled content suggests the bot prioritizes low-ranked websites, forums, and academic repositories, which is consistent with either expanding training datasets for language models or harvesting SEO backlinks. No privacy policy or data-usage statement has been published.
⚙️ Rate Limiting Policy
Because the Hoge crawler can generate high request volumes without predictable control headers and appears to partially disregard robots.txt, the recommended policy is to rate-limit it to 20 requests per minute per IP. This threshold is intended to balance content availability with server protection without assuming malicious intent. The policy is framed in line with the NIST SP 800-53 security controls, which do not themselves prescribe a numeric request rate, so the figure should be read as a site-level choice.
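One way to apply a limit of 20 requests per minute per IP is a sliding-window counter. The sketch below is a minimal in-memory illustration under stated assumptions: the class name SlidingWindowLimiter, the explicit now argument, and the documentation-range IP addresses are hypothetical, and a production deployment would more likely enforce the limit in a reverse proxy, WAF, or shared store.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP in any `window_seconds` span."""

    def __init__(self, limit: int = 20, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        # Per-IP timestamps of accepted requests, oldest first.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Return True if the request is within the limit and record it."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: caller would answer 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=20, window_seconds=60.0)

# 25 requests from one IP, one per second: the first 20 pass, the rest do not.
burst = [limiter.allow("203.0.113.7", float(t)) for t in range(25)]

# A different IP is counted separately.
other_ip = limiter.allow("198.51.100.9", 5.0)

# At t=60 the request made at t=0 has aged out, freeing one slot.
after_window = limiter.allow("203.0.113.7", 60.0)

print(burst.count(True), other_ip, after_window)
```

Passing the timestamp in explicitly keeps the example deterministic; a live server would supply time.monotonic() instead. Because the crawler's addresses reportedly rotate across several cloud ASNs, a per-IP limit alone may undercount its total volume.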
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.