GeistHaus-PageFetcher Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: geisthaus-pagefetcher

🤖 Overview

GeistHaus-PageFetcher is a web crawler operated by GeistHaus GmbH, a German artificial intelligence company headquartered in Berlin. First publicly documented in March 2023, the crawler collects publicly accessible web pages to train the company’s proprietary large language models and to power its content generation suite, GeistHaus AI Platform, which offers automated writing, summarization, and translation services. GeistHaus publishes operational details at https://geisthaus.com/crawler and robots.txt guidance in a dedicated GitHub repository (https://github.com/geisthaus/crawler-policy).

🌐 Technical Behavior

GeistHaus-PageFetcher issues HTTP GET and HEAD requests at 5–15 requests per second per source IP, with bursts as high as 30 under low load. It uses HTTP/1.1 and HTTP/2 and fetches HTML, plain text, and structured data formats (JSON, XML) while ignoring binary resources such as audio and video. Requests come from a distributed proxy pool with IPs on Amazon Web Services, Google Cloud, and Hetzner, and the crawler cycles addresses to avoid triggering single‑source throttles. It adheres to the Crawl-delay directive in robots.txt but enforces no minimum delay of its own beyond that. According to GeistHaus’s public crawl policy, the bot sends the If-Modified-Since header so that unchanged pages need not be transferred again.
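On the server side, a conditional request only saves bandwidth if the origin answers it with 304 Not Modified. The following Python sketch shows one way a handler might make that decision. The function name not_modified is hypothetical, and the code assumes last_modified is a timezone-aware datetime for the requested page; it is an illustration, not a description of how any particular server or GeistHaus itself implements this.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def not_modified(if_modified_since: Optional[str], last_modified: datetime) -> bool:
    """Return True when a 304 Not Modified response is appropriate.

    if_modified_since: raw value of the If-Modified-Since request header, if any.
    last_modified: timezone-aware modification time of the page (assumption).
    """
    if not if_modified_since:
        return False
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable header: fall back to a full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    return last_modified.replace(microsecond=0) <= since
```

A handler would call this before rendering the page and return an empty 304 response when it yields True.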

📋 robots.txt Compliance

GeistHaus officially states that GeistHaus-PageFetcher “honours all Disallow directives in robots.txt”. The company maintains a public robots.txt exclusion list at https://geisthaus.com/robots.txt where webmasters can specify paths to block. Independent testing by Cloudflare Bot Management has confirmed that the crawler respects newly updated robots.txt files within 10–20 minutes, as verified through access logs and server analytics.
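To preview how a set of robots.txt rules would apply to this crawler, the rules can be checked locally with Python's standard urllib.robotparser module. In the sketch below, the geisthaus-pagefetcher token is the one listed on this page; the /private/ path, the Crawl-delay value of 10, and the example.com URLs are hypothetical placeholders. The result reflects how Python's parser reads the rules, which is not guaranteed to match the crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /private/ for this crawler and ask for a 10 s delay.
ROBOTS_TXT = """\
User-agent: geisthaus-pagefetcher
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "GeistHaus-PageFetcher/1.0"
private_ok = parser.can_fetch(agent, "https://example.com/private/report.html")
blog_ok = parser.can_fetch(agent, "https://example.com/blog/post.html")
delay = parser.crawl_delay(agent)

print(private_ok, blog_ok, delay)
```

With these rules, the private URL is reported as disallowed, the blog URL as allowed, and the delay as 10 seconds.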

🔍 Detection Indicators

The definitive User‑Agent string is GeistHaus-PageFetcher/1.0 (compatible; GeistHaus; +https://geisthaus.com/crawler). Older versions may lack the version number, while newer variants include a build‑timestamp suffix. The crawler always sends a From header containing [email protected]. Reverse DNS lookups on its source IPs frequently return PTR records ending in .crawler.geisthaus.com. It does not use any common search‑engine bot strings.
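The indicators above can be combined into a simple classifier, sketched here in Python. The function and constant names are hypothetical. The forward-confirmation step (resolving the PTR hostname back to the same IP) is a common hardening technique added here as an assumption; the source only mentions the PTR suffix. Because PTR records are described as frequent rather than guaranteed, an "unverified" result does not by itself prove spoofing.

```python
import re
import socket
from typing import Callable, List

UA_PATTERN = re.compile(r"geisthaus-pagefetcher", re.IGNORECASE)
PTR_SUFFIX = ".crawler.geisthaus.com"


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def classify(
    user_agent: str,
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> str:
    """Return "verified", "unverified", or "other" for a request."""
    if not UA_PATTERN.search(user_agent or ""):
        return "other"
    try:
        host = reverse(ip).rstrip(".").lower()
        # Forward-confirm so a forged PTR record alone is not enough.
        if host.endswith(PTR_SUFFIX) and ip in forward(host):
            return "verified"
    except OSError:
        pass  # no PTR record or resolution failure
    return "unverified"
```

The resolver functions are injectable so the logic can be exercised without live DNS, for example by passing lambdas that return canned hostnames and addresses.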

📊 Data Usage

Collected web pages are parsed, cleaned, and ingested into GeistHaus’s training pipeline for their GPT‑class language models. The data improves the models’ accuracy in generating context‑aware text, performing abstractive summarization, and answering domain‑specific queries. GeistHaus states that raw page content is not shared with third parties; only derived model weights and aggregated statistics are used in their commercial products.

⚙️ Rate Limiting Policy

Because GeistHaus-PageFetcher can generate tens of thousands of requests per day from a distributed set of IPs, site operators commonly rate‑limit it to preserve server stability and prevent resource exhaustion. A typical threshold blocks any source IP that averages more than 20 requests per second over a rolling 30‑second window, that is, more than 600 requests within the window. The GeistHaus documentation itself recommends this limit to avoid unnecessary contention.
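A per-IP rolling-window threshold of this kind can be sketched in Python as follows. The class name SlidingWindowLimiter and its parameters are hypothetical, the state is held in memory for a single process, and the 20 requests per second over 30 seconds is interpreted as a cap of 600 requests per window; a production setup would more likely enforce this in a reverse proxy or WAF.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Per-IP limiter over a rolling time window (in-memory sketch)."""

    def __init__(self, max_rate: float = 20.0, window: float = 30.0) -> None:
        self.window = window
        # 20 req/s averaged over 30 s allows at most 600 requests per window.
        self.max_requests = int(max_rate * window)
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds); return False if over the limit."""
        hits = self._hits[ip]
        # Discard timestamps that have fallen out of the rolling window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a request handler, `now` would typically come from time.monotonic(), and a False result would translate into an HTTP 429 response or a temporary block. Because the crawler rotates source IPs, a per-IP limit may need to be paired with User-Agent based rules.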


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.