pagegrabber

Bot User-Agent: pagegrabber

🤖 Overview

PageGrabber is a legitimate web crawler operated by PageGrabber, LLC, a data analytics firm founded in 2019. It collects structured content from publicly accessible web pages for use in competitive intelligence and market research dashboards. According to the company’s official documentation at docs.pagegrabber.io, the bot indexes text, metadata, and link structures to populate its SaaS product, PageGrabber Insights, which marketing teams use to monitor competitor pricing and content strategies.

🌐 Technical Behavior

PageGrabber uses a distributed crawling architecture: a fleet of headless Chromium instances running on AWS EC2 in the US-East and EU-West regions. Official IP ranges published at pagegrabber.io/ips include 3.80.0.0/16 and 52.44.0.0/14. The bot sends at most 5 requests per second per source IP, a figure verified in network traffic logs from the company’s status page. It uses HTTP/1.1 with persistent connections and honors 304 Not Modified responses, so unchanged content is not downloaded again. The crawler follows Link headers and tags, but executes JavaScript only as far as single-page application navigation requires, per its technical whitepaper at docs.pagegrabber.io/behavior.
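A request's source address can be checked against the two ranges quoted above. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and the ranges are copied from this page rather than fetched from pagegrabber.io/ips, so they should be confirmed against the operator's current list before being relied on.

```python
import ipaddress

# Ranges as quoted in the text above (assumption: still current;
# the operator's published list is the authoritative source).
PAGEGRABBER_RANGES = [
    ipaddress.ip_network("3.80.0.0/16"),
    ipaddress.ip_network("52.44.0.0/14"),
]


def is_published_pagegrabber_ip(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the quoted CIDR ranges."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed addresses never match.
        return False
    return any(addr in network for network in PAGEGRABBER_RANGES)
```

For example, `is_published_pagegrabber_ip("3.80.12.34")` matches the first range, while `is_published_pagegrabber_ip("3.81.0.1")` falls outside both.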

📋 robots.txt Compliance

PageGrabber fully honors robots.txt Disallow directives, as documented in its public policy at pagegrabber.io/robots. The crawler checks /robots.txt before every crawl session and re-fetches it every 24 hours to stay current with site owner preferences. There are no known incidents of PageGrabber ignoring Disallow rules. Site owners can target the crawler in robots.txt by its product token, PageGrabber; the Robots Exclusion Protocol (RFC 9309) matches on that token without the /1.0 version suffix and does not maintain a registry of tokens.
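To see how a Disallow rule aimed at this crawler would be evaluated, the sketch below parses a small robots.txt with Python's standard `urllib.robotparser`. The rule set, the `/pricing/` path, the example.com URLs, and the `allowed_for` helper are all illustrative assumptions; the standard-library parser approximates the protocol and is not PageGrabber's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep PageGrabber out of /pricing/,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: PageGrabber
Disallow: /pricing/

User-agent: *
Disallow:
"""


def allowed_for(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed_for("PageGrabber/1.0", "https://example.com/pricing/plans")` is refused, while the same crawler may fetch `https://example.com/blog/` and other user agents may fetch both.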

🔍 Detection Indicators

PageGrabber identifies itself with the User-Agent string "PageGrabber/1.0 (compatible; +https://pagegrabber.io/bot)". It also sends a custom HTTP header X-PageGrabber-Crawl: true to facilitate recognition. Behavioral fingerprints include a consistent request interval of 200 milliseconds between pages and a tendency to fetch robots.txt before any other resource. The bot never modifies Referer headers and always includes an Accept-Language: en-US header.
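The two header-level indicators above can be combined into a simple check. This is a rough sketch, assuming the request headers are available as a plain dictionary; the function name is hypothetical. Headers are trivially spoofed, so a match only shows that a request claims to be PageGrabber, and it is best paired with a source-IP check.

```python
from typing import Dict

# User-Agent string exactly as quoted in the text above.
EXPECTED_USER_AGENT = "PageGrabber/1.0 (compatible; +https://pagegrabber.io/bot)"


def claims_to_be_pagegrabber(headers: Dict[str, str]) -> bool:
    """Return True if both documented header indicators are present."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    ua_matches = normalized.get("user-agent") == EXPECTED_USER_AGENT
    marker_present = normalized.get("x-pagegrabber-crawl", "").lower() == "true"
    return ua_matches and marker_present
```

Requiring both indicators is a design choice made for this example; a stricter filter could also compare request timing against the 200 millisecond interval described above.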

📊 Data Usage

Collected data feeds the PageGrabber Insights platform, which provides competitive price tracking and content change alerts to subscribers. Per the company’s privacy policy at pagegrabber.io/privacy, raw page text is stored for up to 90 days and then aggregated into trend reports. No personally identifiable information (PII) is intentionally collected, and the bot skips URLs whose paths contain /login or /account even when robots.txt does not exclude them.

⚙️ Rate Limiting Policy

PageGrabber is legitimate but still rate-limited, because its distributed fleet can produce aggregate request volumes that resemble aggressive scraping. Threshold-based blocking (for example, more than 25 requests per second from a single IP) keeps the bot from degrading server performance for human users while letting its lawful crawling proceed under normal conditions. At its documented maximum of 5 requests per second per source IP, a compliant crawler instance stays well below such a threshold.
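Threshold-based blocking of this kind can be sketched as a per-IP sliding-window counter. The class below is an illustrative assumption, not Boteraser's actual implementation: the class name, the one-second window, and the in-memory storage are all hypothetical, and the default of 25 mirrors the example threshold quoted above.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class PerIpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests: int = 25, window_seconds: float = 1.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record a request from ip and return False once the threshold is exceeded."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A client spaced at 200 millisecond intervals never has more than five requests inside a one-second window, so it passes, whereas a 26th request arriving within the same second from one address is rejected.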

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.