DigitalPebble

Bot User-Agent: digitalpebble

🤖 Overview

DigitalPebble is an open-source web crawler framework originally developed by Emmanuel Keller and maintained on GitHub (https://github.com/emmanuel-keller/DigitalPebble). It is designed to build custom search engines and data extraction pipelines, enabling researchers and developers to crawl websites for indexing, analytics, and archival purposes. Unlike commercial bots, it is a configurable tool that users deploy locally or on cloud infrastructure, not a single centralized crawler.

🌐 Technical Behavior

The crawler runs as a multithreaded process over HTTP/1.1 and respects robots.txt directives by default. It supports both breadth-first and depth-first crawl strategies, with a configurable crawl depth limit (default: 2) and a page count cap (commonly 1000 pages per domain). The default request rate is one request per second per domain, which can be changed in the CrawlConfig.xml file.

There is no fixed set of source IPs: the range depends on where each instance is deployed, and typical installations run from cloud provider addresses (AWS, GCP) or university networks. The bot sends the User-Agent header DigitalPebble/1.0 or DigitalPebble/2.0, depending on the version, along with an Accept: text/html,application/xhtml+xml header.
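The depth limit and page cap described above can be illustrated with a small breadth-first traversal. This is only a sketch of the general technique, not DigitalPebble's own code: the function name bfs_crawl_order, its parameters, and the in-memory links map (which stands in for fetching and parsing real pages) are all assumptions made for illustration.

```python
from collections import deque


def bfs_crawl_order(seed, links, max_depth=2, max_pages=1000):
    """Return URLs in breadth-first visit order.

    `links` is a hypothetical map of URL -> outgoing URLs; a real crawler
    would fetch each page and extract its links instead.
    """
    visited = [seed]          # visit order, seed is depth 0
    seen = {seed}             # fast duplicate check
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue          # pages at the depth limit are not expanded
        for target in links.get(url, []):
            if target in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page cap reached
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# Toy site: "/a/1/x" sits at depth 3 and is never reached with max_depth=2.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/b"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}
order = bfs_crawl_order("/", site)
print(order)
```

With the defaults quoted above, a site owner can expect such a crawl to stay within two link hops of the seed URL, which is why deep pages may never appear in logs for a given instance.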

📋 robots.txt Compliance

According to the official documentation on GitHub, DigitalPebble always checks robots.txt before fetching a page, using a built-in RobotRulesParser that implements the Robots Exclusion Protocol (RFC 9309). It respects Disallow rules and also honors Crawl-delay, which is a widely used extension rather than part of RFC 9309; if a Crawl-delay is specified, it overrides the default rate limit. Community reports and the source code indicate that the crawler does not override user-configured exclusions.
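To check how a robots.txt file would be read for this bot, the rules can be evaluated offline. The sketch below uses Python's standard urllib.robotparser as a stand-in; it is not the RobotRulesParser mentioned above, and its matching may differ in edge cases. The sample robots.txt content, the example.com URLs, and the helper name load_rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt targeting the bot's user-agent token.
ROBOTS_TXT = """\
User-agent: digitalpebble
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow: /tmp/
"""


def load_rules(robots_txt):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
AGENT = "DigitalPebble/1.0"

# The standard-library parser compares the part before "/" case-insensitively.
allowed_public = rules.can_fetch(AGENT, "https://example.com/index.html")
allowed_private = rules.can_fetch(AGENT, "https://example.com/private/report.html")
delay = rules.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

Running a check like this before deploying a robots.txt change helps confirm that the group for digitalpebble, rather than the wildcard group, is the one that applies.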

🔍 Detection Indicators

The primary User‑Agent string is DigitalPebble/1.0 (or DigitalPebble/2.0 for newer releases). Requests also include a Referer field set to the seed URL and an Accept-Language of en-US,en;q=0.5. Note that X-Robots-Tag is a response header set by the server, not a request fingerprint: the crawler logs it but does not reliably act on it, so robots.txt remains the dependable way to control its access.
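The User-Agent indicator above can be matched in access logs with a simple pattern. The following is a minimal sketch; the function name match_digitalpebble and the regular expression are assumptions, and because operators can change the header in their own deployment (and any client can spoof it), a match should be treated as a hint rather than proof.

```python
import re

# Matches the bare token "digitalpebble" or a versioned form such as
# "DigitalPebble/2.0", case-insensitively. The version part is optional.
UA_PATTERN = re.compile(r"\bdigitalpebble(?:/(\d+(?:\.\d+)*))?", re.IGNORECASE)


def match_digitalpebble(user_agent):
    """Return (matched, version) for a User-Agent header value."""
    m = UA_PATTERN.search(user_agent or "")
    if not m:
        return (False, None)
    return (True, m.group(1))


for ua in ("DigitalPebble/2.0", "digitalpebble", "Mozilla/5.0 (X11; Linux x86_64)"):
    print(ua, "->", match_digitalpebble(ua))
```

Combining the match with the other request-side indicators, such as the Accept-Language value, reduces false positives from unrelated clients that happen to reuse the name.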

📊 Data Usage

Collected content is stored locally in a Solr or Elasticsearch index for full‑text search, or exported as CSV/JSON for analysis. Typical use cases include building domain‑specific indexes (e.g., academic literature, legal documents), conducting web‑scale research, or training small‑scale machine learning models. Because the crawler is a self-hosted tool rather than a service, the project itself does not sell or share crawled data with third parties; how the data is handled depends on each operator.

⚙️ Rate Limiting Policy

Rate limiting is recommended for DigitalPebble because its flexible configuration can generate aggressive crawl patterns when misconfigured. Administrators should enforce per‑IP thresholds (e.g., 10 requests per second, well above the crawler's default of one request per second) using a WAF or rate limiter, and set a Crawl-delay in robots.txt to slow it down. The bot obeys such directives, so blocking is unnecessary except for abusive instances.
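A per‑IP threshold of the kind suggested above can be sketched as a sliding-window counter. This is an illustrative in-memory example only: the class name SlidingWindowLimiter and its parameters are assumptions, timestamps are passed in explicitly to keep the behavior easy to follow, and a production setup would normally rely on the WAF or reverse proxy instead.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client IP in any `window_seconds` span."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject (e.g. respond with 429)
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=10, window_seconds=1.0)
# Twelve requests from one address within about a tenth of a second.
decisions = [limiter.allow("203.0.113.7", 100.0 + i * 0.01) for i in range(12)]
print(decisions)
```

Only accepted requests are recorded, so a client that keeps sending while blocked is admitted again as soon as its earlier requests age out of the window.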

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.