pd-crawler
Crawler User-Agent: pd-crawler
Overview
pd-crawler is a web crawler operated by PDR Labs (Public Data Retrieval), a research organization based in Cambridge, Massachusetts. Its primary purpose is to collect publicly accessible web content for the PDR Corpus, a dataset used to train large language models and improve natural language understanding systems. The crawler was first documented in a technical report published on the PDR Labs website in July 2023 (https://www.pdrlabs.com/research/crawler).
Technical Behavior
pd-crawler employs a breadth-first crawl strategy, starting from a seed list of high-authority domains such as .edu and .gov sites before branching to linked pages. It sends requests with a random delay of 3–10 seconds between pages, using the IP ranges 192.0.2.0/24 and 198.51.100.0/24. These ranges are described as allocated by ARIN for research use, but both are documentation-only blocks (TEST-NET-1 and TEST-NET-2) reserved by RFC 5737, so they should not be relied on for allowlisting or blocking. The crawler supports HTTP/1.1 and HTTP/2 and always includes an Accept: text/html,application/xhtml+xml header. It respects the Cache-Control header and will not re-fetch content with a max-age directive lower than 3600 seconds. According to PDR Labs' official GitHub repository (https://github.com/pdr-labs/crawler), the crawler avoids resources larger than 10 MB to minimize bandwidth impact.
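To check whether a request's source address falls inside the ranges quoted above, a minimal Python sketch using the standard `ipaddress` module might look like the following. The constant `LISTED_RANGES` and the helper `in_listed_ranges` are hypothetical names introduced for illustration, and the ranges themselves are copied from the text and should be treated as unverified.

```python
import ipaddress

# Ranges quoted in the section above (assumption: unverified, copied as-is).
LISTED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_listed_ranges(addr: str) -> bool:
    """Return True if addr parses as an IP inside one of the listed ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IP address string at all.
        return False
    return any(ip in net for net in LISTED_RANGES)
```

A range match alone is weak evidence, so it is probably best combined with the User-Agent and behavioral indicators rather than used on its own.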
robots.txt Compliance
pd-crawler strictly honors the Robots Exclusion Protocol defined in RFC 9309. It checks robots.txt before each domain access and caches the file for up to 24 hours. PDR Labs' documentation confirms that Disallow directives are fully respected and that blocked pages will not be fetched or stored. Evidence from community tests on the Webmasters Stack Exchange (https://webmasters.stackexchange.com/questions/123456) shows that pd-crawler never attempts to crawl paths listed in Disallow.
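Before deploying a rule, it can help to confirm how a compliant parser would read it. The sketch below uses Python's standard `urllib.robotparser` on an example robots.txt; the `/private/` path and the helper name `pd_crawler_allowed` are illustrative assumptions, not rules taken from PDR Labs' documentation.

```python
from urllib.robotparser import RobotFileParser

# Example rules (assumption: /private/ is a placeholder path).
ROBOTS_TXT = """\
User-agent: pd-crawler
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def pd_crawler_allowed(path: str) -> bool:
    """Return True if these rules let the pd-crawler token fetch path."""
    return parser.can_fetch("pd-crawler", path)
```

Here `pd_crawler_allowed("/private/data.html")` evaluates to False while `pd_crawler_allowed("/public/index.html")` evaluates to True. This only shows how the rules parse; it does not prove that any given crawler obeys them.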
Detection Indicators
The primary User-Agent string is Mozilla/5.0 (compatible; pd-crawler/1.0; +https://www.pdrlabs.com/crawler). It also sends a From: [email protected] header for contact. Behavioral fingerprints include sequential request patterns with no variation in the Referer header and a consistent 5-second delay between consecutive requests to the same host. The crawler does not support cookies or JavaScript.
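One way to flag this User-Agent in server logs is to match its product token, as in the sketch below. The pattern and the function name `match_pd_crawler` are hypothetical, written against the single User-Agent string quoted above. User-Agent headers are trivially spoofed, so a match is a hint rather than proof.

```python
import re

# Matches the product token "pd-crawler", optionally followed by "/<version>".
PD_CRAWLER_RE = re.compile(
    r"\bpd-crawler(?:/(?P<version>[\w.]+))?", re.IGNORECASE
)


def match_pd_crawler(user_agent):
    """Return the version string, "" if no version is given, or None if no match."""
    m = PD_CRAWLER_RE.search(user_agent or "")
    if m is None:
        return None
    return m.group("version") or ""
```

Applied to the string quoted above, the function returns `"1.0"`; for a User-Agent without the token it returns `None`.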
Data Usage
Collected data is stored in the PDR Corpus, a 500 TB dataset used exclusively for training AI models under PDR Labs' non-commercial research license. The corpus is released biannually to academic institutions for NLP research. No data is used for advertising, user profiling, or any commercial indexing services.
Rate Limiting Policy
Although pd-crawler is rate-limited by the operator to 1 request per 3 seconds per domain, web administrators may impose additional throttling if the crawler consumes more than 2% of available bandwidth. The policy rationale is to prevent resource exhaustion while still allowing comprehensive data collection for research purposes.
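An administrator who wants to enforce the same ceiling on the server side could use a minimum-interval limiter. The sketch below is one possible approach, not the operator's mechanism; the class name `MinIntervalLimiter`, the per-key dictionary, and the 3-second default (taken from the figure above) are illustrative assumptions.

```python
import time


class MinIntervalLimiter:
    """Allow at most one request per `interval` seconds for each key."""

    def __init__(self, interval=3.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable so the limiter can be tested
        self._last_allowed = {}     # key -> time of the last allowed request

    def allow(self, key):
        now = self.clock()
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.interval:
            # Too soon: the caller might answer with HTTP 429 here.
            return False
        self._last_allowed[key] = now
        return True
```

The key could be the client IP, the User-Agent token, or a combination. Rejected requests do not reset the timer, so a client that keeps retrying is still admitted once per interval.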
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.