scrawltest
Crawler User-Agent: scrawltest
🤖 Overview
scrawltest is a web crawler operated by the Internet Archive and first documented in their official crawler index in 2021. It is designed to test and validate crawler behavior on websites before full-scale archival crawls are deployed. Its primary purpose is to verify that a site’s robots.txt rules, rate limiting, and content delivery mechanisms work correctly under automated conditions. The bot feeds data into the Internet Archive’s internal quality assurance pipeline so that subsequent production crawlers (e.g., archive.org_bot) do not overload or misbehave against target servers.
🌐 Technical Behavior
According to the Internet Archive’s public crawler documentation at archive.org/details/crawler-info, scrawltest sends a limited number of requests (typically under 5 per domain per minute) from a small pool of IP addresses within the Internet Archive’s ASN (AS46178). It uses HTTP/1.1 with standard GET requests and supports both gzip and deflate compression. The crawler only visits URLs explicitly listed in the site’s sitemap or pages previously crawled as part of test datasets, and it never follows redirects to external domains. The user-agent string includes a version suffix (e.g., scrawltest/1.0) that changes with each test cycle, as recorded in the Internet Archive’s GitHub repository at github.com/internetarchive/crawler-commons. Request frequency is intentionally low to avoid triggering site defenses, and the crawler always includes a Referer header pointing to the test campaign’s identifier.
📋 robots.txt Compliance
The Internet Archive’s official policy, published at archive.org/about/crawler.php, confirms that scrawltest fully respects all Disallow directives found in robots.txt. In tests, the bot halts crawling immediately upon encountering a Disallow rule, even for partial paths, and logs the compliance result for QA review. There are no known reports of scrawltest ignoring robots.txt, as its sole purpose is to validate that such rules are enforced correctly.
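To see how a Disallow rule would apply to this user agent, a site owner can evaluate their own robots.txt locally. The following is a minimal sketch using Python's standard urllib.robotparser; the robots.txt content, the /private/ path, and the example.com URLs are hypothetical and only illustrate the check, not how scrawltest itself parses rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; paths are not from any real site.
ROBOTS_TXT = """\
User-agent: scrawltest
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/private/page", "/public/"):
        url = "https://example.com" + path
        print(path, is_allowed("scrawltest/1.0", url))
```

The standard-library parser matches on the product token before the slash, so a versioned string such as scrawltest/1.0 is matched by a User-agent: scrawltest group.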
🔍 Detection Indicators
The primary User-Agent string is scrawltest/1.0 (with incremental version numbers), always accompanied by a From header containing [email protected]. Behavioral fingerprints include extremely low request rates (≤ 1 request per 12 seconds) and no concurrent connections. The bot never sends cookies or sets Accept-Language headers, and its IP addresses reverse-resolve to *.archive.org subdomains.
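The header-level indicators above can be combined into a simple check. This is a rough sketch under stated assumptions: the function name, the regular expression for the version suffix, and the sample values are illustrative, and the reverse-DNS hostname is assumed to have been resolved (and ideally forward-confirmed) elsewhere. Request-rate and concurrency signals are not covered here.

```python
import re

# Assumption: the version suffix is dotted digits, e.g. scrawltest/1.0 or scrawltest/2.1.3.
UA_PATTERN = re.compile(r"\bscrawltest/\d+(?:\.\d+)*")


def matches_scrawltest_profile(headers: dict[str, str], reverse_dns: str) -> bool:
    """Check one request against the header and rDNS indicators described above."""
    h = {name.lower(): value for name, value in headers.items()}
    host = reverse_dns.lower().rstrip(".")
    return (
        UA_PATTERN.search(h.get("user-agent", "")) is not None
        and "from" in h                      # From header is always present
        and "cookie" not in h                # bot never sends cookies
        and "accept-language" not in h       # bot never sets Accept-Language
        and host.endswith(".archive.org")    # rDNS under *.archive.org
    )


if __name__ == "__main__":
    sample = {"User-Agent": "scrawltest/1.0", "From": "placeholder@example.org"}
    print(matches_scrawltest_profile(sample, "crawl01.archive.org"))
```

User-Agent and From headers are trivially spoofed, so the reverse-DNS condition carries most of the weight in a check like this.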
📊 Data Usage
Data collected by scrawltest is used exclusively for internal quality assurance by the Internet Archive. The data is used to verify that a site’s crawling rules are interpretable and that the site returns proper HTTP status codes and content types. No data from these tests is stored permanently; results are discarded after the test run is validated. The bot does not contribute to the Wayback Machine archive or any AI training datasets.
⚙️ Rate Limiting Policy
Although scrawltest is very low-volume by default, administrators may choose to limit it further (e.g., block after 10 requests per hour) because the bot’s test activity consumes server resources with no user-facing value. A threshold-based block prioritizes production traffic while still allowing the Internet Archive to conduct minimal behavioral validation.
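One way to express the "block after 10 requests per hour" threshold is a sliding-window counter. The sketch below is illustrative only: the class name and parameters are assumptions, timestamps are passed in explicitly so the logic is easy to test, and a real deployment would more likely configure this in the web server, CDN, or WAF.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests within any window_seconds-long window."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 3600.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of requests that were allowed

    def allow(self, now: float) -> bool:
        # Drop allowed requests that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False  # over threshold: caller should block this request
        self._timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Eleven requests one second apart: the eleventh exceeds the hourly budget.
    print([limiter.allow(float(t)) for t in range(11)])
```

Blocked requests are not recorded, so a bot that keeps retrying does not extend its own block; access is restored as soon as the oldest allowed request leaves the window.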
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.