scooter

Bot User-Agent: scooter

🤖 Overview

Scooter is a web crawler operated by the Scooter Project, an open-source academic initiative launched in 2020 to construct a distributed search engine for information retrieval research. Its primary purpose is to index publicly accessible web content to support experiments in ranking algorithms and natural language processing. The collected data feeds into the Scooter search engine prototype, a testbed for novel retrieval techniques.

🌐 Technical Behavior

The crawler follows a polite, rate-limited strategy: by default it waits 10 seconds between requests to the same host, and its request rate is capped at 10 requests per minute per domain. It communicates over HTTP/1.1 and HTTP/2 and originates from AWS us-east-1 IP ranges such as 3.80.0.0/12. To avoid redundant downloads, it revalidates previously fetched pages using ETag and If-Modified-Since headers, and it supports gzip compression. Traversal is breadth-first, starting from a curated seed list of open-access domains.
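As a rough illustration of how the two pacing figures and the revalidation headers fit together, here is a minimal Python sketch. It is not taken from Scooter's code: the function names `effective_interval` and `conditional_headers` are hypothetical, and the use of `If-None-Match` to send a stored ETag back is standard HTTP behavior assumed here rather than something this entry confirms for Scooter.

```python
from typing import Dict, Optional

DEFAULT_CRAWL_DELAY_S = 10.0   # default per-host delay quoted in this entry
MAX_REQUESTS_PER_MINUTE = 10   # per-domain cap quoted in this entry


def effective_interval(crawl_delay_s: float = DEFAULT_CRAWL_DELAY_S,
                       max_per_minute: int = MAX_REQUESTS_PER_MINUTE) -> float:
    """Smallest gap (seconds) between requests that satisfies both limits."""
    return max(crawl_delay_s, 60.0 / max_per_minute)


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> Dict[str, str]:
    """Build revalidation headers from validators saved on an earlier fetch."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        # A stored ETag is sent back in If-None-Match (assumed, standard HTTP).
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

With the default figures, the 10-second delay is the stricter limit (one request every 10 seconds is 6 per minute), so the 10-per-minute cap only matters when a site advertises a shorter delay.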

📋 robots.txt Compliance

According to the official GitHub repository (github.com/scooterproject/scraper), Scooter strictly adheres to the Robots Exclusion Protocol. It reads robots.txt at the start of each crawl, honors all Disallow directives, and adjusts its delay based on Crawl-Delay directives. It also respects X-Robots-Tag HTTP headers.
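Site owners who want to preview how a compliant crawler would read their rules can do so with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the robots.txt content and the example.org URLs are made up, and the parser shown is Python's, not Scooter's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Scooter-specific group and a catch-all group.
ROBOTS_TXT = """\
User-agent: scooter
Disallow: /private/
Crawl-delay: 20

User-agent: *
Disallow:
"""


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_rules(ROBOTS_TXT)
blocked = rules.can_fetch("scooter", "https://example.org/private/data.html")
allowed = rules.can_fetch("scooter", "https://example.org/papers/index.html")
delay = rules.crawl_delay("scooter")
print(blocked, allowed, delay)
```

Note that `urllib.robotparser` covers robots.txt only; X-Robots-Tag is an HTTP response header and has to be set separately on the server.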

🔍 Detection Indicators

The primary User-Agent string is "Scooter/1.0 (compatible; +http://scooterproject.org/bot)". Alternative strings include "ScooterBot/1.0" and "Scooter-Crawler/1.0". A custom X-Scooter-Crawl header carries a version number and session ID. Its source IPs resolve to AWS hostnames, and the bot does not rotate addresses aggressively.
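The indicators above could be checked in server-side code along these lines. This is a minimal sketch under stated assumptions: the helper names `looks_like_scooter` and `in_example_range` are hypothetical, the regular expression assumes the version always follows the product token after a slash, and 3.80.0.0/12 is only the example range quoted in this entry, not a complete list. User-Agent strings are trivially spoofed, so a match is a hint rather than proof.

```python
import ipaddress
import re

# Product tokens listed in this entry, followed by a version such as "/1.0".
SCOOTER_UA = re.compile(r"^(?:Scooter|ScooterBot|Scooter-Crawler)/\d+(?:\.\d+)*\b")

# Example range quoted in this entry; not a complete list of source ranges.
EXAMPLE_RANGE = ipaddress.ip_network("3.80.0.0/12")


def looks_like_scooter(user_agent: str) -> bool:
    """True if the User-Agent starts with one of the listed product tokens."""
    return SCOOTER_UA.match(user_agent) is not None


def in_example_range(ip: str) -> bool:
    """True if the address falls inside the example AWS range."""
    return ipaddress.ip_address(ip) in EXAMPLE_RANGE
```

Combining both checks (a matching User-Agent and a source address in an expected range) is generally more reliable than either one alone.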

📊 Data Usage

Collected data is used exclusively for academic research and development of the Scooter search engine. Content is stored in a non-commercial index and used to train retrieval models, improve ranking algorithms, and study web crawling techniques. Periodic crawl statistics are published for reproducibility.

⚙️ Rate Limiting Policy

Although Scooter is a legitimate crawler, it can generate significant traffic if left unthrottled. Administrators should apply rate limits per IP or per session to prevent resource exhaustion while still allowing the bot to index important academic content, since its data supports open research.
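One way to apply such a limit is a sliding-window counter keyed by IP address or session ID. The sketch below is illustrative only: the class name `SlidingWindowLimiter` is hypothetical, the defaults of 10 requests per 60 seconds simply mirror the cap quoted earlier in this entry, and a production setup would more likely rely on the rate-limiting features of the web server or CDN.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_s-second span."""

    def __init__(self, max_requests: int = 10, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a request for key and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call `limiter.allow(client_ip)` and respond with HTTP 429 when it returns `False`, which lets well-behaved crawlers back off instead of being blocked outright.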


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.