grub

Bot User-Agent: grub

🤖 Overview

Grub is a distributed web crawler started in 2000 as an independent project, acquired by LookSmart in 2003, and sold in 2007 to Wikia, which released it as open source for its Wikia Search effort. The project, documented at grub.org (now archived), used a network of volunteer client machines to crawl the web in a decentralized manner, feeding results into a central index for search engine use. Unlike traditional single-server crawlers, Grub's architecture allowed thousands of volunteers to contribute crawling capacity, making it one of the earliest large-scale distributed crawling systems.

🌐 Technical Behavior

Grub operates via a client-server protocol: the central server distributes URL lists to client software installed on volunteer machines, which fetch and parse the pages and report extracted links and metadata back to the server. The official Grub client, available as open source on SourceForge (project grub), uses HTTP/1.1 with persistent connections and respects robots.txt instructions. Crawl frequency per client is moderate, typically a few requests per minute, but the aggregate across thousands of clients is far larger: as an illustration, 5,000 clients at 3 requests per minute amount to about 250 requests per second, spread across all sites being crawled. Source IP addresses are highly distributed because each client uses its own residential or organizational address, which makes per-IP rate limiting impractical. The crawler supports HTTP and HTTPS and sends an Accept-Encoding: gzip, deflate header to reduce bandwidth.

📋 robots.txt Compliance

According to the Grub documentation on grub.org and archived project pages, the client software reads and obeys the robots.txt directives of each domain. It does not crawl paths listed under Disallow and it respects Crawl-delay instructions. However, because the crawl is distributed, the delay is applied by each client independently, so the combined request rate from many clients can exceed it, particularly when a site's Crawl-delay is low or absent. Overall, Grub is considered compliant with the Robots Exclusion Protocol, as documented by the project's own robots.txt handling code.
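As a rough illustration of the directives described above, the sketch below checks a robots.txt group against a Grub User-Agent with Python's standard urllib.robotparser. The robots.txt body, the example.com URLs, and the 10-second delay are assumptions made up for the example; only the "grub" token comes from this entry. It shows how a compliant parser would read the rules, not how the Grub client itself implements them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the "grub" token is this entry's Bot User-Agent;
# the path and the delay value are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: grub
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A path under the Disallow rule is refused, other paths are permitted.
blocked = rules.can_fetch("Grub/1.0", "https://example.com/private/report.html")
allowed = rules.can_fetch("Grub/1.0", "https://example.com/index.html")

# Crawl-delay for the matching group, in seconds.
delay = rules.crawl_delay("Grub/1.0")

print(blocked, allowed, delay)  # False True 10
```

Because the delay applies per client, a site that needs a hard ceiling on total Grub traffic would still have to enforce it server-side.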

🔍 Detection Indicators

The primary User-Agent string for Grub is "Grub Client-1.0" or "Grub/1.0 (compatible; +http://www.grub.org)". Some versions also send "Grub/0.3.0". Behaviorally, Grub shows a low request rate from any single IP address combined with many IP addresses fetching in parallel. It does not embed session tokens in its headers. Server logs may therefore show requests from many different IPs that share an identical User-Agent and arrive with similar timing, which reflects the distributed nature of the crawl.
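The indicators above can be sketched as a small log check. This is a minimal example, assuming log entries have already been reduced to (ip, user_agent) pairs; the regular expression, the function names, and the sample addresses (taken from documentation-only IP ranges) are hypothetical and not part of any Grub tooling.

```python
import re
from collections import defaultdict

# Matches "grub" as a whole word, case-insensitively, which covers the three
# User-Agent strings listed in this entry without matching e.g. "Grubhub".
GRUB_UA = re.compile(r"\bgrub\b", re.IGNORECASE)


def is_grub(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a Grub client."""
    return bool(GRUB_UA.search(user_agent))


def summarize(entries):
    """Map each Grub User-Agent to (request count, number of distinct IPs)."""
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, user_agent in entries:
        if is_grub(user_agent):
            hits[user_agent] += 1
            ips[user_agent].add(ip)
    return {ua: (hits[ua], len(ips[ua])) for ua in hits}


# Illustrative log sample, not real traffic.
sample = [
    ("192.0.2.10", "Grub Client-1.0"),
    ("198.51.100.7", "Grub Client-1.0"),
    ("203.0.113.5", "Grub/1.0 (compatible; +http://www.grub.org)"),
    ("192.0.2.10", "Grub Client-1.0"),
    ("192.0.2.99", "Mozilla/5.0 (X11; Linux x86_64)"),
]
report = summarize(sample)
print(report)
```

A high distinct-IP count relative to the request count for one User-Agent is the distributed pattern described above; the User-Agent alone can be spoofed, so this is a heuristic rather than proof.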

📊 Data Usage

The data collected by Grub was used for web search indexing. LookSmart, and later Wikia for its Wikia Search project, used the crawl to build and refresh search indexes. The project also reportedly published raw crawl data for academic research, contributing to datasets like the ClueWeb09 collection. No AI training or advertising analytics use has been associated with Grub beyond search index generation.

⚙️ Rate Limiting Policy

Because Grub requests arrive from many distributed IPs, per-IP rate limits are ineffective; rate limiting should instead be keyed on requests whose User-Agent matches the Grub client strings. Administrators are advised to allow a generous short-term burst for this User-Agent (e.g., 200 requests per minute) and, if the crawler still consumes excessive bandwidth, to enforce a sustained budget with a token bucket of 500 requests per hour, returning 429 Too Many Requests once the bucket is exhausted.
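A token bucket keyed on the User-Agent rather than the IP could look like the sketch below. It is a minimal in-memory illustration, assuming a single server process; the TokenBucket class and the status_for helper are hypothetical names, and the 500-per-hour figure is the example budget from the paragraph above (the per-minute burst cap is not modeled here).

```python
import time


class TokenBucket:
    """Simple token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, refill_per_second, now=None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(user_agent, bucket, now=None):
    """Return the HTTP status to send: 429 only for Grub traffic over budget."""
    if "grub" not in user_agent.lower():
        return 200  # other clients are not subject to this bucket
    return 200 if bucket.allow(now) else 429


# One shared bucket for all Grub clients: 500 requests per hour.
grub_bucket = TokenBucket(capacity=500, refill_per_second=500 / 3600)
print(status_for("Grub/1.0 (compatible; +http://www.grub.org)", grub_bucket))
```

Sharing one bucket across every matching request is what makes the limit independent of how many client IPs take part; a multi-server deployment would need the counter in shared storage instead of process memory.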

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.