gurujibot

Bot User-Agent: gurujibot

🤖 Overview

Gurujibot is the official web crawler operated by Guru, the enterprise knowledge management platform headquartered in Philadelphia, Pennsylvania. First deployed in 2019, it indexes publicly available web content from customer‑specified sources (e.g., wikis, help centers, public documentation) to feed Guru’s “Collect” feature, which ingests structured and unstructured data into a centralized knowledge base for AI‑powered search and answer generation. The bot is documented on Guru’s official help center as part of the company’s content ingestion pipeline, and it is explicitly named in the company’s robots.txt guidelines.

🌐 Technical Behavior

Gurujibot crawls over HTTP and HTTPS using HTTP/1.1, issuing GET requests at a configurable rate that defaults to 1 request per second per source domain; short bursts of up to 5 requests are allowed to reduce crawl latency. It sends conditional requests (If‑Modified‑Since, plus If‑None‑Match with a previously returned ETag) to skip unchanged content, as detailed in Guru’s “Content Ingestion” knowledge base article. The bot originates from a fixed set of IP ranges that Guru publishes on its status page and in its documentation: currently 34.227.0.0/16 and 44.198.0.0/16 (both AWS us-east-1). It sets the Referer header and includes a From header with the contact email address [email protected], as confirmed by a 2024 update to Guru’s user‑agent policy.
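As a rough illustration of how the published ranges could be used, the sketch below checks whether a client IP falls inside the two CIDR blocks quoted above. The constant and function names are hypothetical, and the ranges are copied from this page rather than fetched from Guru, so treat the list as an assumption that may go stale.

```python
import ipaddress

# Ranges as quoted in this article (assumption: they may change over time).
GURUJIBOT_RANGES = [
    ipaddress.ip_network("34.227.0.0/16"),
    ipaddress.ip_network("44.198.0.0/16"),
]


def ip_in_published_ranges(ip: str) -> bool:
    """Return True if `ip` parses as an address inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IP literal at all.
        return False
    return any(addr in net for net in GURUJIBOT_RANGES)
```

Because both blocks are described as general AWS us-east-1 space, a match only narrows the origin; it is best combined with the User‑Agent checks described under Detection Indicators.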

📋 robots.txt Compliance

According to Guru’s official documentation published at help.getguru.com, Gurujibot fully honors Disallow directives in robots.txt. The bot will not crawl any excluded path or directory, and it honors Crawl‑delay directives by slowing its request rate accordingly. Guru advises that administrators can block the bot entirely by adding User‑agent: Gurujibot followed by Disallow: / to their robots.txt.
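To see what such rules look like in practice, here is a minimal sketch that evaluates two example robots.txt files with Python's standard urllib.robotparser. The file contents, the example.com URLs, and the helper name are illustrative assumptions; the snippet only shows how a compliant parser would read the directives, not how Gurujibot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Example 1: block Gurujibot entirely, as described above.
BLOCK_ALL = """\
User-agent: Gurujibot
Disallow: /
"""

# Example 2 (hypothetical): allow crawling but slow it down and hide one path.
THROTTLE = """\
User-agent: Gurujibot
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


blocked = load_rules(BLOCK_ALL)
throttled = load_rules(THROTTLE)

# With BLOCK_ALL, Gurujibot may fetch nothing; other agents are unaffected.
gurujibot_blocked = not blocked.can_fetch("Gurujibot", "https://example.com/docs/")
others_allowed = blocked.can_fetch("OtherBot", "https://example.com/docs/")

# With THROTTLE, only /private/ is off limits and a 5-second delay is requested.
delay_seconds = throttled.crawl_delay("Gurujibot")
```

Note that Crawl‑delay is a non-standard extension, so crawlers other than the one described here may ignore it.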

🔍 Detection Indicators

The primary User‑Agent token is Gurujibot/2.0 (sometimes seen in lowercase as gurujibot/2.0). A variant with a different version suffix (e.g., Gurujibot/2.1) appears during internal testing. The bot may also send the full browser‑style string Mozilla/5.0 (compatible; Gurujibot/2.0; +https://www.guru.com/crawler). The presence of a From: [email protected] header is a reliable detection fingerprint. No known CVEs have been reported for this crawler; its behavior is consistent and well‑documented.
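One way to match the User‑Agent variants listed above is a case-insensitive regular expression, sketched below. The pattern and function name are hypothetical, and the match relies on a header that any client can set, so it is a starting point rather than proof of origin.

```python
import re
from typing import Optional

# Matches "Gurujibot" in any letter case, optionally followed by "/<version>".
GURUJIBOT_UA = re.compile(r"\bgurujibot(?:/(\d+(?:\.\d+)*))?", re.IGNORECASE)


def gurujibot_version(user_agent: Optional[str]) -> Optional[str]:
    """Return the version string if the UA names Gurujibot, else None.

    An empty string is returned when the token appears without a version.
    """
    match = GURUJIBOT_UA.search(user_agent or "")
    if match is None:
        return None
    return match.group(1) or ""
```

For example, gurujibot_version("Mozilla/5.0 (compatible; Gurujibot/2.0; +https://www.guru.com/crawler)") returns "2.0", while an ordinary browser string returns None.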

📊 Data Usage

Data collected by Gurujibot is used exclusively for indexing within the Guru platform. This includes building a searchable knowledge graph that powers the Guru AI answer engine, which returns responses to user queries based on ingested documentation. The content is not used for training external language models or sold to third parties, as explicitly stated in Guru’s privacy policy and data processing agreement (DPA).

⚙️ Rate Limiting Policy

Although Gurujibot is legitimate and compliant, it can be aggressive when indexing large sites with many linked pages (e.g., thousands of documentation articles). Rate‑limiting it to 10 requests per second per IP is recommended: this prevents unnecessary load on origin servers while still letting the bot complete its crawl in a reasonable timeframe, balancing content freshness against server resources.
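The recommended threshold could be enforced in many ways; the sketch below uses a simple per-IP token bucket, which is one common choice and not something this page attributes to Guru. The class name, the burst size of 10, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class PerIpRateLimiter:
    """Token bucket per client IP: `rate` tokens per second, up to `burst` stored."""

    def __init__(
        self,
        rate: float = 10.0,
        burst: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # Maps ip -> (tokens remaining, timestamp of last update).
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str) -> bool:
        """Return True and spend one token if `ip` is under its limit."""
        now = self.clock()
        tokens, last = self._buckets.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[ip] = (tokens - 1.0, now)
            return True
        self._buckets[ip] = (tokens, now)
        return False
```

A request handler would call allow(client_ip) and answer with HTTP 429 when it returns False. In a long-running service the bucket dictionary would also need periodic pruning, which is omitted here for brevity.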


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.