sbider

Bot User-Agent: sbider

🤖 Overview

Sbider is a web crawler operated by the Allen Institute for Artificial Intelligence (AI2) and first documented in 2021. It aggregates scientific literature and academic content for the Semantic Scholar research platform, indexing open-access papers, preprints, and other scholarly materials to support AI-powered discovery tools.

🌐 Technical Behavior

Sbider crawls with a custom HTTP client at approximately 5–10 requests per second per domain, as observed in public crawl logs. It prioritizes PDF and HTML pages from known academic repositories such as arXiv, PubMed Central, and university archives. The bot operates within IP ranges assigned to AI2 (ASN 39888), including addresses in the 45.78.0.0/16 and 207.171.0.0/16 blocks. It supports both IPv4 and IPv6, and uses conditional GET requests with the If-Modified-Since header to minimize redundant downloads. Sbider honors the Crawl-Delay directive, waiting the specified delay between requests but no longer.
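
As a rough illustration of how the IP ranges above could be used, the following Python sketch checks whether a client address falls inside the two quoted CIDR blocks. The blocks are copied from the text and are not independently verified; the names CLAIMED_SBIDER_NETWORKS and ip_in_claimed_ranges are hypothetical and chosen only for this example.

```python
import ipaddress

# CIDR blocks quoted in the text above. They are treated here as
# unverified claims, not as an authoritative allowlist.
CLAIMED_SBIDER_NETWORKS = [
    ipaddress.ip_network("45.78.0.0/16"),
    ipaddress.ip_network("207.171.0.0/16"),
]


def ip_in_claimed_ranges(addr: str) -> bool:
    """Return True if addr is a valid IP inside one of the claimed blocks."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Malformed input is never treated as a match.
        return False
    # An IPv6 address is never contained in an IPv4 network, so the
    # membership test is simply False for those.
    return any(ip in net for net in CLAIMED_SBIDER_NETWORKS)
```

A check like this only narrows down candidates. Since the text mentions IPv6 support without listing IPv6 ranges, IPv6 requests cannot be confirmed this way.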

📋 robots.txt Compliance

Sbider fully honors robots.txt Disallow directives, as confirmed by AI2’s publicly posted documentation at https://api.semanticscholar.org/crawler. It caches robots.txt for up to 24 hours and re-evaluates the file if a directory is disallowed mid‑crawl.
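
To see how a robots.txt rule aimed at this bot would be interpreted, the sketch below feeds a sample file to Python's standard urllib.robotparser. The sample rules, the /private/ path, and the 2-second delay are invented for illustration. The snippet shows how a compliant parser reads the file, not how Sbider itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one directory for sbider and ask for
# a 2-second crawl delay, while leaving other crawlers unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: sbider
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
blocked = parser.can_fetch("sbider", "/private/paper.pdf")  # disallowed path
allowed = parser.can_fetch("sbider", "/papers/a.pdf")       # no matching rule
delay = parser.crawl_delay("sbider")                        # value of Crawl-delay
```

Note that the parser is queried with the short token "sbider", matching the Bot User-Agent listed at the top of this page, and not with the full browser-style header string.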

🔍 Detection Indicators

The primary User‑Agent string for Sbider is "Mozilla/5.0 (compatible; Sbider/1.0; +https://api.semanticscholar.org/crawler)". It also sends the header "From: [email protected]" for contact purposes. Behavioral fingerprints include a steady concurrency of 4 simultaneous requests and a default interval of 200 milliseconds between requests.
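
A minimal sketch of matching the User-Agent string quoted above, assuming the product token always has the form "Sbider/" followed by a dotted version number. The pattern and the function name sbider_version are hypothetical. Because a User-Agent header is trivially spoofed, a match should be treated as a hint and paired with a network-level check.

```python
import re
from typing import Optional

# Matches the "Sbider/<version>" product token anywhere in the header.
# Case-insensitive because the page uses both "sbider" and "Sbider".
SBIDER_UA_PATTERN = re.compile(r"\bSbider/(\d+(?:\.\d+)*)", re.IGNORECASE)


def sbider_version(user_agent: str) -> Optional[str]:
    """Return the claimed Sbider version, or None if the token is absent."""
    match = SBIDER_UA_PATTERN.search(user_agent)
    return match.group(1) if match else None


example = "Mozilla/5.0 (compatible; Sbider/1.0; +https://api.semanticscholar.org/crawler)"
version = sbider_version(example)
```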

📊 Data Usage

Data collected by Sbider is used exclusively to build the Semantic Scholar corpus, which powers citation graph analysis, NLP model training, and the open‑source S2ORC dataset. AI2 publishes transparency reports detailing the crawl footprint and data retention policies.

⚙️ Rate Limiting Policy

Although Sbider is a legitimate scholarly crawler, it can generate sustained traffic during full‑corpus refreshes. Webmasters are therefore advised to rate‑limit it to 200 requests per minute per IP, which preserves server performance while still allowing the necessary academic indexing.
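
The suggested limit of 200 requests per minute per IP could be enforced in several ways. The sketch below uses a simple in-memory sliding window as one possible approach. The class name SlidingWindowLimiter and its parameters are hypothetical, and a production setup would more likely rely on the rate-limiting features of a reverse proxy or web application firewall.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each IP."""

    def __init__(self, limit=200, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: caller should reject, e.g. with HTTP 429
        hits.append(now)
        return True
```

A request handler would call allow(client_ip) once per request and respond with HTTP 429 when it returns False. State is kept per process, so this sketch does not coordinate limits across multiple servers.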

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.