semisearch Bot — Detection, Blocking & Technical Analysis

Search Engine User-Agent: semisearch

🤖 Overview

Semisearch is a web crawler operated by the Allen Institute for Artificial Intelligence (AI2), specifically designed to support Semantic Scholar, a free academic search engine that indexes over 200 million research papers and 200 million citations. The bot's primary purpose is to discover and fetch publicly available scholarly content, including PDFs, metadata, and citation graphs, which feed into AI2’s machine learning models for natural language processing and scientific literature analysis. It was first publicly documented in 2015 and has since become a key agent for updating Semantic Scholar’s corpus.

🌐 Technical Behavior

Semisearch employs a multi-threaded, asynchronous crawling architecture and typically sends 20 to 50 requests per second, with randomized inter-request delays intended to reduce server strain. It speaks HTTP/1.1 exclusively, over plain HTTP or HTTPS, and defaults to a 256 KB download buffer for PDF files. The crawler's IP addresses are drawn from cloud providers such as AWS (54.xx.xx.xx and 52.xx.xx.xx ranges) and Google Cloud Platform (35.xx.xx.xx), though no fixed IP list is published. It follows a breadth-first crawl strategy that prioritizes paper landing pages and direct PDF links, and it makes conditional requests based on If-Modified-Since and ETag values to avoid re-downloading unchanged content. The bot also sends a From header ([email protected]) and identifies itself with the User-Agent string SemanticScholarBot/2.0 or, less commonly, with the alias Semisearch. It defaults to a 30-second timeout and retries failed requests up to three times.
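Because the crawler is described as making conditional requests, a server can answer unchanged resources with 304 Not Modified instead of resending the body. The following is a minimal sketch of that decision on the server side, not the crawler's own code. The function name conditional_status and its parameters are hypothetical, and it assumes the crawler carries a stored ETag in the standard If-None-Match request header and that last_modified is a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag: str) -> str:
    """Drop the weak-validator prefix so W/"abc" compares equal to "abc"."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers: dict, etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still valid, else 200.

    Hypothetical helper: `etag` is the resource's current entity tag
    (including quotes) and `last_modified` is timezone-aware.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = {_strip_weak(t.strip()) for t in if_none_match.split(",")}
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the header
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200


modified = datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc)
print(conditional_status({"If-None-Match": '"abc"'}, '"abc"', modified))
print(conditional_status({"If-Modified-Since": "Tue, 20 Oct 2015 07:28:00 GMT"}, '"abc"', modified))
```

Serving 304 responses this way reduces bandwidth for repeat visits without blocking the crawler.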

📋 robots.txt Compliance

According to the official documentation on the Semantic Scholar crawler page, Semisearch fully adheres to the Robots Exclusion Standard. It checks the robots.txt file at the root of each domain and respects both a group addressed to the SemanticScholarBot user agent and the generic * group. The bot will honour a Crawl-delay directive if present and will cease crawling any path explicitly disallowed (e.g., Disallow: /private/). Site owners can block the bot entirely by disallowing the root path in a group addressed to SemanticScholarBot, or for all crawlers under the generic * group.
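The rules described above can be checked locally before deployment. The sketch below embeds an illustrative robots.txt in a string and evaluates it with Python's standard urllib.robotparser. The example.org URLs, the 10-second delay, and the /private/ path are assumptions chosen for illustration, and the standard-library parser only approximates how the crawler itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: slow the bot down and keep it out of /private/,
# while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: SemanticScholarBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
agent = "SemanticScholarBot/2.0"

print(parser.can_fetch(agent, "https://example.org/private/draft.pdf"))
print(parser.can_fetch(agent, "https://example.org/papers/1234"))
print(parser.crawl_delay(agent))
```

To block the bot entirely, the bot-specific group would instead contain the single rule Disallow: /.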

🔍 Detection Indicators

The primary User-Agent string is SemanticScholarBot/2.0 (https://www.semanticscholar.org/), but older versions may appear as SemanticScholarBot/1.0 or simply SemanticScholarBot. In some log analyses, the string Semisearch is also observed as an alias. Additional fingerprinting clues include the From header set to [email protected], an Accept header of text/html,application/xhtml+xml,application/pdf;q=0.9,*/*;q=0.8, and a distinct pattern of requesting academic URLs (e.g., /pdf/ or /papers/) in quick succession. The bot does not render JavaScript and sends only GET requests.
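These indicators can be combined into a simple log or request filter. The sketch below is one possible approach under stated assumptions: the function names indicator_report and looks_like_semisearch are hypothetical, and the decision rule (User-Agent match plus GET method) is an illustrative choice rather than a documented heuristic. User-Agent strings are trivially spoofed, so a match is a hint, not proof of origin.

```python
import re

# User-Agent tokens named in this section, matched case-insensitively.
UA_RE = re.compile(r"SemanticScholarBot(?:/\d+(?:\.\d+)*)?|Semisearch", re.IGNORECASE)
# Path fragments the section lists as typical targets.
ACADEMIC_PATH_RE = re.compile(r"/(?:pdf|papers)/")


def indicator_report(method: str, path: str, headers: dict) -> dict:
    """Return which of the listed indicators a single request exhibits."""
    h = {name.lower(): value for name, value in headers.items()}
    return {
        "user_agent": UA_RE.search(h.get("user-agent", "")) is not None,
        "from_header": "from" in h,
        "accepts_pdf": "application/pdf" in h.get("accept", ""),
        "get_method": method.upper() == "GET",
        "academic_path": ACADEMIC_PATH_RE.search(path) is not None,
    }


def looks_like_semisearch(method: str, path: str, headers: dict) -> bool:
    """Illustrative rule: the User-Agent matches and the request is a GET."""
    report = indicator_report(method, path, headers)
    return report["user_agent"] and report["get_method"]


sample_headers = {
    "User-Agent": "SemanticScholarBot/2.0 (https://www.semanticscholar.org/)",
    "Accept": "text/html,application/xhtml+xml,application/pdf;q=0.9,*/*;q=0.8",
}
print(looks_like_semisearch("GET", "/pdf/1234.pdf", sample_headers))
```

The per-indicator report is kept separate from the final decision so that the secondary signals (Accept header, path pattern) can be logged or weighted differently if needed.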

📊 Data Usage

Data collected by Semisearch is exclusively used to enhance the Semantic Scholar database, which powers AI-driven features such as citation graphs, topic modeling, and paper recommendations. The indexed content also feeds into AI2’s training pipelines for advanced NLP models, including large language models tuned for scientific question answering and summarisation. No personally identifiable information is intentionally harvested; only publicly accessible scholarly material is processed under the project’s non‑commercial research goals.

⚙️ Rate Limiting Policy

Rate limiting is recommended for Semisearch because its high request volume, especially on large publisher sites, can overwhelm server resources. Administrators are advised to enforce threshold-based blocking once the bot sustains more than 10 requests per second, while permitting traffic below that threshold so that legitimate indexing can continue.
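One way to detect a sustained rate above the threshold is a sliding-window counter per client. The sketch below is a minimal in-memory illustration, not a production limiter. The class name SustainedRateMonitor and the 10-second window are assumptions; only the threshold of 10 requests per second comes from the text above. Timestamps are passed in explicitly so the logic can be exercised without real traffic.

```python
from collections import deque


class SustainedRateMonitor:
    """Track request timestamps per client over a sliding window."""

    def __init__(self, max_rate: float = 10.0, window_seconds: float = 10.0):
        self.max_rate = max_rate      # requests per second (from the policy)
        self.window = window_seconds  # assumed averaging window
        self.hits: dict = {}          # client key -> deque of timestamps

    def record(self, client: str, now: float) -> bool:
        """Record one request; return True if the client should be blocked."""
        timestamps = self.hits.setdefault(client, deque())
        timestamps.append(now)
        # Discard requests that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        return len(timestamps) / self.window > self.max_rate


monitor = SustainedRateMonitor()

# Simulate 10 seconds of traffic at 20 requests per second (above threshold).
fast = [monitor.record("crawler-a", i * 0.05) for i in range(200)]
# Simulate 10 seconds of traffic at 5 requests per second (below threshold).
slow = [monitor.record("crawler-b", i * 0.2) for i in range(50)]

print(any(fast), any(slow))
```

Averaging over a window rather than reacting to a single burst matches the "sustained rate" wording: a short spike passes, while prolonged traffic above the threshold triggers the block.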

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
