[NC] Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: nc

🤖 Overview

NCBI E‑utilities Crawler is an automated agent operated by the National Center for Biotechnology Information (NCBI), a division of the U.S. National Library of Medicine at the National Institutes of Health. Its primary purpose is to programmatically retrieve biomedical literature, gene sequences, and genomic data from NCBI resources such as PubMed, GenBank, and BLAST. It lets researchers and third‑party tools (e.g., Galaxy, Taverna) access these resources through the Entrez Programming Utilities (E‑utilities) API. Rather than crawling web pages, it issues requests against a fixed base URL (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/).
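As an illustration of the URL pattern described above, the following sketch builds an E‑utilities request URL without sending it. The `esearch.fcgi` endpoint and the `db`, `term`, `tool`, `email`, and `api_key` parameters come from NCBI's public E‑utilities documentation; the function name `build_esearch_url` and all example values are hypothetical.

```python
from urllib.parse import urlencode

# Base URL quoted in the overview above.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, db="pubmed", tool="myapp",
                      email="user@example.org", api_key=None):
    """Build (but do not send) an ESearch request URL.

    The default tool and email values are placeholders for illustration.
    """
    params = {"db": db, "term": term, "tool": tool, "email": email}
    if api_key:
        # Only included when the caller actually has a key.
        params["api_key"] = api_key
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("BRCA1[gene]")
print(url)
```

Requests of this shape are what a server log would show for E‑utilities traffic: a fixed path prefix followed by a query string, with no page‑to‑page link traversal.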

🌐 Technical Behavior

The NCBI E‑utilities Crawler does not traverse web pages like a general‑purpose search engine bot; instead it sends structured HTTP GET or POST requests to a fixed set of API endpoints. Official documentation asks that requests include tool and email parameters so that NCBI can contact the developer about problems before access is blocked. NCBI enforces a rate limit of 3 requests per second for users without an API key, and 10 requests per second with a valid key. Requests originate from IP addresses in NCBI’s own subnets (e.g., 130.14.0.0/16), which are geo‑located within the United States. The crawler uses HTTPS exclusively and follows redirects only for canonical URLs. It does not support cookies or JavaScript; all interaction is stateless.
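A minimal sketch of checking whether a client address falls inside the subnet mentioned above, using Python's standard `ipaddress` module. The list `NCBI_NETWORKS` contains only the single example range from the text and should not be read as a complete or authoritative list; the function name `is_in_ncbi_range` is hypothetical.

```python
import ipaddress

# Only the example subnet quoted in the text; an assumption, not a full list.
NCBI_NETWORKS = [ipaddress.ip_network("130.14.0.0/16")]


def is_in_ncbi_range(ip):
    """Return True if `ip` parses as an address inside a listed network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input is treated as "not in range".
        return False
    return any(addr in net for net in NCBI_NETWORKS)


print(is_in_ncbi_range("130.14.29.110"))
print(is_in_ncbi_range("8.8.8.8"))
```

An address match on its own is weak evidence in either direction, so a check like this is best combined with the other indicators on this page.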

📋 robots.txt Compliance

The NCBI E‑utilities bot does not crawl public websites directly; it targets only NCBI‑controlled domains (ncbi.nlm.nih.gov, pubmed.ncbi.nlm.nih.gov). robots.txt directives on external sites are therefore irrelevant to it. Within NCBI’s own servers, the E‑utilities interface is governed by API terms of service rather than robots.txt. A separate NCBI crawler used for indexing (e.g., the “NCBI Bot” that harvests full‑text articles for PMC) reportedly does honor Disallow rules on publisher sites.

🔍 Detection Indicators

Requests from the NCBI E‑utilities bot can be identified by the presence of the email and tool query parameters (e.g., tool=myapp). The standard User‑Agent string is NCBI/1.0 (https://www.ncbi.nlm.nih.gov/entrez/) or a variation such as Entrez/1.0. Behavioral fingerprints include a sustained, regular request rate against a single host, no Referer header, and an Accept header that prefers XML or JSON (e.g., application/xml). The absence of standard web‑browser signatures (an Accept‑Language header, a User‑Agent beginning with “Mozilla”) is a strong indicator.
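The indicators above can be combined into a simple checklist. The sketch below is one possible way to do that, not a vetted detection rule: the function name `eutils_indicators`, the indicator labels, and the choice to read "prefers" as "listed first in the Accept header" are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Media types treated as "XML or JSON" for this sketch (an assumption).
STRUCTURED_TYPES = {"application/xml", "text/xml", "application/json"}


def eutils_indicators(request_url, headers):
    """Return the indicators from the text that match one request."""
    h = {k.lower(): v for k, v in headers.items()}
    query = parse_qs(urlsplit(request_url).query)
    ua = h.get("user-agent", "")
    hits = []
    if "email" in query:
        hits.append("email parameter")
    if "tool" in query:
        hits.append("tool parameter")
    if ua.startswith(("NCBI/", "Entrez/")):
        hits.append("NCBI-style User-Agent")
    if "referer" not in h:
        hits.append("no Referer header")
    # Simplification: look only at the first media type, ignoring q-values.
    first_type = h.get("accept", "").split(",")[0].split(";")[0].strip().lower()
    if first_type in STRUCTURED_TYPES:
        hits.append("Accept prefers XML/JSON")
    if "accept-language" not in h and not ua.startswith("Mozilla"):
        hits.append("no browser signature")
    return hits


hits = eutils_indicators(
    "/entrez/eutils/esearch.fcgi?db=pubmed&tool=myapp&email=a%40b.org",
    {"User-Agent": "NCBI/1.0 (https://www.ncbi.nlm.nih.gov/entrez/)",
     "Accept": "application/xml"},
)
print(hits)
```

Request frequency is left out because it needs state across requests. None of these signals is conclusive alone, since any HTTP client can set the same headers and parameters.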

📊 Data Usage

Collected data (titles, abstracts, gene sequences, and citation metadata) is aggregated into NCBI’s public databases. These databases are in turn used for AI training (e.g., biomedical language models like PubMedBERT), search indexing (the PubMed search engine), and analytics (e.g., literature mining for drug discovery). NCBI also provides the data to third‑party researchers under open‑access policies, enabling systematic reviews and meta‑analyses. No personally identifiable information is collected; all data is de‑identified or already public.

⚙️ Rate Limiting Policy

Rate limiting is enforced because the bot’s automated queries can overwhelm shared API endpoints if unrestricted; NCBI’s policy caps requests at 3/sec (no key) and 10/sec (with key) to ensure fair access for all users and prevent denial of service. Threshold‑based blocking is necessary to protect the infrastructure from unintentional abuse by misconfigured scripts or by users who exceed the agreed‑upon limits without an API key.
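To make the thresholds concrete, here is a minimal sliding‑window sketch of how a cap of 3 or 10 requests per second could be tracked. It is not NCBI's implementation, which is not described in the text; the class `SlidingWindowLimiter` and the helper `limiter_for` are hypothetical names, and the clock is injectable so the behavior can be checked without sleeping.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_per_second` requests in any one-second window."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.sent = deque()  # timestamps of recently allowed requests

    def allow(self):
        now = self.clock()
        # Drop timestamps that are one second old or older.
        while self.sent and now - self.sent[0] >= 1.0:
            self.sent.popleft()
        if len(self.sent) < self.max_per_second:
            self.sent.append(now)
            return True
        return False


def limiter_for(api_key=None):
    """Pick the cap from the text: 10/sec with a key, 3/sec without."""
    return SlidingWindowLimiter(10 if api_key else 3)


# Deterministic demonstration with a fake clock.
fake_now = [0.0]
limiter = SlidingWindowLimiter(3, clock=lambda: fake_now[0])
burst = [limiter.allow() for _ in range(4)]
print(burst)  # the fourth request in the same instant is rejected
```

The same structure works on either side of the connection: a client can use it to stay under the limit, and a server can keep one limiter per key or IP address to decide when to reject a request.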

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
