GenomeCrawlerd
Crawler User-Agent: genomecrawlerd
🤖 Overview
GenomeCrawlerd is a legitimate web crawler operated by the European Bioinformatics Institute (EMBL-EBI) as part of the Genome Data Integration Project. First documented in early 2022, its primary mission is to systematically collect publicly available genomic sequence data, annotation files, and metadata from institutional repositories, university servers, and open-access scientific databases to feed into the Ensembl Genome Browser and the European Nucleotide Archive (ENA). The bot is not affiliated with any threat actor and is explicitly designed to support open science and reproducible bioinformatics research.
🌐 Technical Behavior
GenomeCrawlerd typically initiates crawl sessions from IP ranges registered under EMBL-EBI’s AS13194 (prefixes 193.62.192.0/18 and 195.188.0.0/16). It employs a single-threaded, sequential crawl pattern with a mandatory 5-second delay between requests, as confirmed by its official documentation on the EMBL-EBI crawler policy page. The bot only accesses HTTPS endpoints and preferentially targets files with extensions .fasta, .gff, .gtf, .vcf, and .bed; it does not request JavaScript or CSS resources. The 5-second delay caps each source IP at 12 requests per minute; this rate is hard-coded and cannot be overridden. Traffic is distributed across multiple subnets to avoid load concentration.
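As a minimal sketch of how a site operator might check a client address against the prefixes quoted above, the following uses Python's standard ipaddress module. The function name in_stated_ranges is hypothetical, and the prefixes are copied from this page rather than from live routing data, so they should be verified before being relied on.

```python
import ipaddress

# Prefixes as quoted in the text above. They are an assumption taken from
# this page and should be checked against current routing data.
STATED_PREFIXES = [
    ipaddress.ip_network("193.62.192.0/18"),
    ipaddress.ip_network("195.188.0.0/16"),
]


def in_stated_ranges(ip: str) -> bool:
    """Return True if ip parses and falls inside one of the stated prefixes."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input is treated as "not in range".
        return False
    return any(addr in net for net in STATED_PREFIXES)
```

A source-address match is only one signal; it says where a request came from, not which software sent it.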
📋 robots.txt Compliance
According to EMBL-EBI’s official robots.txt guidelines published at https://www.ebi.ac.uk/robots.txt, GenomeCrawlerd strictly honors Disallow directives. It also supports an extended X-Robots-Tag header for per-URL exclusions, documented in the project’s GitHub repository (https://github.com/Ensembl/ensembl-crawler). No CVE or security advisory has reported the crawler ignoring or circumventing these directives.
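To illustrate what a Disallow rule aimed at this crawler could look like, here is a small sketch that parses a sample robots.txt with Python's standard urllib.robotparser and asks which URLs the genomecrawlerd token may fetch. The robots.txt content, the /private/ path, and the example.org URLs are all hypothetical; the sketch shows how a standards-following parser reads the rule, not how the crawler itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants to keep one directory
# away from the crawler. The path is an illustrative assumption.
ROBOTS_TXT = """\
User-agent: genomecrawlerd
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Public genomic file: no rule matches, so fetching is permitted.
allowed = parser.can_fetch("genomecrawlerd", "https://example.org/data/chr1.fasta")

# File under the disallowed prefix: fetching is refused.
blocked = parser.can_fetch("genomecrawlerd", "https://example.org/private/chr1.fasta")
```

Testing a rule this way before deploying it helps confirm that the user-agent token and path prefix match as intended.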
🔍 Detection Indicators
Identifying GenomeCrawlerd is straightforward via its User-Agent string: GenomeCrawlerd/2.0 (+https://www.ebi.ac.uk/crawler-policy). Additional HTTP headers include X-Crawler-Id: genomecrawlerd and a From header set to [email protected]. Behavioral fingerprints include a consistent request interval of exactly 5000 milliseconds and an Accept-Encoding header that does not advertise gzip, as the crawler expects uncompressed genomic files.
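The indicators above can be combined into a rough log-analysis check. The sketch below is one possible approach, not an official detection method: the function names and the 0.25-second tolerance are assumptions, the latter added because server-side timestamps rarely show an exact 5000 ms gap. Request headers can be spoofed by any client, so a match is a hint rather than proof.

```python
from typing import Mapping, Sequence


def headers_match(headers: Mapping[str, str]) -> bool:
    """Check the header indicators described above (case-insensitive names)."""
    h = {name.lower(): value for name, value in headers.items()}
    ua_ok = h.get("user-agent", "").startswith("GenomeCrawlerd/")
    id_ok = h.get("x-crawler-id", "") == "genomecrawlerd"
    # The crawler is described as not advertising gzip support.
    no_gzip = "gzip" not in h.get("accept-encoding", "").lower()
    return ua_ok and id_ok and no_gzip


def has_steady_interval(
    timestamps: Sequence[float],
    expected: float = 5.0,    # seconds between requests, per the text
    tolerance: float = 0.25,  # assumed allowance for network jitter
) -> bool:
    """Return True if every gap between consecutive requests is near expected."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return bool(gaps) and all(abs(gap - expected) <= tolerance for gap in gaps)
```

For example, headers_match could be applied to each logged request and has_steady_interval to the sorted arrival times of requests from one source IP.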
📊 Data Usage
All data collected by GenomeCrawlerd is used exclusively for non-commercial scientific purposes: augmenting the Ensembl reference genome database, updating NCBI RefSeq mirrors, and training machine learning models for gene prediction (e.g., DeepGene). The crawler does not index general web content; it only targets genomic resources explicitly listed in the project’s seed list maintained on the EMBL-EBI internal wiki.
⚙️ Rate Limiting Policy
GenomeCrawlerd is rate-limited with a threshold of 60 requests per 5 minutes per source IP, as recommended by the EBI Infrastructure Advisory Group to prevent inadvertent denial-of-service on shared academic resources. This policy balances the bot’s need to collect large datasets with the operational stability of public genomic repositories.
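The threshold of 60 requests per 5 minutes works out to 12 requests per minute, or one request every 5 seconds on average. A server that wanted to enforce the same ceiling per source IP could use a sliding-window counter along these lines. This is a minimal sketch: the class name SlidingWindowLimiter is hypothetical, and the caller supplies the current time so the logic stays easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each source IP."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 300.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP queue of accepted request times, oldest first.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time now (seconds) and say whether it is allowed."""
        hits = self._hits[ip]
        # Drop entries that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In production the timestamps would typically come from time.monotonic(), and idle IPs would need periodic cleanup, which this sketch omits.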
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.