siclab
Bot User-Agent: siclab
🤖 Overview
SICLAB is a web crawler operated by the Security and Internet Communication Laboratory (SIC Lab) at Korea University, first documented in public access logs and the lab’s official website (siclab.korea.ac.kr) around 2019. Its purpose is to collect publicly accessible web content for academic research on web security, privacy measurement, and ethical crawling practices, feeding data into studies published at venues such as IEEE S&P and USENIX Security.
🌐 Technical Behavior
The crawler issues HTTP GET requests for text/html resources only, in a single‑threaded, sequential pattern with intervals of 5 to 30 seconds between requests to a domain. It does not fetch JavaScript, CSS, or images, and it does not execute client‑side scripts. Requests originate from the 163.152.0.0/16 range (Korea University network) and the IPv6 block 2a01:4f8:0:1::/48. They use HTTP/1.1 with a Connection: keep-alive header and a From: [email protected] contact header, and they send no Accept-Encoding header (the crawler does not support gzip). The lab’s documentation limits the crawler to 50 requests per site per day and mandates a minimum 10‑second delay between requests to the same host; note that the 5‑second lower bound of the reported interval is shorter than this documented minimum.
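As a rough illustration, the address ranges quoted above can be checked with Python's standard ipaddress module. This is a minimal sketch, not an official verification method: the function name in_siclab_range is hypothetical, and the ranges are copied from the text rather than from the lab's published IP list. The IPv6 prefix as quoted has bits set beyond the /48 boundary, so the sketch passes strict=False to normalize it instead of raising an error.

```python
import ipaddress

# Ranges as quoted in the text (not verified against the lab's own list).
# strict=False is required because "2a01:4f8:0:1::/48" has host bits set;
# it is normalized to 2a01:4f8::/48.
SICLAB_NETWORKS = [
    ipaddress.ip_network("163.152.0.0/16"),
    ipaddress.ip_network("2a01:4f8:0:1::/48", strict=False),
]


def in_siclab_range(ip: str) -> bool:
    """Return True if `ip` parses and falls inside one of the quoted ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IPv4/IPv6 literal
    return any(
        addr.version == net.version and addr in net
        for net in SICLAB_NETWORKS
    )
```

A source address match alone says only that the request came from the quoted network; it is best combined with the other indicators on this page.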
📋 robots.txt Compliance
According to the SIC Lab’s publicly posted crawler policy (available at github.com/siclab-web/crawler-policy), the crawler strictly respects robots.txt Disallow directives and will not access paths marked as disallowed. It caches robots.txt content for 24 hours and re‑fetches it before each new crawl session. The lab publicly commits to zero tolerance for violations and accepts violation reports through its GitHub repository.
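If the crawler honors robots.txt as its policy states, a site operator can restrict it with an ordinary Disallow rule. The sketch below uses Python's standard urllib.robotparser to show how such a rule would be evaluated. The robots.txt content, the /private/ path, and the example.com URLs are illustrative assumptions, and the parser shown is Python's, not necessarily the one the crawler uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep "siclab" out of /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: siclab
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate `url` for `user_agent` against the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        url = "https://example.com" + path
        print(path, allowed(ROBOTS_TXT, "siclab/1.0", url))
```

Replacing the rule with "Disallow: /" under the siclab group would ask the crawler to stay away from the whole site.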
🔍 Detection Indicators
The primary User‑Agent string is siclab/1.0 (case‑sensitive), with alternative versions such as SICLAB-Bot/1.0 observed in logs. Behavioral fingerprints include a fixed Accept: text/html header, absence of Accept-Language or Referer headers, and reverse DNS hostnames ending in .korea.ac.kr. The lab publishes a list of active crawler IPs at github.com/siclab-web/crawler-ips.
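The indicators above can be combined into a simple log or request check. The following is a minimal sketch under stated assumptions: the function name matched_indicators, the indicator labels, and the regular expression are hypothetical, and the User‑Agent pattern covers only the two forms named in the text. The reverse DNS hostname is taken as an input here; in practice it should be forward‑confirmed, since PTR records alone can be forged.

```python
import re
from typing import List, Mapping, Optional

# Case-sensitive match for the two User-Agent forms named in the text.
UA_RE = re.compile(r"^(?:siclab|SICLAB-Bot)/\d+(?:\.\d+)*$")


def matched_indicators(
    headers: Mapping[str, str], rdns: Optional[str] = None
) -> List[str]:
    """Return labels of the documented indicators that this request matches."""
    # Header names are case-insensitive; header values are compared as-is.
    h = {name.lower(): value.strip() for name, value in headers.items()}
    hits: List[str] = []
    if UA_RE.match(h.get("user-agent", "")):
        hits.append("user-agent")
    if h.get("accept") == "text/html":
        hits.append("accept")
    if "accept-language" not in h and "referer" not in h:
        hits.append("no-language-or-referer")
    if rdns and rdns.lower().rstrip(".").endswith(".korea.ac.kr"):
        hits.append("rdns")
    return hits
```

No single indicator is conclusive, since a User‑Agent string is trivially spoofed; the more indicators that match, the more plausible the attribution.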
📊 Data Usage
Collected data is used solely for academic research, including web security surveys (e.g., measuring HTTP security header deployment), privacy analysis (e.g., tracking third‑party cookie prevalence), and improving ethical crawling methods. The lab releases aggregated, anonymized datasets to other researchers under data‑use agreements. No data is sold, used for AI training, or otherwise commercialized.
⚙️ Rate Limiting Policy
Rate‑limiting is still recommended: although the crawler is legitimate and low‑volume, it can generate sustained traffic that is noticeable on smaller sites. A threshold of 100 requests per 60 seconds per IP is advised for rate‑based blocking. A crawler that honors the documented 10‑second minimum delay sends at most about 6 requests per minute to a host, so this threshold protects server resources without interfering with the crawler’s academic mission.
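One way to express a "100 requests per 60 seconds per IP" threshold is a sliding‑window counter. The class below is a minimal in‑memory sketch of that idea, not a description of any particular product's implementation; the class name, method name, and the choice to pass the timestamp explicitly are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each IP."""

    def __init__(self, limit: int = 100, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds); return False if over limit."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # rejected requests are not counted
        hits.append(now)
        return True
```

In a live server, `now` would typically come from time.monotonic(), and idle IP entries would need periodic cleanup to bound memory use.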
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.