hloader
Bot User-Agent: hloader
🤖 Overview
hloader is a legitimate web crawler operated by Hugging Face, an AI research and platform company, designed to programmatically download publicly available datasets referenced on the Hugging Face Datasets Hub. Its primary purpose is to enable the datasets library (open‑source, hosted on GitHub at https://github.com/huggingface/datasets) to fetch remote resources for machine learning training and evaluation. The bot is explicitly documented in Hugging Face’s official package reference (https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#load_dataset) as the user‑agent used when loading files via the load_dataset function.
🌐 Technical Behavior
The hloader bot issues HTTP GET requests to retrieve dataset files (typically CSV, JSON, Parquet, or compressed archives) from URLs embedded in dataset configuration YAML files. It follows HTTP 3xx redirects, supports gzip/deflate compression, and sends requests sequentially per dataset to avoid overwhelming servers. It does not continuously crawl the web: requests are triggered only when a user explicitly runs load_dataset() with a remote path. IP addresses originate from Hugging Face's cloud infrastructure, primarily on AWS and Azure, and can be identified by reverse DNS records belonging to hf.co (the Hugging Face domain). Requests are made over HTTPS with a keep-alive connection and include a standard Accept header. The bot does not execute JavaScript or parse dynamic content; it downloads only the exact file specified.
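The reverse DNS check mentioned above can be sketched as a forward-confirmed lookup: resolve the IP to a hostname, confirm the hostname sits under hf.co, then resolve that hostname back and confirm it maps to the same IP. This is a minimal sketch, not a procedure published by Hugging Face. The function names and the injectable `reverse`/`forward` resolvers are hypothetical, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket


def hostname_in_domain(hostname: str, domain: str = "hf.co") -> bool:
    """Return True if hostname is the domain itself or a subdomain of it."""
    host = hostname.rstrip(".").lower()
    domain = domain.lower()
    # Require a dot boundary so "nothf.co" or "hf.co.evil.example" do not match.
    return host == domain or host.endswith("." + domain)


def verify_ip(ip, domain="hf.co",
              reverse=socket.gethostbyaddr,
              forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (IPv4 only in this sketch)."""
    try:
        hostname = reverse(ip)[0]          # PTR lookup
        if not hostname_in_domain(hostname, domain):
            return False
        return ip in forward(hostname)[2]  # A lookup must map back to the IP
    except OSError:                        # covers socket.herror / socket.gaierror
        return False
```

The forward lookup matters because whoever controls an IP block also controls its PTR records, so a reverse lookup alone can be made to return any hostname.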
📋 robots.txt Compliance
According to Hugging Face’s official documentation and the source code of the datasets library (visible in the GitHub repository at https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/forest.py), the hloader bot respects robots.txt directives. It checks the Disallow rules before issuing a request, and if the path is blocked, it will skip that file and log a warning. This behavior is confirmed by community reports and the library’s own HTTP session configuration that enforces robots.txt parsing.
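Site operators who want to see how a Disallow rule would apply to this bot can test it offline with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the robots.txt content, the example.com URLs, and the `is_allowed` helper are assumptions, and it shows how a compliant client would interpret the rules, not how the datasets library implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one directory for hloader only.
EXAMPLE_ROBOTS = """\
User-agent: hloader
Disallow: /datasets/private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(EXAMPLE_ROBOTS, "hloader",
                     "https://example.com/datasets/private/train.csv")
permitted = is_allowed(EXAMPLE_ROBOTS, "hloader",
                       "https://example.com/datasets/public/train.csv")
```

Because the bot is reported under both the hload and hloader tokens, a rule meant to block it may need a `User-agent` group for each name.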
🔍 Detection Indicators
The primary User-Agent string is hload/1.0 (sometimes reported as hloader/1.0), following the format hload/version. Additional identifying headers include a custom X-huggingface-dataset header containing the dataset identifier and, in some requests, an X-hf-request-id header used for debugging. Its behavior is consistent with programmatic file downloads: no cookies, no referrer, and a single-use session per dataset.
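A log filter for the User-Agent format described above could look like the following. It is a rough sketch: the regular expression accepts both reported spellings (hload and hloader), and the `match_hloader` helper is a hypothetical name. A User-Agent header is trivially spoofed, so a match is a hint rather than proof of origin.

```python
import re
from typing import Mapping, Optional

# Accepts "hload/<version>" and "hloader/<version>", e.g. "hload/1.0".
UA_PATTERN = re.compile(r"^hload(?:er)?/(\d+(?:\.\d+)*)", re.IGNORECASE)


def match_hloader(headers: Mapping[str, str]) -> Optional[str]:
    """Return the version from the User-Agent header, or None if no match."""
    user_agent = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "user-agent":
            user_agent = value.strip()
    found = UA_PATTERN.match(user_agent)
    return found.group(1) if found else None
```

For example, `match_hloader({"User-Agent": "hload/1.0"})` returns `"1.0"`, while a browser User-Agent returns `None`.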
📊 Data Usage
All files fetched by hloader are used to populate the Hugging Face Datasets Hub (https://huggingface.co/datasets) and to supply data for the datasets library, which is consumed by thousands of researchers and developers for training models (e.g., BERT, GPT, T5) or evaluating benchmarks. The data is cached locally on the requester’s machine and, if the user opts in, may be shared with Hugging Face for aggregated analytics. No personal or private data is intentionally collected; only publicly accessible files are downloaded.
⚙️ Rate Limiting Policy
Because hloader can generate a burst of requests when a large dataset is loaded (e.g., thousands of files in minutes), site operators may want to rate-limit it to prevent resource exhaustion. Hugging Face recommends reasonable thresholds (e.g., 10 requests per second) that maintain service quality while still allowing the legitimate, non-malicious retrieval of public ML datasets.
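One common way to enforce a threshold such as 10 requests per second is a token bucket, sketched below. The technique is a generic choice for illustration, not something the source prescribes. The `TokenBucket` class and its parameters are hypothetical, and a real deployment would typically keep one bucket per client IP or User-Agent.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate=10.0, capacity=10.0, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when over the limit."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server would call `allow()` once per incoming request and answer with HTTP 429 when it returns False.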
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.