ufam-crawler

Crawler User-Agent: ufam-crawler

🤖 Overview

The ufam-crawler is a legitimate academic web crawler operated by the Federal University of Amazonas (UFAM) in Brazil, specifically developed by the Laboratory of Intelligent Systems and Web (LSIWeb) within the Institute of Computing. Its primary purpose is to collect publicly available web content for academic research, including natural language processing (NLP) corpora for Brazilian Portuguese, web archiving experiments, and large-scale web graph analysis. The crawler feeds data into internal UFAM research projects and is not associated with any commercial search engine or AI training product.

🌐 Technical Behavior

The ufam-crawler runs on a custom Python-based scraping engine that supports both breadth-first and focused crawling. According to published documentation from UFAM’s LSIWeb group (available on the university’s wiki at http://lsiweb.icomp.ufam.edu.br/ufam-crawler/), the crawler waits 500 milliseconds between page requests by default (about 2 requests per second) and limits itself to 5 concurrent connections. It crawls primarily over IPv4, originating from IP ranges assigned to the Brazilian academic backbone (RNP), such as 200.129.0.0/16 and 201.51.0.0/18. The bot follows HTTP redirects (up to 3 hops) and parses robots.txt and sitemaps before starting a crawl session. It does NOT execute JavaScript.
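As a rough illustration of how a site operator might use the IP ranges quoted above, the sketch below checks whether a client address falls inside them. The function name and the constant are hypothetical, and the two ranges are only the examples given in this entry, not an authoritative or complete list.

```python
import ipaddress

# Example ranges quoted in this entry; assumed to be illustrative, not exhaustive.
UFAM_CRAWLER_NETWORKS = [
    ipaddress.ip_network("200.129.0.0/16"),
    ipaddress.ip_network("201.51.0.0/18"),
]


def is_from_listed_range(client_ip: str) -> bool:
    """Return True if client_ip parses as an IP inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Not a valid IPv4/IPv6 literal at all.
        return False
    # An IPv6 address is never "in" an IPv4 network, so this is False for IPv6.
    return any(addr in net for net in UFAM_CRAWLER_NETWORKS)


if __name__ == "__main__":
    for ip in ("200.129.10.5", "201.51.64.0", "203.0.113.7"):
        print(ip, is_from_listed_range(ip))
```

A source address inside these ranges is only a supporting signal. It would normally be combined with the User-Agent and header checks rather than trusted on its own.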

📋 robots.txt Compliance

Based on the crawler’s official configuration files hosted on UFAM’s GitLab instance (gitlab.icomp.ufam.edu.br/lsiweb/ufam-crawler), the ufam-crawler fully honors Disallow directives from robots.txt. The crawler’s policy document explicitly states that it checks robots.txt before every crawl and will skip any URL listed in disallowed paths. There are no known incidents of the crawler ignoring exclusion rules.
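Because the crawler is reported to honor Disallow directives, a robots.txt rule is the simplest way to keep it out of specific paths. The sketch below uses Python's standard urllib.robotparser to confirm that a sample rule set blocks the ufam-crawler token from a path while leaving other agents alone. The /private/ path, the example.com URLs, and the is_allowed helper are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block ufam-crawler from /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: ufam-crawler
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots_txt for the given user-agent token and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ufam-crawler", "https://example.com/private/report.html"))
    print(is_allowed(ROBOTS_TXT, "ufam-crawler", "https://example.com/public/index.html"))
```

The check passes the short product token (ufam-crawler) rather than the full User-Agent header, since robots.txt groups are matched against the token.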

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; ufam-crawler/1.0; +http://lsiweb.icomp.ufam.edu.br/ufam-crawler). Secondary variants include user agents with version suffixes like ufam-crawler/1.1. The bot also sets a custom HTTP header X-ufam-Crawl-ID with a UUID for traceability. Behavioral fingerprints include a consistent crawl rate of 2 requests per second (configurable), no JavaScript rendering, and a preference for text/html, application/pdf, and application/xml content types.
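The indicators above can be combined into a simple request check. The following is a minimal sketch, assuming the User-Agent format and the X-ufam-Crawl-ID header behave as described in this entry. The function names and the regular expression are hypothetical and not taken from the crawler's own code.

```python
import re
import uuid
from typing import Mapping, Optional

# Matches "ufam-crawler/<version>" anywhere in a User-Agent string.
UA_PATTERN = re.compile(r"\bufam-crawler/(\d+(?:\.\d+)*)", re.IGNORECASE)


def crawler_version(user_agent: str) -> Optional[str]:
    """Return the advertised ufam-crawler version, or None if the UA does not match."""
    match = UA_PATTERN.search(user_agent)
    return match.group(1) if match else None


def has_valid_crawl_id(headers: Mapping[str, str]) -> bool:
    """Return True if an X-ufam-Crawl-ID header is present and parses as a UUID."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-ufam-crawl-id")
    if value is None:
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; ufam-crawler/1.0; +http://lsiweb.icomp.ufam.edu.br/ufam-crawler)"
    print(crawler_version(ua))
    print(has_valid_crawl_id({"X-ufam-Crawl-ID": str(uuid.uuid4())}))
```

User-Agent strings and custom headers are trivially spoofed, so a match here indicates a request that claims to be the ufam-crawler, not proof that it is.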

📊 Data Usage

Collected data is used exclusively for non-commercial academic purposes, such as training Brazilian Portuguese language models, studying web structure evolution, and building benchmark datasets for information retrieval research. The UFAM LSIWeb group publishes raw crawl logs and aggregated statistics on their official website, and has released anonymized versions of certain corpora under Creative Commons licenses. Apart from these public releases, data is stored on UFAM’s internal servers and is not shared with third parties.

⚙️ Rate Limiting Policy

Although the ufam-crawler is legitimate and obeys standard crawl policies, it is still rate-limited in practice. Its academic crawl campaigns can be aggressive, sometimes spanning millions of URLs in a single week, and can cause elevated load on origin servers. Threshold-based blocking (e.g., 25 requests in 10 seconds, or 2.5 requests per second) sits just above the crawler’s default rate of 2 requests per second, so it provides a safety buffer against resource exhaustion while still allowing the university’s research to proceed.
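One way to express a threshold such as 25 requests in 10 seconds is a sliding-window counter per client. The sketch below is an illustrative in-memory version, not the mechanism any particular product uses. The class name, the client_key argument, and the choice of a sliding window (rather than a fixed window or token bucket) are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int = 25, window_seconds: float = 10.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_key: str, now: float) -> bool:
        """Record a request at time `now` (seconds) and report whether it is allowed."""
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: block without recording.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # A client at the documented default pace of one request every 0.5 s.
    steady = [limiter.allow("steady", i * 0.5) for i in range(120)]
    # A client sending 30 requests at the same instant.
    burst = [limiter.allow("burst", 0.0) for _ in range(30)]
    print(all(steady), burst.count(True))
```

Under these assumptions, a crawler holding the documented pace of 2 requests per second never accumulates 25 hits in a 10-second window, so only bursts above that pace are blocked.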

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.