mediawords Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: mediawords

🤖 Overview

MediaWords is a web crawler operated by the Media Cloud project, a joint initiative of the Berkman Klein Center for Internet & Society at Harvard University and the Center for Civic Media at MIT. Its primary purpose is to systematically collect news articles, blog posts, and other online media content for academic research on media ecosystems, framing, and information diffusion. The data feeds into the Media Cloud platform, which provides analytical tools for journalists, scholars, and the public. The crawler has been active since the early 2010s and is described in the project's official documentation at mediacloud.org and in the associated GitHub repository, github.com/mediacloud/media-cloud.

🌐 Technical Behavior

The MediaWords crawler fetches RSS/Atom feeds from a curated list of thousands of news sources, then downloads the full HTML of each linked article. It throttles requests per domain with configurable delays, typically 10–30 seconds between requests to the same host, and is reported to cap concurrency at 2–5 connections per domain. The crawler's IP ranges are not publicly documented in a single WHOIS block, but traffic originates from Harvard University and MIT IP allocations (e.g., 128.103.x.x, 18.9.x.x) and occasionally from cloud providers. It speaks HTTP/1.1 with keep-alive, sends a User-Agent of MediaWords/1.0, and includes a From header with a contact email ([email protected]). The From header was introduced in RFC 1945, and the current HTTP specification (RFC 9110) recommends it for robotic user agents. The crawler does not visit pages outside those linked in feeds unless it is explicitly configured to follow internal links for article extraction.
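The per-host delay described above can be illustrated with a small bookkeeping class. This is a minimal sketch for illustration only, not the Media Cloud implementation: the class name `PerHostDelay`, its methods, and the 10-second default are assumptions chosen to match the lower end of the delay range quoted in this section.

```python
from urllib.parse import urlsplit


class PerHostDelay:
    """Hypothetical helper: tracks when each host may next be fetched."""

    def __init__(self, delay_seconds: float = 10.0) -> None:
        # Assumed default; the section above quotes 10-30 seconds.
        self.delay_seconds = delay_seconds
        self._next_allowed: dict = {}

    @staticmethod
    def _host(url: str) -> str:
        # Group URLs by hostname so the delay applies per host, not per URL.
        return urlsplit(url).hostname or ""

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before `url` may be fetched (0.0 if ready)."""
        next_allowed = self._next_allowed.get(self._host(url), 0.0)
        return max(0.0, next_allowed - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that `url` was fetched at time `now`."""
        self._next_allowed[self._host(url)] = now + self.delay_seconds
```

A caller would check `wait_time()` before each request and call `record_fetch()` afterwards. Passing `now` explicitly (for example from `time.monotonic()`) keeps the sketch easy to test.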

📋 robots.txt Compliance

MediaWords fully honors robots.txt Disallow directives. The official Media Cloud documentation explicitly states that the crawler checks robots.txt before every fetch and respects Crawl-delay directives. If a site blocks the crawler via robots.txt, MediaWords will not revisit that domain for at least 30 days. This policy is implemented in the open-source Python codebase available at github.com/mediacloud/media-cloud, in the crawler/robots.py module.
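Before deploying a rule that blocks the crawler, a site operator may want to check how it will be interpreted. The sketch below uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt. The sample rules, the `example.com` URLs, and the helper name `is_allowed` are assumptions for illustration; the `mediawords` token is the user-agent name listed at the top of this page.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt (an assumption for this example): block mediawords
# everywhere, and block all other agents from /private/ only.
ROBOTS_TXT = """\
User-agent: mediawords
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "MediaWords/1.0", "https://example.com/news/story"))
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", "https://example.com/news/story"))
```

Note that this only shows how a standards-following parser reads the rules; it does not verify how any particular crawler behaves in practice.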

🔍 Detection Indicators

The primary detection fingerprint is the User-Agent string: MediaWords/1.0 (or variants such as MediaWords/0.1). The crawler sends no custom headers beyond standard HTTP ones. It often sends a Referer header equal to the feed URL that led to the page, and its Accept header is text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8. Behavioral fingerprints include GET-only requests (no POST or HEAD) and no JavaScript execution. The crawler does not set cookies or attempt session management.
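The indicators above can be combined into a simple server-side check. The following is a rough sketch under the assumptions stated in this section: the function names `claims_mediawords` and `matches_described_profile` are hypothetical, and a User-Agent string is trivially spoofable, so a match should be treated as a hint rather than proof of origin.

```python
import re
from typing import Mapping

# Matches "MediaWords/1.0", "mediawords/0.1", or a bare "mediawords" token.
UA_PATTERN = re.compile(r"\bmediawords\b", re.IGNORECASE)


def _normalize(headers: Mapping[str, str]) -> dict:
    # HTTP header names are case-insensitive, so compare in lowercase.
    return {name.lower(): value for name, value in headers.items()}


def claims_mediawords(headers: Mapping[str, str]) -> bool:
    """True if the User-Agent header identifies itself as MediaWords."""
    user_agent = _normalize(headers).get("user-agent", "")
    return UA_PATTERN.search(user_agent) is not None


def matches_described_profile(method: str, headers: Mapping[str, str]) -> bool:
    """True if the request also fits the behavior described above:
    a GET request that carries no Cookie header."""
    normalized = _normalize(headers)
    return (
        claims_mediawords(headers)
        and method.upper() == "GET"
        and "cookie" not in normalized
    )
```

A request that claims to be MediaWords but fails `matches_described_profile` (for example a POST, or one carrying cookies) is inconsistent with the behavior described here and may deserve closer inspection.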

📊 Data Usage

Collected data is used exclusively for academic media monitoring and research. Media Cloud lets users query the archive via an API to analyze topics, sources, and linguistic patterns. The data is not used for AI training or commercial search indexing. It is stored in a PostgreSQL database and served through a Django-based web interface. All collected content is treated as public domain or used under fair use for non-profit educational purposes, as stated in the project's Terms of Use at mediacloud.org/tos.

⚙️ Rate Limiting Policy

Because MediaWords can make thousands of requests per day across many domains, rate limiting is applied to prevent resource exhaustion on any single server. A threshold-based blocking policy (e.g., more than 500 requests per hour from a single IP to one host) is applied as a standard good-neighbor practice, consistent with the project's own guidelines for ethical crawling documented on its GitHub wiki.
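A threshold of this kind is commonly enforced with a sliding-window counter keyed by client IP and target host. The sketch below is one possible approach, not a description of any specific product's internals: the class name `SlidingWindowLimiter` and its parameters are assumptions, with defaults taken from the 500-requests-per-hour example above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: allows at most `max_requests` per
    (client IP, host) pair within any `window_seconds` interval."""

    def __init__(self, max_requests: int = 500, window_seconds: float = 3600.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of allowed requests, oldest first, per (ip, host) key.
        self._hits = defaultdict(deque)

    def allow(self, client_ip: str, host: str, now: float) -> bool:
        hits = self._hits[(client_ip, host)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the threshold: block this request.
        hits.append(now)
        return True
```

Blocked requests are not recorded, so a client that backs off regains access as soon as its older requests leave the window. In production, `now` would typically come from `time.monotonic()`.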

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
