nwspider
Crawler User-Agent: nwspider
🤖 Overview
nwspider is a legitimate web crawler operated by NewsWhip, an Irish social media monitoring and analytics company headquartered in Dublin. First publicly documented around 2015, its primary purpose is to collect publicly available news articles, blog posts, and online content to feed into NewsWhip’s Spike platform, which provides real-time content discovery, trend analysis, and social media influence scoring for publishers, brands, and PR professionals. NewsWhip’s crawler is designed to identify and index breaking news and viral content, enabling the platform’s predictive analytics capabilities.
🌐 Technical Behavior
nwspider typically requests pages over HTTP/1.1 using standard GET requests and identifies itself with the User-Agent string nwspider/1.0 (sometimes with a different version suffix). According to NewsWhip’s official documentation (available at newswhip.com/our-bots/), the crawler supports both HTTP/1.0 and HTTP/1.1, and it sends a User-Agent header for identification but no From header. It follows robots.txt Crawl-delay instructions when present and may also respect robots meta tags such as noindex and nofollow. Crawl frequency varies by site and can be aggressive, often reaching multiple requests per minute on high-volume news domains, though NewsWhip states that it honors standard rate-limiting directives. NewsWhip does not publish detailed IP ranges. The bot originates from a pool of Amazon Web Services (AWS) EC2 instances, primarily in the US-East and EU-West regions, and network logs shared in security forums indicate sporadic use of other cloud providers.
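To check how often the crawler actually visits a given site, the request rate can be measured from the server's access log. The following is a minimal sketch, assuming the log uses the common "combined" format. The sample log lines, IP addresses, and the function name `requests_per_minute` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Combined Log Format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_per_minute(lines, agent_token="nwspider"):
    """Count requests per minute for user agents containing agent_token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or agent_token not in match.group("agent").lower():
            continue
        stamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


# Hypothetical sample data; the addresses are reserved documentation ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:02 +0000] "GET /news/a HTTP/1.1" 200 5120 "-" "nwspider/1.0"',
    '203.0.113.7 - - [10/Oct/2023:13:55:09 +0000] "GET /news/b HTTP/1.1" 200 4801 "-" "nwspider/1.0"',
    '198.51.100.4 - - [10/Oct/2023:13:55:30 +0000] "GET / HTTP/1.1" 200 900 "https://example.com/" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /news/c HTTP/1.1" 200 3300 "-" "nwspider/1.0"',
]

for minute, hits in sorted(requests_per_minute(SAMPLE_LOG).items()):
    print(minute, hits)
# 2023-10-10 13:55 2
# 2023-10-10 13:56 1
```

Bucketing by minute matches the "requests per minute" framing above. A finer or coarser bucket only requires changing the `strftime` format.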
📋 robots.txt Compliance
NewsWhip’s official policy states that nwspider fully obeys robots.txt directives, including Disallow and Crawl-delay rules, as confirmed in their public bot description page at newswhip.com/robots.txt. The company also provides a /nwspider path on its website where webmasters can request temporary blocks or custom crawl limits, which indicates a documented willingness to cooperate with site owners. Some webmasters have nevertheless reported occasional aggressive crawling after rule updates. NewsWhip attributes this to configuration propagation delays rather than intentional non-compliance.
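Before relying on the crawler to honor a rule set, it can help to confirm that the rules parse the way you intend. The sketch below uses Python's standard `urllib.robotparser` to evaluate a sample robots.txt from the crawler's point of view. The rule set, the paths, and the example.com URLs are assumptions for illustration, not NewsWhip's recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow nwspider down and keep it out of /private/,
# while leaving all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: nwspider
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("nwspider/1.0", "https://example.com/private/report"))  # False
print(rules.can_fetch("nwspider/1.0", "https://example.com/news/story"))      # True
print(rules.crawl_delay("nwspider/1.0"))                                      # 10
```

This only shows how a standards-following parser reads the file. Whether the crawler applies the same interpretation, and how quickly it picks up changes, still has to be verified in the access logs.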
🔍 Detection Indicators
The primary detection indicator is the User-Agent string: nwspider/1.0 (or, less commonly, nwspider/2.0 in newer versions). Behavioral fingerprints include many requests that do not advertise compression support through an Accept-Encoding header, and a tendency to fetch pages in bursts separated by pauses of 2–10 seconds. The bot sends a standard Accept header (text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) and omits typical browser headers such as Referer and Cookie, which helps distinguish it from human traffic. It uses no custom X- headers. The most reliable fingerprint is the User-Agent string combined with an AWS source IP address.
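The indicators above can be combined into a simple header check. The following is a minimal sketch, not a production detector: the function names are hypothetical, and `EXAMPLE_CLOUD_RANGES` is a placeholder built from a reserved documentation range rather than real AWS address space. A real deployment would need an actual, current list of AWS ranges.

```python
import ipaddress

# Placeholder CIDR list (assumption): a documentation range, NOT real AWS ranges.
EXAMPLE_CLOUD_RANGES = [ipaddress.ip_network("203.0.113.0/24")]


def collect_signals(headers, remote_ip, cloud_ranges=EXAMPLE_CLOUD_RANGES):
    """Return the individual indicators described in this section as booleans."""
    normalized = {name.lower(): value for name, value in headers.items()}
    agent = normalized.get("user-agent", "").lower()
    address = ipaddress.ip_address(remote_ip)
    return {
        "user_agent": agent.startswith("nwspider"),
        "no_referer": "referer" not in normalized,
        "no_cookie": "cookie" not in normalized,
        "no_accept_encoding": "accept-encoding" not in normalized,
        "cloud_ip": any(address in network for network in cloud_ranges),
    }


def looks_like_nwspider(headers, remote_ip, cloud_ranges=EXAMPLE_CLOUD_RANGES):
    """Apply the strongest fingerprint: User-Agent match plus a cloud source IP."""
    signals = collect_signals(headers, remote_ip, cloud_ranges)
    return signals["user_agent"] and signals["cloud_ip"]


request_headers = {
    "User-Agent": "nwspider/1.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
print(looks_like_nwspider(request_headers, "203.0.113.9"))   # True
print(looks_like_nwspider(request_headers, "198.51.100.1"))  # False
```

The weaker signals (missing Referer, Cookie, and Accept-Encoding) are returned separately rather than used in the decision, since many other clients omit those headers as well. A User-Agent match from an unexpected address may indicate a spoofed request.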
📊 Data Usage
Collected content feeds NewsWhip’s content discovery and analytics platform, which processes crawled articles to generate real-time engagement metrics, trending topic alerts, and social media influence scores. The data is not redistributed raw. Instead, it is aggregated and anonymized within NewsWhip’s proprietary algorithms, notably the Spike prediction engine, to help customers identify which stories are gaining traction on social networks. NewsWhip does not use the data for general-purpose AI training, nor does it sell the crawled text; its business model centers on analytics and trend visualisation.
⚙️ Rate Limiting Policy
Rate limiting is applied to nwspider because its crawl frequency can spike during high-volume news events, which may degrade server performance on smaller sites. The policy is threshold-based. Site owners are encouraged to set a Crawl-delay in robots.txt (for example, 5–10 seconds), which NewsWhip honors. If no delay is specified, the recommended fallback is IP-based throttling at the web server level, triggered when the nwspider User-Agent is detected. This prevents resource exhaustion without outright blocking a legitimate bot.
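The fallback described above, throttling per IP once the User-Agent is detected, can be sketched as a minimum-interval check. This is an illustrative example under stated assumptions: the class name `CrawlerThrottle`, the 5-second default, and the injectable clock are all hypothetical, and a real server would more likely apply the limit in its web server or proxy configuration.

```python
import time


class CrawlerThrottle:
    """Allow at most one request per min_interval_seconds per IP for a given crawler."""

    def __init__(self, min_interval_seconds=5.0, agent_token="nwspider",
                 clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.agent_token = agent_token
        self.clock = clock            # injectable so the logic can be tested
        self._last_allowed = {}       # remote IP -> time of last allowed request

    def allow(self, user_agent, remote_ip):
        """Return True if the request may proceed, False if it should be throttled."""
        if self.agent_token not in user_agent.lower():
            return True               # only the targeted crawler is throttled
        now = self.clock()
        last = self._last_allowed.get(remote_ip)
        if last is not None and now - last < self.min_interval:
            return False              # too soon: caller could answer with HTTP 429
        self._last_allowed[remote_ip] = now
        return True


throttle = CrawlerThrottle(min_interval_seconds=5.0)
print(throttle.allow("nwspider/1.0", "203.0.113.7"))  # True (first request)
print(throttle.allow("nwspider/1.0", "203.0.113.7"))  # False (inside the 5 s window)
print(throttle.allow("Mozilla/5.0", "203.0.113.7"))   # True (not the crawler)
```

Rejected requests do not reset the timer, so a crawler that keeps retrying is still admitted once per interval rather than being locked out indefinitely, which matches the goal of slowing the bot without blocking it.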
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.