blog conversation project Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: blog-conversation-project

🤖 Overview

blog conversation project is a web crawler operated by the Blog Conversation Project research consortium, a joint initiative between the University of Cambridge and the Stanford Computational Social Science Lab, first publicly documented in a 2023 preprint (arXiv:2304.15892) titled "Mining Conversational Structures from Blog Networks." Its primary purpose is to systematically collect threaded comment sections and blog-to-blog reply patterns across the open web to build a large-scale corpus for studying online discourse, reply dynamics, and conversational graph structures. The data feeds into the group's open‑source "Conversation Graph Toolkit" (hosted on GitHub at github.com/blog-conv-project/cgt) and is used exclusively for academic research under a Creative Commons Non‑Commercial license.

🌐 Technical Behavior

The crawler applies a configurable politeness delay, typically one request every 12 seconds per domain, and uses HTTP/1.1 with Connection: Keep-Alive to minimize connection overhead. Requests are issued from two IP ranges managed by the project: 192.0.2.128/25 and a smaller block, 198.51.100.64/26. Both fall inside address space that RFC 5737 reserves for documentation use (TEST-NET-1 and TEST-NET-2 respectively). The crawler inspects pages served as Content-Type: text/html and follows links even when they carry rel="nofollow" or rel="noopener", but it does not parse JavaScript-rendered content. It identifies itself through the User-Agent header and a custom request header, X-Blog-Conversation-Project: true, which helps administrators identify it in logs. Despite the average delay, the request pattern includes periodic bursts of five to seven successive requests from the same IP, followed by a 60-second pause; this behavior is documented on the project's "crawler etiquette" page (convproject.org/crawling/behavior). The crawler honors If-Modified-Since and ETag headers to reduce server load on unchanged resources, and it periodically sends a HEAD request before a GET to check availability.
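As a rough illustration of how the ranges quoted above could be matched in log processing, here is a minimal Python sketch using only the standard library. The function name `is_from_crawler_range` is hypothetical, and the two networks are copied from the text rather than verified independently; because both sit in documentation address space, a match on real traffic should be treated with caution.

```python
import ipaddress

# Ranges as quoted in the text above (assumption: not independently verified).
# Both lie inside RFC 5737 documentation blocks.
CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.128/25"),
    ipaddress.ip_network("198.51.100.64/26"),
]


def is_from_crawler_range(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed addresses (e.g. a broken log field) never match.
        return False
    return any(addr in net for net in CRAWLER_NETWORKS)


if __name__ == "__main__":
    for candidate in ("192.0.2.130", "192.0.2.5", "198.51.100.100"):
        print(candidate, is_from_crawler_range(candidate))
```

An IP match alone is weak evidence, so it is probably best combined with the header checks described under Detection Indicators.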

📋 robots.txt Compliance

According to the project's published documentation (convproject.org/robots), the blog conversation project crawler fully honors Disallow directives in robots.txt and neither caches nor revisits blocked pages. It also respects Crawl-Delay directives, adjusting its request interval to the specified value up to a maximum of 30 seconds; larger values are capped at 30 seconds. The project maintains a public list of domains that have requested exclusion and re-checks each robots.txt file every 24 hours to pick up changes.
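The sketch below shows one way to sanity-check a robots.txt policy against the behavior described above, using Python's standard `urllib.robotparser`. The robots.txt token `BlogConvProject` is an assumption taken from the User-Agent string listed on this page, the sample rules and the `/comments/` path are purely illustrative, and the 30-second cap is modeled from the description, not from the crawler's code.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only. "BlogConvProject" as the robots.txt token is an
# assumption based on the User-Agent string quoted on this page.
ROBOTS_TXT = """\
User-agent: BlogConvProject
Disallow: /comments/
Crawl-delay: 45

User-agent: *
Disallow:
"""

# Cap described in the text above (assumed, not verified against the crawler).
MAX_HONORED_DELAY = 30


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def effective_delay(parser: RobotFileParser, agent: str):
    """Return the delay the crawler would apply, capped as described."""
    requested = parser.crawl_delay(agent)
    if requested is None:
        return None
    return min(requested, MAX_HONORED_DELAY)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    print(rules.can_fetch("BlogConvProject", "https://example.com/comments/1"))
    print(rules.can_fetch("BlogConvProject", "https://example.com/posts/1"))
    print(effective_delay(rules, "BlogConvProject"))
```

With these sample rules, a requested delay of 45 seconds would be reduced to 30 under the documented cap, so sites that need a longer interval may have to rely on server-side rate limiting instead.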

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; BlogConvProject/2.1; +https://convproject.org/crawler), with a secondary string, BlogConvProject/2.1, used for non-HTML resources (images, stylesheets). A distinctive fingerprint is the X-Blog-Conversation-Project header set to true, together with a From header containing [email protected]. The crawler also sends an Accept: text/html,application/xhtml+xml header and never transmits cookies or session identifiers.
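The indicators above can be combined into a simple header check. The following is a minimal sketch, assuming request headers are available as a plain mapping; the function name `classify_request` and the three result labels are hypothetical, and any of these headers can be spoofed, so the result is a hint, not proof.

```python
from typing import Mapping

# Indicators taken from the text above.
UA_TOKEN = "BlogConvProject"
MARKER_HEADER = "x-blog-conversation-project"


def classify_request(headers: Mapping[str, str]) -> str:
    """Return 'likely', 'possible', or 'no-match' for a set of request headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}

    ua_match = UA_TOKEN.lower() in normalized.get("user-agent", "").lower()
    marker_match = normalized.get(MARKER_HEADER, "").lower() == "true"
    has_cookie = "cookie" in normalized  # the crawler reportedly never sends cookies

    if ua_match and marker_match and not has_cookie:
        return "likely"
    if ua_match or marker_match:
        return "possible"
    return "no-match"


if __name__ == "__main__":
    sample = {
        "User-Agent": "Mozilla/5.0 (compatible; BlogConvProject/2.1; "
                      "+https://convproject.org/crawler)",
        "X-Blog-Conversation-Project": "true",
        "Accept": "text/html,application/xhtml+xml",
    }
    print(classify_request(sample))
```

Requiring both the User-Agent token and the custom header for the strongest label keeps a single spoofed field from producing a confident match.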

📊 Data Usage

Collected data—including full comment threads, timestamps, author usernames (anonymized before release), and inter‑blog link networks—is used to train conversational graph neural networks and to build public datasets for community detection algorithms. The project releases semi‑annual snapshots of the corpus under a CC‑BY‑NC license via the Open Science Framework (osf.io/7xy3z). No data is used for commercial purposes, AI model training outside academic research, or advertising.

⚙️ Rate Limiting Policy

Rate limiting is appropriate because the crawler can send bursts of requests that may temporarily spike a server's connection pool, even though it respects delays. A threshold‑based rate limit (e.g., 10 requests in 20 seconds from the same IP) is recommended to protect against unintentional resource exhaustion while still allowing the research crawler to collect data.
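A threshold of this kind can be sketched as a per-IP sliding-window counter. The class below is a minimal in-memory illustration in Python, assuming the example values from the text (10 requests per 20 seconds); the class name `SlidingWindowLimiter` is hypothetical, and a production setup would more likely use the rate-limiting features of the web server or a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 20.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        """Record a request at time `now` (seconds) and say whether to serve it."""
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # A burst of seven requests, as described for the crawler, stays under the limit.
    burst = [limiter.allow("192.0.2.130", i * 0.1) for i in range(7)]
    print(burst)
```

Under these assumed values, the documented burst of five to seven requests followed by a 60-second pause would pass, while a client sending more than ten requests within 20 seconds would be throttled.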

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.