swish-e

Bot User-Agent: swish-e

🤖 Overview

swish-e (Simple Web Indexing System for Humans – Enhanced) is an open‑source, Unix‑based web crawler and indexing engine originally developed by Kevin Hughes at Enterprise Integration Technologies and later maintained by a volunteer community. Its primary purpose is to let site administrators build full‑text search indexes of their own web content or of small‑to‑medium sized public datasets. It is not a commercial search‑engine bot but a self‑deployed indexing tool. Official documentation and source code are hosted in the GitHub repository https://github.com/swish-e/swish-e and on the legacy site http://swish-e.org (now redirecting to the GitHub archive).

🌐 Technical Behavior

swish‑e’s crawler component, invoked via the swish-e -c config.cgi command, performs a bounded, site‑specific crawl. It does not roam the internet indiscriminately. Instead, it follows links from a starting URL that the administrator configures, respecting a maximum depth (default 3) and a maximum number of pages (default 500). The crawler sends HTTP/1.1 requests with a configurable User-Agent string (typically swish-e/2.4.7 or a custom string set in the configuration) and does not add randomized delays unless the user configures them. Requests originate from the IP address of the server running the crawler; no fixed ranges apply because swish‑e is not a distributed crawler. It supports both HTTP and HTTPS and can index HTML, PDF, plain text, and XML documents, using libxml2 for HTML and XML parsing. The crawler is single‑threaded and sequential, making it non‑aggressive by default.
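The bounded crawl described above can be sketched in a few lines. This is a minimal illustration, not swish‑e's own code: the function name bounded_crawl and the in-memory links map are hypothetical stand-ins for real page fetching, and the defaults simply mirror the depth and page limits quoted above.

```python
from collections import deque


def bounded_crawl(start, links, max_depth=3, max_pages=500):
    """Visit pages breadth-first, one at a time, within depth and page limits.

    `links` is a hypothetical map of URL -> outgoing URLs on the same site;
    it stands in for fetching and parsing real pages.
    """
    visited = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure used only for illustration.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/a/1/x": ["/a/1/x/deep"],
}
print(bounded_crawl("/", site))
```

With the default depth of 3, the page at /a/1/x/deep (depth 4) is never requested, and a single queue processed in order matches the sequential, single‑threaded behavior described above.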

📋 robots.txt Compliance

swish‑e does not automatically honor robots.txt. The official documentation at http://swish-e.org/current/docs/INSTALL.html (archived) states that robots.txt handling is optional and must be explicitly enabled by setting RobotsTxt = yes in the configuration file. Without this flag, the crawler ignores robots.txt directives entirely, which can lead to unintended crawling of disallowed paths. This design is intentional because swish‑e is intended for site owners to index their own content — they are expected to control what is crawled via the URL list, not through external directives.
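For site owners who want to see what a robots.txt file would permit for a crawler that does honor it, the check can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the helper name is_allowed are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks swish-e from /private/ only.
ROBOTS_TXT = """\
User-agent: swish-e
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "swish-e/2.4.7", "https://example.com/private/report.html"))
print(is_allowed(ROBOTS_TXT, "swish-e/2.4.7", "https://example.com/docs/index.html"))
```

These rules only constrain crawlers that actually consult robots.txt, so they are not a substitute for server-side controls.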

🔍 Detection Indicators

The default User-Agent string is swish-e/{version} (e.g., swish-e/2.4.7). However, administrators can override this to any value. Additional fingerprinting indicators include the use of HTTP Accept header values common to libwww‑based tools and the absence of standard commercial bot headers like From. The crawler does not send a Referer header. In web server logs, requests from swish‑e will show a consistent request rate (often one request per second) and a missing Accept-Encoding header (unless the compiled libcurl version adds it). The GitHub repository notes that the crawler uses libwww or libcurl depending on the build, so the exact HTTP behavior may vary.
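A simple way to look for the default User-Agent pattern in access logs is a regular expression over the last quoted field of each line. The sketch below assumes logs in the common "combined" format; the sample log lines, IP addresses, and the function name find_swish_requests are hypothetical. It will not catch instances whose operator has overridden the User-Agent string.

```python
import re

# Combined Log Format: the user agent is the last double-quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<agent>[^"]*)"\s*$')
SWISH_AGENT = re.compile(r"^swish-e/(?P<version>\d+(?:\.\d+)*)", re.IGNORECASE)


def find_swish_requests(lines):
    """Return (ip, version) pairs for lines whose agent looks like swish-e/{version}."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        agent = SWISH_AGENT.match(match.group("agent"))
        if agent:
            hits.append((match.group("ip"), agent.group("version")))
    return hits


# Hypothetical log lines used only for illustration.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /docs/ HTTP/1.1" 200 512 "-" "swish-e/2.4.7"',
    '198.51.100.4 - - [10/Oct/2023:13:55:37 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(find_swish_requests(sample))
```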

📊 Data Usage

Collected data is used exclusively to build a local search index of the crawled pages. The index stores word positions, document titles, and metadata (e.g., file size, modification date) to enable fast boolean keyword searches via the swish-e command line interface or the demonstration CGI script swish.cgi. No data is sent to third parties; the index is private to the deploying organization. swish‑e is often used for intranet sites, documentation archives, and personal websites that require a lightweight, self‑hosted search engine.
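The idea of an index that stores word positions and answers boolean keyword queries can be illustrated with a small in-memory inverted index. This is a conceptual sketch only and does not reflect swish‑e's actual index format; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search_and(index, *words):
    """Return sorted ids of documents that contain every word."""
    postings = [set(index.get(word.lower(), {})) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


def search_or(index, *words):
    """Return sorted ids of documents that contain at least one word."""
    return sorted(set().union(*(index.get(word.lower(), {}) for word in words)))


# Hypothetical documents used only for illustration.
docs = {
    "intro.html": "Install the search engine",
    "faq.html": "Search the index with boolean queries",
}
index = build_index(docs)
print(search_and(index, "search", "index"))
print(search_or(index, "install", "boolean"))
```

Keeping positions alongside document ids is what makes phrase or proximity matching possible later, even though this sketch only uses them for storage.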

⚙️ Rate Limiting Policy

Because swish‑e is a self‑deployed tool, rate limiting is the responsibility of the administrator who runs it. For external web applications that receive requests from an unknown swish‑e instance, a threshold of 5 requests per minute per IP is recommended as a defensive measure against unintentionally aggressive crawling, since the default configuration does not enforce polite delays. This limit protects server resources even when the operator has not configured polite behavior, while still letting a legitimate crawl proceed at a slower pace.
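One way to enforce a per-IP threshold like the one above is a sliding-window counter. The sketch below is an assumption about how such a limit might be applied on the receiving server, not part of swish‑e; the class name SlidingWindowLimiter and the sample IP address are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_seconds` span."""

    def __init__(self, limit=5, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = SlidingWindowLimiter()
# One request per second from the same (hypothetical) address.
decisions = [limiter.allow("203.0.113.7", float(t)) for t in range(7)]
print(decisions)
```

In this sketch a crawler sending one request per second has its sixth and seventh requests rejected, and is accepted again once the oldest request leaves the 60-second window.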

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.