Code Bot — Detection, Blocking & Technical Analysis

Code

Bot User-Agent: code

🤖 Overview

Code is a legitimate web crawler operated by GitHub: the GitHub Code Search bot (user-agent GitHub-CodeSearch). It indexes publicly available source code repositories for the GitHub Code Search feature, which lets developers find code snippets, functions, and libraries across millions of public GitHub repositories and so improves discoverability and reuse. The bot operates under GitHub’s Terms of Service and is documented in GitHub’s official documentation at https://docs.github.com/en/search-github/github-code-search.

🌐 Technical Behavior

The Code bot crawls at a controlled rate, typically a few requests per second per IP, and adjusts its request frequency dynamically based on server response times. According to GitHub’s 2024 documentation, it respects robots.txt Crawl-delay values and pauses when it receives 429 Too Many Requests responses. It uses HTTPS exclusively and connects from IPv4 addresses in GitHub’s own netblocks, which are publicly listed under AS36459 (GitHub). The crawler fetches raw source files, skips binary files and repositories larger than 5 GB to minimize load, follows 301/302 redirects, and does not crawl private or archived repositories. For efficient re-crawling it supports conditional requests using ETags and If-Modified-Since headers.
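As an illustration of the conditional re-crawling mentioned above, the following is a minimal sketch of how an origin server might decide between 200 and 304 for a request carrying If-None-Match or If-Modified-Since. The function name conditional_status and its parameters are hypothetical and not part of any GitHub tooling. The logic follows general HTTP semantics, where If-None-Match takes precedence and ETags are compared weakly.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Strip the weak-validator prefix so ETags are compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    request_headers: dict of header name -> value (hypothetical shape).
    etag: the resource's current ETag, quotes included.
    last_modified: timezone-aware datetime of the last change.
    """
    # If-None-Match takes precedence over If-Modified-Since.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        if "*" in candidates or _opaque(etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the header
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200


if __name__ == "__main__":
    modified = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(conditional_status({"If-None-Match": '"abc"'}, '"abc"', modified))
```

A server that answers 304 in these cases lets a well-behaved crawler skip re-downloading unchanged files, which reduces bandwidth on both sides.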

📋 robots.txt Compliance

GitHub explicitly states that the Code bot honors Disallow directives in robots.txt files. The official documentation at https://docs.github.com/en/search-github/github-code-search#robots confirms that site owners can block the bot by adding the following two lines to robots.txt:

User-agent: GitHub-CodeSearch
Disallow: /

Community reports indicate that the bot reduces its crawl rate after encountering Disallow rules and does not attempt to bypass them by switching to alternative user-agents. In tests, the bot was also observed to respect Crawl-delay directives, in line with standard web robot etiquette.
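Before deploying such rules, it can help to confirm that they parse the way you expect. The sketch below uses Python's standard urllib.robotparser to check a Disallow rule and a Crawl-delay rule against the GitHub-CodeSearch user-agent. The helper names, the example.com URL, the ExampleBot agent, and the 10-second delay are illustrative assumptions. The result shows how a standards-following parser reads the file, not how the bot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Rule set that blocks the bot entirely (directives as quoted in the text).
BLOCK_TXT = """\
User-agent: GitHub-CodeSearch
Disallow: /
"""

# Alternative rule set that only throttles it; the value 10 is an assumption.
THROTTLE_TXT = """\
User-agent: GitHub-CodeSearch
Crawl-delay: 10
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


url = "https://example.com/src/main.py"  # hypothetical URL
blocked = is_allowed(BLOCK_TXT, "GitHub-CodeSearch/1.0", url)
other = is_allowed(BLOCK_TXT, "ExampleBot/2.0", url)  # hypothetical other bot
delay = crawl_delay_for(THROTTLE_TXT, "GitHub-CodeSearch/1.0")

if __name__ == "__main__":
    print("GitHub-CodeSearch allowed:", blocked)
    print("ExampleBot allowed:", other)
    print("Crawl-delay for GitHub-CodeSearch:", delay)
```

Because the rule names only GitHub-CodeSearch, other crawlers remain unaffected unless the file also contains a matching or wildcard entry for them.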

🔍 Detection Indicators

The primary user-agent string is GitHub-CodeSearch/1.0, with newer versions identifying as GitHub-CodeSearch/1.1. The bot includes a From header with the address [email protected] for identification. It does not use common reverse DNS patterns but originates from IP ranges registered to GitHub, Inc. under AS36459. Behavioral fingerprints include a consistent pattern of requesting only source code file extensions (.py, .js, .json, .md, etc.) and avoiding large binary files. The bot also sends an Accept: application/json header when indexing API endpoints, which distinguishes it from general-purpose crawlers.
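The indicators above can be combined into a simple log classifier. The sketch below is a minimal illustration, not an official detection method. The function name classify_request and the labels it returns are hypothetical, and ASSUMED_NETBLOCKS holds a placeholder documentation range (192.0.2.0/24) that must be replaced with the ranges actually announced under AS36459 before real use.

```python
import ipaddress
import re

# Matches the user-agent strings described in the text, e.g. GitHub-CodeSearch/1.0.
UA_PATTERN = re.compile(r"^GitHub-CodeSearch/\d+\.\d+")

# PLACEHOLDER: an RFC 5737 documentation range, not a real GitHub netblock.
ASSUMED_NETBLOCKS = [ipaddress.ip_network("192.0.2.0/24")]


def classify_request(user_agent, remote_ip, netblocks=ASSUMED_NETBLOCKS):
    """Classify a request as 'verified', 'unverified', or 'other'.

    'verified'   : user-agent matches and the IP is inside a trusted netblock.
    'unverified' : user-agent matches but the IP is outside them (possible spoofing).
    'other'      : user-agent does not claim to be the bot.
    """
    ua_matches = bool(UA_PATTERN.match(user_agent or ""))
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        addr = None  # malformed address: treat as outside every netblock
    ip_matches = addr is not None and any(addr in net for net in netblocks)

    if ua_matches and ip_matches:
        return "verified"
    if ua_matches:
        return "unverified"
    return "other"


if __name__ == "__main__":
    print(classify_request("GitHub-CodeSearch/1.0", "192.0.2.15"))
    print(classify_request("GitHub-CodeSearch/1.1", "203.0.113.9"))
```

Checking the source IP as well as the user-agent matters because a user-agent string alone is trivial to forge.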

📊 Data Usage

The collected data is used exclusively to populate GitHub Code Search, a search engine for public source code. It is not used for AI training, advertising, or analytics. GitHub states that the index is refreshed periodically to reflect new commits, forks, and newly public repositories. The data does not leave GitHub’s infrastructure and is accessible only through GitHub’s search interface. There is no evidence that the bot stores raw source files beyond indexing purposes, and GitHub’s privacy policy limits usage to search functionality.

⚙️ Rate Limiting Policy

Code is commonly rate-limited because its high-volume indexing of code repositories, though legitimate, can generate significant load on smaller or self-hosted repository servers. Site administrators block or throttle it through robots.txt or server-level rate limits to protect server resources, a standard and recommended practice for high-volume crawlers. GitHub itself imposes a maximum of 10 requests per second per IP on its own infrastructure, and third-party sites may set lower thresholds based on capacity.
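One common way to enforce a per-IP threshold at the server level is a token bucket. The sketch below is a minimal in-memory illustration under stated assumptions: the class names TokenBucket and PerIpLimiter are hypothetical, the default of 10 requests per second mirrors the figure quoted above, and a production setup would more likely rely on the rate-limiting features of the web server or a shared store.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so short bursts are allowed
        self.updated = clock()

    def allow(self):
        """Consume one token if available; return whether the request may pass."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerIpLimiter:
    """Keeps one bucket per client IP (in memory; entries are never evicted)."""

    def __init__(self, rate=10, capacity=10, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, ip):
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[ip] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = PerIpLimiter(rate=10, capacity=10)
    results = [limiter.allow("192.0.2.15") for _ in range(12)]
    print(results.count(True), "allowed,", results.count(False), "rejected")
```

A request rejected by allow() would typically be answered with 429 Too Many Requests, which, according to the behavior described earlier, causes the bot to pause.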

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
