Extension:CrawlerProtection

| | |
|---|---|
| Release status | stable |
| Implementation | Hook |
| Description | Anti-crawler suite for MediaWiki |
| Author(s) | Jeffrey Wang (MyWikis-JeffreyWang) |
| Latest version | 1.4.0 |
| Compatibility policy | Main branch maintains backward compatibility. |
| MediaWiki | 1.39.4+ |
| Database changes | No |
| Licence | MIT License |
| Download | GitHub |
The CrawlerProtection extension blocks anonymous users from performing actions or accessing special pages most frequently abused by AI crawlers. It aims to prevent AI crawlers from accessing pages that are expensive to render, thereby reducing excessive resource consumption.
CrawlerProtection provides excellent, MediaWiki-specific protection at OSI Layer 7 (the Application Layer). To have a real chance at comprehensively protecting a wiki, special attention should be paid to protecting OSI Layers 3 and 4. Therefore, CrawlerProtection should always be used with another lower-level technology, such as a reverse proxy or WAF, which is capable of fending off malicious bot traffic.
CrawlerProtection was introduced at MediaWiki Users and Developers Workshop Spring 2025. The video recording of the introduction can be found on YouTube.
Handled features
By default, the following wiki features are disabled for anonymous users by CrawlerProtection:
- Page diffs and page history (?type=revision, ?action=history, ?diff=1234, ?oldid=1234)
- Special:MobileDiff
- Special:RecentChangesLinked
- Special:WhatLinksHere
See below for variables which can be customized to add more special pages and actions to be disabled for anonymous users.
Installation
- Download and place the file(s) in a directory called CrawlerProtection in your extensions/ folder.
- Add the following code at the bottom of your LocalSettings.php file:
wfLoadExtension( 'CrawlerProtection' );
- Configure as required.
Done – Navigate to Special:Version on your wiki to verify that the extension is successfully installed.
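Putting the installation steps together, a minimal LocalSettings.php setup might look like the following sketch. The configuration values shown are the documented defaults, included purely for illustration; omitting them entirely gives the same behavior:

```php
// Load the extension from extensions/CrawlerProtection
wfLoadExtension( 'CrawlerProtection' );

// Optional: these are the documented defaults, restated here for clarity.
$wgCrawlerProtectedSpecialPages = [
	'mobilediff',
	'recentchangeslinked',
	'whatlinkshere'
];
$wgCrawlerProtectedActions = [ 'history' ];
```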
Configuration
This extension requires no configuration since it uses sensible defaults. That said, you can easily customize which special pages to block, and you can even enable a fast-fail response to further reduce load on your application server.
$wgCrawlerProtectedSpecialPages
An array of special pages to protect, which as of 1.1.0 defaults to:
$wgCrawlerProtectedSpecialPages = [
'mobilediff',
'recentchangeslinked',
'whatlinkshere'
];
Supported values are special page names, or their aliases, regardless of case.
You should omit the Special: prefix.
Protecting more pages
The above pages were deliberately selected to preserve reasonable, maximal access for anonymous users while blocking the heavy-hitting, resource-intensive special pages that are almost never needed outside WMF use cases.
A few other pages, such as Special:RecentChanges, are good candidates for inclusion if so desired.
It is dubious whether the nuclear option of blocking almost all special pages is any more effective than simply blocking the above default pages, plus maybe a few more such as Special:RecentChanges.
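For example, appending Special:RecentChanges to the default list could look like this in LocalSettings.php (a sketch; adjust the list to your wiki's needs):

```php
// Also block Special:RecentChanges for anonymous users, in addition
// to the defaults. Names are case-insensitive and omit the Special: prefix.
$wgCrawlerProtectedSpecialPages[] = 'recentchanges';
```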
If you wish to block more special pages, you can fetch a full list of special pages defined by your wiki using the API and jq with a simple bash one-liner like:
$ curl -s "[YOURWIKI]api.php?action=query&meta=siteinfo&siprop=specialpagealiases&format=json" | jq -r '.query.specialpagealiases[].aliases[]' | sort
Of course, certain special pages must be allowed, like Special:UserLogin, or else the wiki will break, so do not block everything.
Greg Rundlett has posted an allow list of special pages, along with a discussion of problematic special pages and techniques for listing your wiki's special pages, for use with this extension's configuration or the equivalent configuration for Extension:Lockdown. The list is his own opinion and is not endorsed by the extension authors.
$wgCrawlerProtectedActions
Available as of extension version 1.3.0.
Add the name of the action to this array to block anonymous users from accessing it.
Actions include things like ?action=edit and ?action=history.
Defaults to:
$wgCrawlerProtectedActions = [
'history'
];
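For example, to also block anonymous access to ?action=edit (mentioned above), you might extend the default in LocalSettings.php. Whether blocking the edit action is appropriate depends on your wiki's access model, so treat this as a sketch:

```php
// Block both the history and edit actions for anonymous users.
$wgCrawlerProtectedActions = [
	'history',
	'edit'
];
```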
$wgCrawlerProtectionAllowedIPs
Available as of extension version 1.4.0.
Allow these specified IP addresses or IP address ranges (in CIDR notation) to bypass CrawlerProtection blocks.
This variable can be either a string (for just one IP address/IP range) or an array (for multiple IP addresses/ranges).
By default, no IP addresses are allowed to bypass the blocks.
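For example, either form could be used in LocalSettings.php (the addresses below are RFC 5737 documentation addresses; substitute your own):

```php
// A single IP address as a string...
$wgCrawlerProtectionAllowedIPs = '203.0.113.7';

// ...or multiple addresses and ranges as an array (CIDR notation for ranges).
$wgCrawlerProtectionAllowedIPs = [
	'203.0.113.7',
	'198.51.100.0/24'
];
```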
$wgCrawlerProtectionRawDenial
By default, this is set to false.
When set to true, enables a fail-fast response consisting only of some text and a header value.
This is in contrast to the usual "Access denied" display on the special page, with MediaWiki's page and resources fully loaded.
Setting this value to true should reduce server overhead by quite a bit, but comes with the downside of a not-so-user-friendly error message.
When this is enabled, it by default will return and display 403 Forbidden on the page.
However, with the variables defined below, this can be changed to something else.
Or, for 418 I'm A Teapot, simply set the convenience variable $wgCrawlerProtectionUse418 to true.
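For example, enabling the raw denial together with the 418 convenience variable in LocalSettings.php:

```php
// Send a bare fast-fail response instead of rendering the full page...
$wgCrawlerProtectionRawDenial = true;
// ...and respond with 418 I'm a teapot rather than the default 403 Forbidden.
$wgCrawlerProtectionUse418 = true;
```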
$wgCrawlerProtectionRawDenialHeader and $wgCrawlerProtectionRawDenialText
In most cases, these don't need to be changed. However, sometimes a custom denial header and/or text are desired. In such cases, they can be changed by setting these two variables.
By default, these are set to the sensible values of "HTTP/1.0 403 Forbidden" and "403 Forbidden. You must be logged in to view this page." respectively.
An example that allows a return link and back button might look like:
$wgCrawlerProtectionRawDenialText = "You must be logged in to view this page.<br><button onclick=\"history.back()\">Go Back</button> or return to the <a href=\"https://url_return_page\">Wiki Page</a>";
$wgCrawlerProtectionUse418
Sets whether the HTTP status code 418 should be used for a fail-fast response.
As of extension version 1.2.0, this will only have an effect if $wgCrawlerProtectionRawDenial is set to true.
Defaults to:
$wgCrawlerProtectionUse418 = false;
When set to true, no further processing is done.
Instead, MediaWiki responds with an HTTP 418 "I'm a teapot" status code.
System messages
There are a couple of system messages (e.g. crawlerprotection-accessdenied-text) which will be shown when access is denied.
You can override these system messages as desired on-wiki.
Version compatibility table
Only LTS versions of MediaWiki are officially supported, but it's generally safe to assume that in-between versions are also supported just fine. For instance, if a version supports both 1.43 and 1.39, then it's safe to assume that it works with 1.40, 1.41, and 1.42.
| | 1.43+ | 1.39.4–1.39.17 | 1.39.0–1.39.3 | 1.35 | 1.34 and below |
|---|---|---|---|---|---|
| 1.4.0 | Yes | Yes | No | No | No |
| 1.3.0–1.3.1 | Yes | Yes | No | No | No |
| 1.2.0 | Yes | Yes | No | No | No |
| 1.1.0 | Yes | Yes | No | No | No |
| 1.0.0 and earlier | Yes | Yes | No | No | No |
Frequently asked questions
- Does this extension harm SEO or otherwise prevent Google from crawling my wiki?
- This extension in no way prevents traditional legitimate crawlers respecting robots.txt (such as those from Google) from accessing content pages of wikis with CrawlerProtection installed.
- Does hiding the history and diff pages from users violate the Creative Commons license?
- There is no requirement in any Creative Commons license to also show the history of a page. Therefore, this extension would not violate any Creative Commons license terms. (This is not legal advice.)
- Should I use CrawlerProtection, a web application firewall, or a reverse proxy?
- Being an extension, CrawlerProtection is inherently limited to operating at the application layer (i.e. OSI Layer 7) and is only meant to block MediaWiki-specific pages. Ideally, CrawlerProtection should be used in conjunction with a WAF, like Cloudflare, Amazon CloudFront, or Azure Front Door, or a reverse proxy, such as HAProxy or Nginx. Only by covering as many layers as possible can your wiki, web server, and host have the best chance of defending themselves from AI crawler attacks.
Authors
CrawlerProtection was originally created and is maintained by Jeffrey Wang. MediaWiki 1.43 support was added by Skizzerz in June 2025. Many features on the wishlist, including more fine-grained settings, were added by Greg Rundlett in November 2025. Marijn van Wezel added IP address allowlisting. This extension will continue to have features added to it with the support of the community.
