How to isolate and handle rogue robots
Definition: A robot is an automated process which visits web pages. A robot is "rogue" if it fails to adhere to the Robots Exclusion Protocol which is documented at www.robotstxt.org.
Background: Robots can place extremely heavy traffic on a site. Also, when robots visit a web page, they tend to make deliberate use of the information found there, such as to seed a search engine or gather email addresses for spamming. Sometimes a web author does not want his server to be subject to the heavy traffic or does not want his information used by the robots. In these cases, he can use a "robots.txt" file in the root of his server to notify robots that they are not welcome. Under the protocol, every robot is expected to check for the "robots.txt" file and obey its directives.
It is very easy to create a robot which does not respect the protocol. There are three main reasons for robots to not respect the protocol:
Ignorance. The robot writer does not know that there is a protocol.
Arrogance. The robot writer assumes that all sites will always welcome his robot. Often these writers think "I'm not really a robot because I'm not a search engine indexer". Programs which download an entire site for offline viewing often fall into this category.
Hostile intent. The robot writer knows that the web site owner does not want to be visited, but chooses to visit the site anyway.
All of these reasons are excellent reasons for the web site owner to classify the robot as "rogue" and to take countermeasures against it.
How to identify a rogue robot
Summary: To identify a rogue robot, we create a forbidden zone on the web site that is off limits to robots. Then, we try to lure robots into that area by creating links to it. Finally, we place a trigger in the forbidden zone so that anyone who enters it will set off a pre-programmed Guardian response.
Create a folder named "aaaa" on your site.
Add the following directive to the top of your robots.txt file. Create the robots.txt file if necessary:
User-Agent: *
Disallow: /aaaa/
Note: do not use this approach if you have already granted special privileges to certain user-agents using the "Disallow/null" directive, since they are allowed to ignore the rest of the robots.txt file. We don't want to trap innocent robots.
In the /aaaa/ folder, create a file named "index.html". Put the following text in the file:
<meta name="robots" content="none,noindex,nofollow">
<a href="/aaaa/kill/"></a>
On the main default document of your site, towards the top, add the following link. Do not put any text inside the A tag. This way, humans will not be able to click on the link, but robots will still see it:
<a href="/aaaa/"></a>
The path /aaaa/kill/ on your server is now a trigger. The robots.txt file and robots META tag warn people to stay away, while the two links remain as lures. Those who ignore the warnings and follow the lures will set off the trap.
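Before relying on the trap, it can help to confirm that a compliant robot would read the new rule as a ban on the whole forbidden zone. The sketch below feeds the directive from the steps above to Python's standard urllib.robotparser module. The user-agent name "ExampleBot" and the helper name are hypothetical, and the check says nothing about how any particular real robot behaves.

```python
from urllib.robotparser import RobotFileParser

# The directive from the steps above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /aaaa/
"""


def compliant_robot_may_fetch(path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a robot that honors robots.txt may request `path`.

    "ExampleBot" is a made-up user-agent used only for illustration.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # The front page stays open; the forbidden zone and the trigger do not.
    for path in ("/", "/aaaa/", "/aaaa/kill/"):
        print(path, compliant_robot_may_fetch(path))
```

If the helper reports that /aaaa/kill/ is off limits, then any client that still requests that path has either not read robots.txt or has chosen to ignore it.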
How to respond to a rogue robot
We link to the trigger path /aaaa/kill/, but that path should not actually exist on your server. In this way, when requested, it will cause a 404 Not Found error which is handled by Guardian. An example of a handler is below. More detailed information on how to set up handlers can be found in the help topics listed below.
==
# handle rogue robots
url-substring: /aaaa/kill
blacklist: /usr/www/users/xav/.htaccess
==
The blacklist and dos responses are effective ways to deal with hostile traffic. They are described in more detail in the article Best practices for handling hostile probes.
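To make the idea behind a blacklist response concrete, here is a minimal Python sketch of what such a handler might do. It is not Guardian's implementation: the function name, the trigger constant, and the choice of an Apache-style "Deny from" line are all assumptions made for illustration, and the exact rule Guardian writes may differ.

```python
import ipaddress
from pathlib import Path

# Same substring as the url-substring rule; the constant name is hypothetical.
TRIGGER = "/aaaa/kill"


def handle_not_found(request_path: str, client_ip: str, htaccess: Path) -> bool:
    """Blacklist client_ip if a 404 was caused by the trigger path.

    Returns True when the request matched the trigger, False otherwise.
    Assumes the server honors "Deny from <ip>" lines in .htaccess.
    """
    if TRIGGER not in request_path:
        return False  # an ordinary 404, not the trap

    # Validate the address so junk input is never written into the file;
    # ip_address() raises ValueError for anything that is not an IP.
    ip = ipaddress.ip_address(client_ip)
    rule = f"Deny from {ip}"

    existing = htaccess.read_text(encoding="utf-8") if htaccess.exists() else ""
    if rule not in existing.splitlines():
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if existing == "" or existing.endswith("\n") else "\n"
        with htaccess.open("a", encoding="utf-8") as f:
            f.write(f"{prefix}{rule}\n")
    return True
```

The sketch adds each address only once, so a robot that hits the trigger repeatedly does not bloat the file.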
In addition, you can harass robots by exploiting the fact that they often must download and parse whatever text they receive. The replace reaction type allows you to send down huge (say, 100 MB) files that are full of links and email addresses. Attempting to parse these files may cause the robot to choke and die.
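A decoy file of that kind does not need to be written by hand. The following Python sketch generates one in chunks, so it never has to sit in memory all at once. Everything in it is an assumption for illustration: the function name, the link pattern under /aaaa/, and the use of the reserved example.invalid domain so that no real mailbox is ever listed. How the output is wired into a replace reaction is left to the Guardian documentation.

```python
import itertools
from typing import Iterator


def decoy_chunks(total_bytes: int, chunk_items: int = 1000) -> Iterator[bytes]:
    """Yield exactly total_bytes of filler made of links and fake addresses.

    Links point back into the forbidden zone, and the addresses use the
    reserved "example.invalid" domain, so nothing real is exposed.
    """
    if total_bytes <= 0:
        return
    sent = 0
    for start in itertools.count(0, chunk_items):
        lines = [
            f'<a href="/aaaa/{i}.html">{i}</a> '
            f'<a href="mailto:user{i}@example.invalid">user{i}@example.invalid</a>\n'
            for i in range(start, start + chunk_items)
        ]
        chunk = "".join(lines).encode("ascii")
        if sent + len(chunk) >= total_bytes:
            # Final chunk: trim so the total size is exact, then stop.
            yield chunk[: total_bytes - sent]
            return
        sent += len(chunk)
        yield chunk


if __name__ == "__main__":
    # Write a small 1 MB sample; scale total_bytes up for a larger decoy.
    with open("decoy.html", "wb") as out:
        for piece in decoy_chunks(1_000_000):
            out.write(piece)
```

Because every link in the filler leads back under /aaaa/, a compliant robot has no reason to ever see or follow it.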
More information on Guardian filter rules is found in the article Overview of Guardian Filter Rules.
"How to isolate and handle rogue robots" http://www.xav.com/scripts/guardian/help/1702.html