Preventing usage spikes due to automated access
The Fluid Dynamics Search Engine can handle one search query every few seconds. FDSE will have trouble if it receives more than one query per second for a sustained period, or if there is a spike of many requests within a single second. In those cases, FDSE will begin to use significant CPU and memory resources. If high utilization continues for an extended period of time, you (or your web host administrators) may need to shut down FDSE.
Here are some common reasons for high utilization, and ways you can avoid them.
-
Automated new URL submission
FDSE includes the option -- disabled by default -- for your visitors to add their own URL to the search index.
If you enable this, then companies which submit sites to search engines may discover your FDSE instance and add it to their automated systems. This can result in a flood of submissions that can bring FDSE down.
If this happens, you may want to disable visitor-added URLs entirely. You can also enable a few settings that limit automatic submissions; see Automatic submissions to the visitor-added URL form.
Note: when visitor-added URLs are allowed, it is much more efficient to require administrator approval for each URL than to accept submissions immediately. That way, FDSE does not need to "touch" the index files for each new URL -- it can wait until an administrator approves a whole batch of them. Updating the live index file, rather than crawling the URL, is often what takes up significant resources.
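For illustration only, here is a minimal Perl sketch of the difference (this is not FDSE's actual code; the file names, formats, and subroutines are assumptions). Accepting a submission immediately means touching the live index on every request, while deferring approval only appends to a small queue file that the administrator later merges in a single batch:

#!/usr/bin/perl
# Illustrative sketch only -- not FDSE's code; file names and formats are assumptions.
use strict;
use warnings;

my $pending_file = 'pending_urls.txt';   # hypothetical queue of visitor submissions
my $index_file   = 'live_index.txt';     # hypothetical live index file

# Immediate acceptance: the live index is touched on every single submission.
sub submit_immediate {
    my ($url) = @_;
    open my $fh, '>>', $index_file or die "Cannot open $index_file: $!";
    print {$fh} "$url\n";
    close $fh;
    # (a real engine would also crawl the URL and rebuild the index here)
}

# Deferred approval: the submission only lands in a cheap queue file.
sub submit_deferred {
    my ($url) = @_;
    open my $fh, '>>', $pending_file or die "Cannot open $pending_file: $!";
    print {$fh} "$url\n";
    close $fh;
}

# Later, the administrator approves the whole batch; the index is touched once.
sub approve_batch {
    open my $in,  '<',  $pending_file or return;
    open my $out, '>>', $index_file   or die "Cannot open $index_file: $!";
    print {$out} $_ while <$in>;
    close $in;
    close $out;
    unlink $pending_file;
}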
-
Searches by search engine crawlers
By default, search engine bots (like Googlebot, Scooter, etc.) will try to request every URL on your site. If your pages link directly to search results, like:
View <a href="http://xav.com/search.pl?Terms=perl">pages about Perl</a>.
then the search engine bots will make requests to /search.pl?Terms=perl. If those results span more than one page, then the bot will see the 1 - 2 - 3 - 4 - Next links, and will request those too. If you have direct links to multiple keywords, and multiple pages of results for each, then bots may have hundreds or thousands of possible search.pl queries to make. And they may make them all at the same time.
There are three ways to avoid this potential problem:
-
In your site's /robots.txt file, exclude access to the search.pl and proxy.pl files. If they are both in the same folder, with no other content, you can just exclude that folder:
User-Agent: *
Disallow: /search/
Disallow: /cgi-bin/search/
-
In the FDSE template file:
/search/searchdata/templates/header.htm
make sure you have a conditional robots META tag which appears whenever there are terms present:
<% if terms %><meta name="robots" content="index,nofollow" />
<% end if %><meta http-equiv="Content-Type" content="%content_type%" />
This META tag will prevent well-behaved search engines from following the 1 - 2 - 3 - 4 - Next links which appear at the bottom of multi-page search results.
-
Do not use direct "A HREF" links to search results.
You can also use forms or Javascript to link to results. These types of navigation are usually not followed by bots.
Examples of non-HREF navigation:
<form method="get" action="http://xav.com/search.pl"> <input name="q" value="Perl" size="10" /> <input type="submit" value="Go" /> </form><form method="get" action="http://xav.com/search.pl"> <input name="q" value="Perl" type="hidden" /> <input type="submit" value="Search for Perl" /> </form>Click here to search for Perl
<p onclick="location.href='http://xav.com/search.pl?q=Perl'"> <u><b>Click here to search for Perl</b></u> </p> <noscript>(requires Javascript)</noscript>
-
-
Searches by auto-downloaders
Many users have utilities that attempt to download an entire website for offline viewing. These utilities act very similarly to search engine bots -- they discover all links and follow them -- but they are more difficult to control, because they consider themselves exempt from the robots exclusion standard.
The only way to avoid these types of tools is to avoid direct "A HREF" linking, as described above. The /robots.txt and robots META tag techniques won't work.
If you or your web host administrators believe that FDSE is using too many resources, it is important to investigate whether automated queries are responsible. If so, the problem can often be solved immediately, and you can go on enjoying FDSE.
To investigate the cause, you should scan your web server log files, looking at all requests to FDSE. If many requests come from the same IP or User-Agent, then the problem is likely automated queries.
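If it helps, here is a small Perl sketch of such a log scan (the log path and format are assumptions -- adjust them for your server; it expects a common/combined-format access log where the client address is the first field on each line):

#!/usr/bin/perl
# Count requests to search.pl per client IP in a web server access log.
# The log path and format are assumptions; adjust for your server.
use strict;
use warnings;

my $log_file = '/var/log/apache/access_log';   # hypothetical location
my %hits;

open my $log, '<', $log_file or die "Cannot open $log_file: $!";
while (<$log>) {
    next unless /search\.pl/;      # only count requests to the search script
    my ($ip) = split ' ', $_;      # client address is the first field
    $hits{$ip}++;
}
close $log;

# Busiest clients first; a single address with hundreds of hits usually
# means an automated tool is driving the queries.
for my $ip (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
    print "$hits{$ip}\t$ip\n";
}

The same idea applies to the User-Agent field if your log format records it.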
"Preventing usage spikes due to automated access"
http://www.xav.com/scripts/search/help/1177.html