Home > Fluid Dynamics Search Engine > Help > 1051

Following Links with Crawler

When the crawler visits a web page, it can also visit links and frames found on the page. Doing so is a three-part process: link detection, storage, and crawling. Each part of the process is discussed below.

Following Links with Crawler: Link Detection

The crawler treats the following patterns as links: A HREF="address", FRAME SRC="address", and IFRAME SRC="address". Note that this captures only HTML links, not links written with a scripting language. To customize the patterns that are treated as links, edit the "Link:" loop at the beginning of the parse_html_ex subroutine.
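The detection step amounts to a pattern match over the page source. The following is an illustrative Python sketch of that idea, not the actual Perl code in parse_html_ex:

```python
import re

# Case-insensitive patterns approximating what the crawler treats as links:
# A HREF="...", FRAME SRC="...", and IFRAME SRC="...". Links generated by a
# scripting language are invisible to this kind of matching, as noted above.
LINK_PATTERN = re.compile(
    r'<(?:a\s[^>]*href|frame\s[^>]*src|iframe\s[^>]*src)\s*=\s*"([^"]+)"',
    re.IGNORECASE,
)

def detect_links(html):
    """Return every address found via the three supported patterns."""
    return LINK_PATTERN.findall(html)
```

A script-generated link illustrates the limitation: `detect_links('<a href="http://a/">x</a><script>location="http://c/"</script>')` finds only the plain HTML link.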

Once found, each link is compared against several rules. Only links which pass all the rules will make it to the next step. The rules are:

Protocol - the crawler can only "speak" the HTTP protocol, so links must begin with "http://". Links beginning with "https://", "ftp://", "gopher://" or any other protocol will be ignored.

Address length - if a web address is too long, the crawler will ignore it. The maximum size of an address is set with the "Max Characters: URL" setting, which is 128 by default.

Query strings - by default, the crawler will ignore links of the form HREF="file.pl?test" because there is a question mark in the address. Any web address which contains a question mark is said to have a query string. The crawler's behavior for these documents is controlled by the "Crawler: Follow Query Strings" General Setting.

File extension - some links point to files that are almost certainly not readable text. The crawler skips a link if its file extension suggests non-text data. These extensions are stored as a list of lowercase file extensions separated by the pipe character, in the "Crawler: Ignore Links To" General Setting.

Note that this rule can be extended to cover text files which are readable but still shouldn't be indexed; the ".log" or ".old" extensions are good examples.
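The extension check can be sketched as follows. The setting value shown is a hypothetical example; the real default list ships with FDSE and may differ:

```python
# Hypothetical value for the "Crawler: Ignore Links To" General Setting:
# lowercase extensions separated by the pipe character, as described above.
IGNORE_LINKS_TO = "gif|jpg|jpeg|png|zip|exe|log|old"

IGNORED = set(IGNORE_LINKS_TO.split("|"))

def passes_extension_rule(url):
    """Reject a link whose file extension appears in the ignore list."""
    path = url.split("?", 1)[0]           # drop any query string
    last = path.rsplit("/", 1)[-1]        # final path component
    if "." not in last:
        return True                       # no extension at all
    ext = last.rsplit(".", 1)[-1].lower()
    return ext not in IGNORED
```

Note that the comparison is case-insensitive: a link to "archive.ZIP" is skipped just like one to "archive.zip".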

Robots Exclusion - a file which contains the Meta tag <meta name="robots" content="nofollow" /> will not have internal links followed, as required by the Robots Exclusion Standard. Setting "Crawler: Rogue" to 1 will cause the crawler to ignore the exclusion standard, but this is frowned upon by the Internet community.
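The exclusion check reduces to looking for that Meta tag before following any links. A minimal sketch, assuming the name attribute precedes the content attribute:

```python
import re

# Matches a robots Meta tag whose content includes "nofollow", e.g.
# <meta name="robots" content="nofollow" />. Attribute order is assumed
# to be name-then-content for this sketch.
NOFOLLOW = re.compile(
    r'<meta\s+name\s*=\s*"robots"\s+content\s*=\s*"[^"]*nofollow[^"]*"',
    re.IGNORECASE,
)

def may_follow_links(html, rogue=False):
    """Honor the Robots Exclusion Standard unless "Crawler: Rogue" is 1."""
    if rogue:
        return True
    return NOFOLLOW.search(html) is None
```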

Hostname - some links point to remote websites. To restrict the crawler to only the initial website, set setting "Crawler: Follow Offsite Links" to 0.

In addition to these rules, the crawler will also stay on the same website if the "Index Entire Site" option is checked at the beginning of the crawl session. The "Index Entire Site" option is more restrictive than the "Crawler: Follow Offsite Links" setting, because it limits followed links to the same directory or a subdirectory of the initial document. The offsite links rule only limits the links to the same hostname.
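The difference between the two restrictions can be made concrete. In this sketch, same_host corresponds to "Crawler: Follow Offsite Links" = 0, and same_site to the stricter "Index Entire Site" option:

```python
from urllib.parse import urlparse

def same_host(link, start_url):
    """Offsite-links rule: only the hostname must match."""
    return urlparse(link).hostname == urlparse(start_url).hostname

def same_site(link, start_url):
    """Entire-site rule: the link must live in the initial document's
    directory or a subdirectory of it - a stricter test than same_host."""
    base = start_url.rsplit("/", 1)[0] + "/"
    return link.startswith(base)
```

For a crawl starting at http://example.com/docs/index.html, a link to http://example.com/other/page.html passes same_host but fails same_site.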

Following Links with Crawler: Storage

Every link that is found is inserted into the pending pages file, search.pending.txt, which lists documents in alphabetical order. The file includes pages that have already been searched, pages that were crawled but returned errors, and pages still waiting to be searched.

The format of the file is:

http://address/ RealmName State

"State" is a number. If it is 0, then the address is waiting to be crawled. If it is 2, then the crawler tried to index this page before, but encountered an error (the address is now blacklisted *). If it is a large number, then it represents the time that the address was indexed.

If an address is found while crawling, and that address already shows up in the pending file, then no action is taken. The original state of the address - i.e., being indexed, being an error page, or awaiting indexing - always takes precedence. If the user has just used the "Index Remote Web Page" form, then all the addresses found will be listed along with their state (already indexed, error, or awaiting).

If an address is found that hasn't been encountered before, then it will be inserted into the pending file. It will have state "0", which means that it is ready to be crawled.
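The insertion rule can be sketched with an in-memory dictionary keyed by address, a simplification of the alphabetically sorted flat file:

```python
def add_pending(entries, address, realm):
    """Insert a newly found address with state "0", unless it is already
    present - an existing state always takes precedence."""
    if address in entries:
        return entries[address]           # no action taken
    entries[address] = (realm, "0")       # ready to be crawled
    return entries[address]
```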

Following Links with Crawler: Crawling

By default, links are not followed automatically.

To follow links in a specific document, use the "Index Remote Web Page" form on the main admin page to get the initial document. The result page will list all links found, and from there you can click to follow the desired links.

However, if the "Index Entire Site" option is chosen along with the "Index Remote Web Page" form, then all links on that site will be followed automatically.

After a page is crawled, the pending file is updated with the results of the crawl. Its state will be updated to "2" if there was an error *, or to the current time if there was not. Time is an integer number of seconds since 1970. This time value is used by other parts of the script to decide when the database is due for a refresh.
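The update step, continuing the dictionary simplification of the pending file, might look like:

```python
import time

def record_crawl_result(entries, address, error):
    """After a crawl, store "2" on error or the current epoch time on
    success; the timestamp later drives database-refresh decisions."""
    entries[address] = "2" if error else str(int(time.time()))
```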

* The set of blacklisted files can be cleared by going to "Admin Page" => "Data Storage" => "Clear Error Cache" (available in newer versions of FDSE; in earlier versions, the search.pending.txt file can be edited manually).

    "Following Links with Crawler"