
Filter Rules: Using the "no update on redirect" rule

Background on Redirects

Some web pages redirect to other pages. The FDSE web crawler tries to follow all such redirects. Once the crawler finally arrives at a real, non-redirecting web page, it will record all the content it finds there and, by default, it will store the content using the URL of that final web page.

For example, consider what happens when somebody follows a "buy my book" link on an author's page:

Requesting 'http://www.xav.com/track.cgi?23'... 0 sec
Document redirects to 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322'

Requesting 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322'... 1 sec
Document redirects to 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show'

Requesting 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show'... 1 sec
Document redirects to 'http://www.amazon.com/specials/322/'

Requesting 'http://www.amazon.com/specials/322/'... 2 sec
Read 25,154 bytes

The person following the link notices only a slight delay and then sees the final page. They are often unaware that redirection is occurring. In this example, the first link is to a tracker/redirector CGI script on the author's site, perhaps one that logs which links people follow. The next two redirects take the visitor through a cookie-setting system which is used to track affiliate sales and pay referral commissions to the author. The final request is for the actual HTML file about the book and the associated order form.
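The trace above can be modeled as a simple chain walk. The following is a minimal Python sketch, not FDSE's own code (FDSE is written in Perl): the redirect table is a hypothetical stand-in for real HTTP responses, and the `max_hops` limit is an assumption added to guard against loops.

```python
# Hypothetical redirect table mirroring the trace above. A real crawler
# would learn each hop from the HTTP responses instead.
REDIRECTS = {
    "http://www.xav.com/track.cgi?23":
        "http://affiliates.amazon.com/r/af=zoltanm/bookid=322",
    "http://affiliates.amazon.com/r/af=zoltanm/bookid=322":
        "http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show",
    "http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show":
        "http://www.amazon.com/specials/322/",
}


def follow_redirects(url, redirects, max_hops=10):
    """Return every URL visited, from the first request to the final page."""
    chain = [url]
    while chain[-1] in redirects:
        if len(chain) > max_hops:  # assumed safety limit, not from FDSE
            raise RuntimeError("too many redirects")
        chain.append(redirects[chain[-1]])
    return chain


chain = follow_redirects("http://www.xav.com/track.cgi?23", REDIRECTS)
first_url, final_url = chain[0], chain[-1]
```

Here `first_url` is the tracker link on the author's site and `final_url` is the page whose content actually gets read.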

FDSE Default Behavior

By default, FDSE will record only the final URL and the text found at that final URL. It will tag the first URL and any intermediary URLs as errors because there was no text content associated with them; if the index already contains records for those URLs, those records will be removed.

In the example above, that means only the final URL http://www.amazon.com/specials/322/ will be shown to visitors in the search results. When they follow that link in the search results, they will go straight to the final destination, without being tracked and tagged by the various redirecting URLs.

There are three reasons for this behavior:

Reasons to Override Default Behavior

Sometimes the FDSE administrator will want visitors to explicitly follow all steps in the redirect path, rather than short-cutting to the final URL. For this reason the "no update on redirect" rule exists. When that rule is applied, the text, title, and description of the final URL are stored in the index, searched, and used to create the search results display; but the actual URL displayed to visitors and followed by them is the first URL at the beginning of the redirect chain.

Some scenarios where this is needed are:

FDSE installs by default with an "Affiliates" filter rule that uses the "no update on redirect" action for amazon.com and linksynergy.com links.
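The difference between the default behavior and the "no update on redirect" behavior can be sketched in a few lines of Python. This is an illustration only: the function name and the record layout are hypothetical and do not reflect FDSE's internal index format.

```python
def build_record(chain, final_text, no_update):
    """Build a hypothetical index record for a crawled redirect chain.

    chain      -- list of URLs, first request first, final page last
    final_text -- content read from the final page
    no_update  -- True when a "no update on redirect" rule applies
    """
    # The text always comes from the final page; only the stored URL differs.
    url = chain[0] if no_update else chain[-1]
    return {"url": url, "text": final_text}


chain = [
    "http://www.xav.com/track.cgi?23",
    "http://www.amazon.com/specials/322/",
]
default_record = build_record(chain, "book page text", no_update=False)
affiliate_record = build_record(chain, "book page text", no_update=True)
```

Both records carry the same searchable text, but `default_record` links to the Amazon page while `affiliate_record` links to the tracker URL at the start of the chain.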

How to Override Default Behavior in Some Cases

To create a "no update on redirect" filter rule, follow these steps:

  1. Log in to the "Admin Page".

  2. Choose the "Filter Rules" link from the navigation menu. From that page, choose Filter Rules - Create New Rule.

  3. You will be taken to the Create or Update Rule page. Enter these values:

    Name: My Rule
    Enabled: [x] (checked)
    Action: (*) Do not update URL during redirects
    Analyze: (*) URL
    Minimum Occurrences: 1
    (*) Apply rule only...
    Strings: (enter URLs or hostnames here)
    Scope: (*) Apply to all realms

  4. Click the "Save Data" button to save your new rule.

  5. Test the rule by going to "Admin Page" and adding a redirecting URL in the "Add New URL" form. Make sure the original URL has been listed in the filter rule "strings" section that was just created. After indexing, the original URL should be retained.

Note: "no-update" filter rules can analyze only hostnames and URLs, not the document text. This is because these rules are triggered before the document is actually received, and because redirecting URLs do not have any document text to speak of anyway.

The hostnames and URLs that you enter in the "strings" or "pattern" sections are applied to the initial URL, not to the final or intermediary URLs. In the example listed above, the string "www.xav.com" would be needed to prevent the system from logging the final URL http://www.amazon.com/specials/322/. Entering "affiliates.amazon.com" as a string would not work because only the first URL is analyzed, not the intermediary redirect URLs.
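The point that only the initial URL is tested can be made concrete with a short Python sketch. It assumes that "strings" are matched as plain, case-sensitive substrings of the URL; the function name is hypothetical and FDSE's actual matching code may differ.

```python
def no_update_applies(initial_url, strings):
    """Return True if any rule string occurs in the initial URL.

    Only the first URL of the chain is passed in, so strings that appear
    solely in intermediary or final URLs can never trigger the rule.
    """
    return any(s in initial_url for s in strings)


initial_url = "http://www.xav.com/track.cgi?23"

matches_own_site = no_update_applies(initial_url, ["www.xav.com"])
matches_intermediary = no_update_applies(initial_url, ["affiliates.amazon.com"])
```

With the example chain, `matches_own_site` is true and `matches_intermediary` is false, which is why "affiliates.amazon.com" has no effect as a rule string.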

How to Override Default Behavior in All Cases

To make it so that FDSE always indexes the first URL in a chain, rather than the final one, simply create a "no update on redirect" rule, set it to analyze the "URL", and enter the string "http" in the "strings" section. This rule will thus apply to all documents.

Definition of Redirect

FDSE will treat any HTTP 300-series response, accompanied by a Location: header, as a redirect.
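As a rough illustration of this definition, the check below treats a response as a redirect only when the status code is in the 300 range and a Location header is present. It is a Python sketch with a hypothetical function name, not FDSE's implementation.

```python
def is_http_redirect(status, headers):
    """Return True for a 3xx status accompanied by a Location header.

    headers is a dict of header names to values; header names are
    compared case-insensitively, as HTTP requires.
    """
    has_location = any(name.lower() == "location" for name in headers)
    return 300 <= status <= 399 and has_location


redirected = is_http_redirect(302, {"Location": "http://www.amazon.com/specials/322/"})
not_modified = is_http_redirect(304, {"ETag": '"abc"'})
```

A 304 response without a Location header, as in the second call, is not treated as a redirect under this definition.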

It will also treat as a redirect any HTML file containing a "meta refresh" with a trigger time of less than 10 seconds. The search for a "meta refresh" header is done against the first 4096 raw bytes of the file, and can generate false positives on refresh tags that are commented out or that are buried in conditional SCRIPT blocks. If you need to override the parsing of "meta refresh" tags, you can edit subroutine process_text in library "searchmods/common_parse_pages.pl".
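The following Python sketch shows one way such a scan could work and why it produces false positives. It is not the code from process_text: the regular expression is an assumption, and it only handles tags where http-equiv appears before content.

```python
import re

# Assumed pattern: <meta ... http-equiv=refresh ... content="N; url=...">
META_REFRESH = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*content\s*=\s*["']?\s*(\d+)""",
    re.IGNORECASE,
)


def meta_refresh_delay(raw, limit=4096):
    """Return the refresh delay in seconds found in the first `limit`
    raw bytes, or None if no meta refresh tag is found there."""
    match = META_REFRESH.search(raw[:limit])
    return int(match.group(1)) if match else None


def is_meta_redirect(raw):
    """Treat the page as a redirect if it refreshes in under 10 seconds."""
    delay = meta_refresh_delay(raw)
    return delay is not None and delay < 10


fast = b'<head><meta http-equiv="refresh" content="5; url=http://example.com/"></head>'
slow = b'<head><meta http-equiv="refresh" content="30; url=http://example.com/"></head>'
commented = b'<!-- <meta http-equiv="refresh" content="0; url=http://example.com/"> -->'
```

Because the scan works on raw bytes and does not parse HTML, the commented-out tag in `commented` is still reported as a redirect, which is the kind of false positive described above.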

Complexities

All documents are analyzed to see whether the "no update" rule should be applied. The default state is to not apply the rule. If any single filter rule causes the "no update" bit to be set, FDSE stops processing further rules and keeps the "no update" bit. This means that it is not possible to have a general "no update"-true rule applying to all pages at www.xav.com and then to add a more specific "no update"-false rule applying to www.xav.com/scripts. If you need to do something fancy, you can use the "patterns" section instead of the "strings" section, and then you will have all the power and versatility of Perl regex.
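One way to get the effect of "all of www.xav.com except /scripts" with a single pattern is a negative lookahead. The sketch below uses Python's `re` module, whose lookahead syntax matches Perl's for this construct. The pattern itself is an untested suggestion, and it assumes that FDSE applies patterns to the full initial URL; the function name is hypothetical.

```python
import re

# Match any URL on www.xav.com unless the path starts with "scripts"
# as a whole path segment.
PATTERN = r"^http://www\.xav\.com/(?!scripts(/|$))"


def no_update_bit(initial_url, patterns):
    """Return True as soon as one pattern matches; later patterns are
    never consulted, mirroring the stop-on-first-match behavior."""
    for pattern in patterns:
        if re.search(pattern, initial_url):
            return True
    return False


tracker = no_update_bit("http://www.xav.com/track.cgi?23", [PATTERN])
scripts = no_update_bit("http://www.xav.com/scripts/search/", [PATTERN])
```

Here `tracker` is true and `scripts` is false, so the exclusion lives inside the one pattern rather than in a second, more specific rule.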

The "fdse-index-as" feature is implemented under the hood as a redirect. If a "no update" rule applies to the original URL, then any "fdse-index-as" header in the document will be ignored.


    "Filter Rules: Using the "no update on redirect" rule"
    http://www.xav.com/scripts/search/help/1124.html