Filter Rules: Using the "no update on redirect" rule
Background on Redirects
Some web pages redirect to other pages. The FDSE web crawler tries to follow all such redirects. Once the crawler finally arrives at a real, non-redirecting web page, it will record all the content it finds there and, by default, it will store the content using the URL of that final web page.
For example, consider what happens when somebody follows a "buy my book" link on an author's page:
Requesting 'http://www.xav.com/track.cgi?23'... 0 sec
Document redirects to 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322'
Requesting 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322'... 1 sec
Document redirects to 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show'
Requesting 'http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show'... 1 sec
Document redirects to 'http://www.amazon.com/specials/322/'
Requesting 'http://www.amazon.com/specials/322/'... 2 sec
Read 25,154 bytes
The person following the link notices only a slight delay and then sees the final page. They are often unaware that redirection is occurring. In this example, the first link is to a tracker/redirector CGI script on the author's site, perhaps one that logs which links people follow. The next two redirects take the visitor through a cookie-setting system which is used to track affiliate sales and pay referral commissions to the author. The final request is for the actual HTML file about the book and the associated order form.
FDSE Default Behavior
By default, FDSE will record only the final URL and the text found at that final URL. It will tag the first URL and the intermediary URL's as errors because there is no text content associated with them; if any records are in the index with those URL's, those records will be removed.
In the example above, that means only the final URL http://www.amazon.com/specials/322/ will be shown to visitors in the search results. When they follow that link in the search results, they will be taken straight to their final destination, without being tracked and tagged by the various redirecting URL's.
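The default behavior can be sketched in a few lines of Python. This is only an illustration, not FDSE's own code (FDSE is written in Perl): the redirect table is copied from the example log above, and the names `REDIRECTS`, `follow_chain`, `default_index_entry`, and the `max_hops` limit are all hypothetical.

```python
# Hypothetical redirect table mirroring the example crawl log above.
REDIRECTS = {
    "http://www.xav.com/track.cgi?23":
        "http://affiliates.amazon.com/r/af=zoltanm/bookid=322",
    "http://affiliates.amazon.com/r/af=zoltanm/bookid=322":
        "http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show",
    "http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show":
        "http://www.amazon.com/specials/322/",
}


def follow_chain(start_url, redirects, max_hops=10):
    """Return every URL visited, from start_url to the final page.

    max_hops is an assumed safety limit against redirect loops.
    """
    chain = [start_url]
    while chain[-1] in redirects:
        if len(chain) > max_hops:
            raise RuntimeError("too many redirects")
        chain.append(redirects[chain[-1]])
    return chain


def default_index_entry(start_url, redirects):
    """Model the default: index the final URL, drop all earlier ones."""
    chain = follow_chain(start_url, redirects)
    return {
        "indexed_url": chain[-1],     # URL shown in search results
        "removed_urls": chain[:-1],   # first and intermediary URL's
    }


entry = default_index_entry("http://www.xav.com/track.cgi?23", REDIRECTS)
print(entry["indexed_url"])
```

With the example chain, the indexed URL is the Amazon page and the three earlier URL's are the ones removed from the index.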
There are three reasons for this behavior:
Cuts down on spamming. A frequent spamming technique is to have hundreds of unique URL's which get indexed, and then each URL redirects visitors to a single destination site. The spammer has an unfair advantage in that his one single site is listed hundreds of times in the index.
Faster response time for visitors. Because visitors' browsers would otherwise have to follow all of those redirects, it is faster to send them immediately to the final destination.
Most redirects are trivial; they send all visitors from one URL to the next without performing any meaningful logging or tagging work. Short-cutting traffic to the final destination is the best solution for these types of redirects.
Reasons to Override Default Behavior
Sometimes the FDSE administrator will want visitors to explicitly follow all steps in the redirect path, rather than short-cutting to the final URL. For this reason the "no update on redirect" rule exists. When that rule is applied, the text, title, and description of the final URL are stored in the index, searched, and used to create the search results display; but the actual URL displayed to visitors and followed by them is the first URL at the beginning of the redirect chain.
Some scenarios where this is needed are:
When you are linking to sales pages through an affiliate program. The affiliate program needs to tag the visitor with cookies as they follow the redirect chain so that you will get credit for the sale.
When a redirect page detects the browser version or other client details and directs the visitor to a custom page appropriate for his capabilities.
When the true URL of your site is really ugly but you have a nice redirecting front-end URL. For example, your public URL is http://come.to/mama/ but it redirects to a final URL of http://members.aol.com/fjdsak/fjdks/asfdjkl/afd/. In this case you might prefer that all visitors are familiar with only your front-end URL.
FDSE installs by default with an "Affiliates" filter rule that uses the "no update on redirect" action for amazon.com and linksynergy.com links.
How to Override Default Behavior in Some Cases
To create a "no update on redirect" filter rule, follow these steps:
Log in to the "Admin Page".
Choose the "Filter Rules" link from the navigation menu. From that page, choose Filter Rules - Create New Rule.
You will be taken to the Create or Update Rule page. Enter these values:
Name: My Rule
Enabled: [x] (checked)
Action: (*) Do not update URL during redirects
Analyze: (*) URL
Minimum Occurrences: 1
(*) Apply rule only...
Strings: (enter URL's or hostnames here)
Scope: (*) Apply to all realms
Click the "Save Data" button to save your new rule.
Test the rule by going to "Admin Page" and adding a redirecting URL in the "Add New URL" form. Make sure the original URL has been listed in the filter rule "strings" section that was just created. After indexing, the original URL should be retained.
Note: the "no-update" type of filter rules can analyze only hostnames and URL's, not the document text. This is because these types of rules are triggered before the document is actually received, and because redirecting URL's do not have any document text to speak of anyway.
The hostnames and URL's that you enter in the "strings" or "pattern" sections are applied to the initial URL, not the final or intermediary URL. In the example listed above, the string "www.xav.com" would be needed to prevent the system from logging the final URL http://www.amazon.com/specials/322/. Entering "affiliates.amazon.com" as a string would not work because the intermediary redirect URL's are not analyzed, only the first URL.
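The point that only the initial URL is tested can be illustrated with a small Python sketch. It is not FDSE's implementation: the function name `displayed_url` is hypothetical, and plain case-sensitive substring matching is an assumption about how "strings" are compared.

```python
# Redirect chain from the example above, in the order it was crawled.
CHAIN = [
    "http://www.xav.com/track.cgi?23",
    "http://affiliates.amazon.com/r/af=zoltanm/bookid=322",
    "http://affiliates.amazon.com/r/af=zoltanm/bookid=322/userid=2/show",
    "http://www.amazon.com/specials/322/",
]


def displayed_url(chain, rule_strings):
    """Return the URL that would be stored for a crawled redirect chain.

    Only the first URL in the chain is checked against the rule strings
    (assumed here to be a case-sensitive substring test).
    """
    first_url, final_url = chain[0], chain[-1]
    no_update = any(s in first_url for s in rule_strings)
    return first_url if no_update else final_url


# Matches the initial URL, so the original link is kept.
print(displayed_url(CHAIN, ["www.xav.com"]))
# Appears only in intermediary URL's, so the rule never fires.
print(displayed_url(CHAIN, ["affiliates.amazon.com"]))
```

The first call keeps the tracker URL on www.xav.com, while the second falls back to the final Amazon URL because the string appears only in the intermediary redirects.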
How to Override Default Behavior in All Cases
To make it so that FDSE always indexes the first URL in a chain, rather than the final one, simply create a "no update on redirect" rule, set it to analyze the "URL", and enter the string "http" in the "strings" section. This rule will thus apply to all documents.
Definition of Redirect
FDSE will treat any HTTP 300-series response, accompanied by a Location: header, as a redirect.
It will also treat as a redirect any HTML file containing a "meta refresh" with a trigger time of less than 10 seconds. The search for a "meta refresh" header is done against the first 4096 raw bytes of the file, and can generate false positives on refresh tags that are commented out or that are buried in conditional SCRIPT blocks. If you need to override the parsing of "meta refresh" tags, you can edit subroutine process_text in library "searchmods/common_parse_pages.pl".
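The two redirect tests described above can be approximated as follows. This is a rough Python sketch under stated assumptions, not a port of process_text: the regular expression, the `is_redirect` name, and the requirement that http-equiv appears before content in the tag are all illustrative choices, and the real parser may accept more variations.

```python
import re

# Assumed pattern: a <meta> tag with http-equiv="refresh" followed by a
# content attribute that starts with the delay in seconds.
META_REFRESH = re.compile(
    rb"""<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*content\s*=\s*["']?\s*(\d+)""",
    re.IGNORECASE,
)


def is_redirect(status, headers, body=b""):
    """Return True if a response would count as a redirect.

    status  -- numeric HTTP status code
    headers -- dict of response header names to values
    body    -- raw response bytes
    """
    # Any 300-series response that carries a Location: header.
    has_location = any(name.lower() == "location" for name in headers)
    if 300 <= status <= 399 and has_location:
        return True
    # A "meta refresh" under 10 seconds in the first 4096 raw bytes.
    match = META_REFRESH.search(body[:4096])
    return bool(match) and int(match.group(1)) < 10


page = b'<head><meta http-equiv="refresh" content="0; url=http://www.xav.com/"></head>'
print(is_redirect(200, {}, page))
```

Because the check runs on raw bytes without parsing the HTML, a refresh tag inside an HTML comment is still detected, which reproduces the false positives mentioned above.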
All documents are analyzed to see whether the "no update" rule should be applied. The default state is to not apply the rule. If any single filter rule causes the "no update" bit to be set, then FDSE will stop processing further rules and keep the "no update" bit. This means that it is not possible to have a general "no update"-true rule applying to all pages at www.xav.com and then to add a more specific "no update"-false rule applying to www.xav.com/scripts. If you need to do something fancy, you can use the "patterns" section instead of the "strings" and then you will have all the power and versatility of Perl regex.
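One way to get the www.xav.com-except-/scripts behavior is to fold the exception into a single pattern with a negative lookahead, a construct that Perl and Python regular expressions share. The sketch below is in Python for illustration only: the names `no_update_applies` and `XAV_EXCEPT_SCRIPTS` are hypothetical, and it assumes the pattern is searched against the full initial URL.

```python
import re


def no_update_applies(url, patterns):
    """Return True as soon as any pattern matches; later rules are skipped."""
    for pattern in patterns:
        if pattern.search(url):
            return True   # the "no update" bit is set and stays set
    return False          # default state: the rule does not apply


# Everything on www.xav.com except the /scripts directory.
XAV_EXCEPT_SCRIPTS = re.compile(r"^http://www\.xav\.com/(?!scripts(?:/|$))")

rules = [XAV_EXCEPT_SCRIPTS]
print(no_update_applies("http://www.xav.com/track.cgi?23", rules))
print(no_update_applies("http://www.xav.com/scripts/search/", rules))
```

Since rules can only switch the bit on, the exclusion has to live inside the pattern itself rather than in a second rule.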
The "fdse-index-as" feature is implemented under-the-hood as a redirect. If a "no update" rule applies to the original URL, then any "fdse-index-as" header in the document will be ignored.
"Filter Rules: Using the "no update on redirect" rule" http://www.xav.com/scripts/search/help/1124.html