Preventing duplicates for sites with multiple names (i.e. foo.com = www.foo.com)

Many web sites can be reached via multiple hostnames. "www.xav.com", for example, can also be addressed as "xav.com", "nickname.net", "ftp.xav.com", "ww.xav.com", and "209.68.17.186". In these examples, "www.xav.com" is the hostname that I would prefer to use in all cases, so it will be called the "authoritative hostname". The other forms will be called "non-standard hostnames". The goal is to internally re-write all URLs containing non-standard hostnames so that they use the authoritative hostname.

The existence of alternate names creates the following problems:

  1. If the web crawler encounters the same document under different hostnames, it will think they are different documents because the literal URLs differ. This may lead to duplicates in the index.

  2. When indexing a single site, the crawler will only follow links that match the pattern "http://$hostname/". If some of the links on the web site point to other documents on that same web site using a non-standard hostname, then the crawler will mistakenly treat those links as off-site links and will skip them.
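
Both problems can be illustrated with a short Perl sketch. This is not FDSE code; the URLs and variable names are made up for the example, and the pattern only follows the "http://$hostname/" form described above.

```perl
use strict;
use warnings;

# Two URLs that point at the same document under different hostnames.
my $url_a = 'http://www.xav.com/scripts/index.html';
my $url_b = 'http://xav.com/scripts/index.html';

# Problem 1: a literal string comparison sees two distinct documents.
my $same_document = ($url_a eq $url_b) ? 1 : 0;

# Problem 2: a crawler limited to "http://$hostname/" treats the
# non-standard link as off-site and skips it.
my $hostname = 'www.xav.com';
my @followed = grep { m{^http://\Q$hostname\E/} } ($url_a, $url_b);

print "same document: $same_document\n";
print "followed: @followed\n";
```

Here only $url_a ends up in @followed, even though both URLs lead to the same page.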

As of FDSE version 2.0.0.0054, various hostnames can be easily resolved to a single authoritative hostname by using an "Input URL Rewrite Filter".

To create a rewrite filter:

  1. Make sure you are using FDSE version 2.0.0.0054 or newer.

  2. Go to "Admin Page" => "Filter Rules" => (scroll down) "URL Rewrite Rules".

  3. The first section is labeled "Input Filters". To create a basic input filter which maps http://xav.com/ to http://www.xav.com/, enter:

    Enabled Verbose Pattern Replace
    Comment:

    The caret ^ is used at the beginning of the pattern to force the pattern to match at the beginning of the URL.

  4. A more complex input filter which maps all possible names to the authoritative hostname would be:

    Enabled Verbose Pattern Replace
    Comment:

    The parentheses and pipes are used to group and separate different strings that can all match in a certain place.
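
The effect of the caret, the parentheses, and the pipes can be tried outside FDSE with a small Perl sketch. It assumes that the "Pattern" field is treated as a Perl regular expression and the "Replace" field as plain replacement text; the patterns, the target URL, and the rewrite_url helper are illustrative and are not the exact values from the admin form.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one pattern/replacement pair to a URL.
sub rewrite_url {
    my ($url, $pattern, $replace) = @_;
    $url =~ s/$pattern/$replace/;
    return $url;
}

# Basic filter: the caret anchors the match at the start of the URL.
my $basic = qr{^http://xav\.com/};

# Complex filter: parentheses group the alternatives, pipes separate them.
my $complex = qr{^http://(xav\.com|nickname\.net|ftp\.xav\.com|ww\.xav\.com|209\.68\.17\.186)/};

# Assumed replacement text: the authoritative hostname.
my $target = 'http://www.xav.com/';

print rewrite_url('http://xav.com/help/', $basic, $target), "\n";
print rewrite_url('http://nickname.net/a.html', $complex, $target), "\n";
print rewrite_url('http://www.xav.com/a.html', $complex, $target), "\n";
```

The last URL already uses "www.xav.com", so neither pattern matches and it passes through unchanged. Without the caret, a pattern could also match a hostname that appears later in the URL, for example inside a query string.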

To remove an input filter, clear its "Pattern" field and then click the "Save Data" button.

An alternate way to correct this problem on a document-by-document basis is to use the FDSE-Index-As META header; see "Support for FDSE-Index-As META header".


The rest of this page describes how to solve this problem in FDSE version 2.0.0.0053 and older.

Working around this problem requires customizing the FDSE source code. The source code below is from FDSE version 2.0.0.0050; code from other versions will be similar.

Before you make this customization, you must remove all URLs that use the non-standard hostnames from the FDSE data files. Those URLs must be removed from all index files and from the search.pending.txt file. If you do not take this step, then FDSE may fall into an infinite loop and continually try to re-index one of the non-standard URLs.
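
To see which lines still mention a non-standard hostname, a small check such as the following may help. It is a hypothetical helper, not part of FDSE, and it assumes only that the data is line-oriented text; it does not rely on the actual layout of the index files or of search.pending.txt, and it reports lines rather than changing anything.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that mention any of the given
# hostnames as the host part of an http:// URL.
sub find_nonstandard_lines {
    my ($lines, $hosts) = @_;
    my $alternatives = join '|', map { quotemeta } @$hosts;
    # The host must be followed by a port, a path, whitespace, or end of line.
    my $re = qr{http://(?:$alternatives)(?=[:/]|\s|$)}i;
    return grep { $_ =~ $re } @$lines;
}

# Sample data standing in for the contents of a data file.
my @hosts  = ('xav.com', 'nickname.net', '209.68.17.186');
my @sample = (
    "http://www.xav.com/index.html\n",
    "http://xav.com/index.html\n",
    "http://nickname.net:8080/a.html\n",
);

print for find_nonstandard_lines(\@sample, \@hosts);
```

In practice the lines would be read from a copy of each data file instead of the @sample array. Back up the files before editing them.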

Edit subroutine "parse_url_ex" in library "searchmods/common.pl". That subroutine contains this code:

($host, $port, $path) = (lc($1), $2, &clean_path("/$3"));
$port = 80 unless $port;
if ($port == 80) {
	$clean_url = "http://$host$path";
	}
else {
	$clean_url = "http://$host:$port$path";
	}

Every single URL handled by FDSE is passed through this subroutine, and FDSE will pay attention only to the output $clean_url. Thus this subroutine is the logical place to re-write URLs. Replace the code above with:

($host, $port, $path) = (lc($1), $2, &clean_path("/$3"));
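# Map each non-standard hostname to its authoritative hostname.
# $host has already been lowercased, so use lowercase keys.
my %maps = (
	'xav.com' => 'www.xav.com',
	'nickname.net' => 'www.xav.com',
	'www.nickname.net' => 'www.xav.com',
	'209.68.17.186' => 'www.xav.com',
	'yahoo.com' => 'www.yahoo.com',
	'www3.yahoo.com' => 'www.yahoo.com',
	);
# Fall back to the original hostname when no mapping exists.
$host = $maps{$host} || $host;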
$port = 80 unless $port;
if ($port == 80) {
	$clean_url = "http://$host$path";
	}
else {
	$clean_url = "http://$host:$port$path";
	}

Each entry in the %maps hash is of the form "non-standard-name => authoritative-name". Now, whenever FDSE sees a new URL, whether from user input or by extracting it from an HTML page, it will re-write the hostname to use the authoritative name.
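
The mapping can be tested on its own before editing common.pl. The sketch below is a simplified stand-in for the relevant part of parse_url_ex, not the FDSE subroutine itself: the normalize_url name and its URL-splitting pattern are assumptions made for the example, and it does not call clean_path.

```perl
use strict;
use warnings;

# Non-standard hostname => authoritative hostname, as in the code above.
my %maps = (
    'xav.com'       => 'www.xav.com',
    'nickname.net'  => 'www.xav.com',
    '209.68.17.186' => 'www.xav.com',
);

# Hypothetical stand-in for the hostname handling in parse_url_ex.
sub normalize_url {
    my ($url) = @_;
    # Split the URL into host, optional port, and optional path.
    my ($host, $port, $path) = $url =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$}i
        or return;
    $host = lc $host;
    $host = $maps{$host} || $host;   # unmapped hosts are left alone
    $port ||= 80;
    $path = '/' unless defined $path;
    return $port == 80 ? "http://$host$path" : "http://$host:$port$path";
}

print normalize_url('http://XAV.com/help/1109.html'), "\n";
print normalize_url('http://nickname.net:8080/'), "\n";
print normalize_url('http://www.example.org/'), "\n";
```

The first two URLs come back with "www.xav.com" as the hostname, while the third has no entry in %maps and is returned as it was given.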


    http://www.xav.com/scripts/search/help/1109.html