
Preventing duplicate records

Here are three common sources of duplicate records within FDSE search results:

  1. The same exact URL is present in multiple realms
  2. Different URLs point to the same web page, in the same realm
  3. The same URL is present twice in the same realm, once as "/" and once as "/index.html"

1. Avoid duplicates across realms

Avoiding cross-realm duplicates is best achieved by managing your realm configuration.

Create only one Open Realm. It will hold miscellaneous files.

When creating Website Realms, create two Filter Rules with each new Website Realm. For example, I could create Website Realm "xav.com" to handle the site "www.xav.com". I create these two associated Filter Rules:

  1. Filter Rule named "xav.com - allow". It is set to "Always Allow" pages if the "Hostname" matches string "xav.com".

    I set the Scope property to apply only to the specific realm named "xav.com".

  2. Filter Rule named "xav.com - deny". It is set to "Deny" pages if the "Hostname" matches string "xav.com".

    I set the Scope property to apply to all realms.

Now, since "Always Allow" overrides "Deny", the only way xav.com pages can be added to the index is if they are added to my "xav.com" website realm. If a visitor tries to add an xav.com page to an open realm, he will receive an error ("denied due to hostname"). Without the filters, the visitor's addition would have succeeded and would have caused a duplicate record in the search results.

Note that with this approach, there may still be duplicates among multiple Open Realms and File-Fed Realms. Using only a single realm of these types will avoid the problem.
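The two-rule setup above can be modeled in a few lines of Python. This is only a sketch of the precedence logic as described here ("Always Allow" overrides "Deny", and a rule's Scope limits which realms it applies to); it is not FDSE's actual code. The names FilterRule and is_allowed, the realm name "open", and the substring-style hostname match are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FilterRule:
    """Hypothetical model of a Filter Rule (not FDSE's real data structure)."""
    name: str
    action: str                 # "always_allow" or "deny"
    hostname_match: str         # string the Hostname must contain
    scope: Optional[str]        # realm name, or None for "all realms"

    def applies(self, hostname: str, realm: str) -> bool:
        # A rule applies when the realm is in scope and the hostname matches.
        in_scope = self.scope is None or self.scope == realm
        return in_scope and self.hostname_match in hostname


def is_allowed(hostname: str, realm: str, rules: List[FilterRule]) -> bool:
    """Return True if a page on `hostname` may be added to `realm`."""
    matching = [r for r in rules if r.applies(hostname, realm)]
    # "Always Allow" overrides "Deny", so check it first.
    if any(r.action == "always_allow" for r in matching):
        return True
    if any(r.action == "deny" for r in matching):
        return False
    # Assumption: pages that match no rule are allowed.
    return True


RULES = [
    FilterRule("xav.com - allow", "always_allow", "xav.com", scope="xav.com"),
    FilterRule("xav.com - deny", "deny", "xav.com", scope=None),
]

print(is_allowed("www.xav.com", "xav.com", RULES))   # website realm
print(is_allowed("www.xav.com", "open", RULES))      # open realm
```

In this model, an xav.com page passes only when it is headed for the "xav.com" realm; in any other realm only the "deny" rule applies, so the addition is refused.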

2. Avoid duplicates within a single realm

FDSE will internally prevent literal duplicates within a single realm. Problems arise for pages with many different URL names, such as:

All of these would be indexed separately by FDSE and all would be returned by a search for "notify".

One good way to control this behavior is with the FDSE-INDEX-AS META header. Add the following META header to the web page:

<meta name="fdse-index-as" content="http://www.xav.com/notify/" />

Now, whenever the file is indexed, the URL will be updated to a consistent value.
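To illustrate the idea, here is a rough Python sketch of how an indexer could honor such a header: it reads the page, looks for a META tag named "fdse-index-as", and records the page under that URL instead of the URL it was fetched from. This is an assumption about the general technique, not FDSE's implementation; the names IndexAsParser and effective_url, and the fetched URL in the usage lines, are hypothetical.

```python
from html.parser import HTMLParser


class IndexAsParser(HTMLParser):
    """Collects the content of the first fdse-index-as META tag, if any."""

    def __init__(self):
        super().__init__()
        self.index_as = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        if tag != "meta" or self.index_as is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "fdse-index-as":
            self.index_as = attr.get("content")


def effective_url(fetched_url: str, html: str) -> str:
    """Return the URL to index the page under (hypothetical helper)."""
    parser = IndexAsParser()
    parser.feed(html)
    # Fall back to the fetched URL when the header is absent.
    return parser.index_as or fetched_url


PAGE = (
    '<html><head>'
    '<meta name="fdse-index-as" content="http://www.xav.com/notify/" />'
    '</head><body>notify</body></html>'
)

# The fetched URL below is an illustrative variant, not taken from the article.
print(effective_url("http://www.xav.com/notify/index.html", PAGE))
```

However the page is reached, every variant resolves to the same indexed URL, so only one record is kept.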

See also "Support for FDSE-Index-As META header" and "Preventing duplicates for sites with multiple names (i.e. foo.com = www.foo.com)".

3. Avoid duplicates due to default document (/ == /index.html)

You should design your links as:

<a href="http://xav.com/">xav.com</a>   # recommended
<a href="http://xav.com/index.html">xav.com</a>   # NOT RECOMMENDED

Both formats work, but the second could cause serious problems later if you decide to switch to index.shtml, index.php, or some other default document. Changing all of your own links would be labor-intensive but possible; other sites and search engines, however, would still be linking to the old format. You would either lose traffic from those sources or have to set up and maintain redirect documents.

Therefore, the following link format is best:

<a href="http://xav.com/">xav.com</a>   # no default document listed

The tighter format also saves bandwidth.

This approach will not work if your site is served by different servers which support different default names, or if your site is served by the filesystem, which has no default document. In those rare cases, it is best to always use the explicit name. At least you will achieve consistency. Older versions of IIS also require an explicit default document to transfer query data (/?foo != /default.asp?foo), and so you'll need the explicit default document there as well.

As an alternative to rewriting your HTML files, you can add a URL-Rewrite Input Filter to strip explicit default documents. Go to "Admin Page" => "Filter Rules" => (scroll down) "URL Rewrite Rules" => "Input Filters". Create a filter which matches pattern "/index.html$" and replaces it with "/". Note that the "$" is required in the pattern to ensure that the pattern only matches at the end of the URL. Similar patterns could be created for other default document names.
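The effect of such a filter can be previewed with a short Python sketch. It assumes the pattern is treated as a regular expression, which the "$" anchor suggests but which this page does not state outright. The sketch also escapes the "." so that it matches only a literal dot; the function name strip_default_document is hypothetical.

```python
import re

# "$" anchors the match to the end of the URL; "\." matches a literal dot.
DEFAULT_DOC = re.compile(r"/index\.html$")


def strip_default_document(url: str) -> str:
    """Replace a trailing "/index.html" with "/" (illustrative helper)."""
    return DEFAULT_DOC.sub("/", url)


for url in (
    "http://xav.com/index.html",
    "http://xav.com/notify/index.html",
    "http://xav.com/index.html.bak",   # not at the end, so left alone
    "http://xav.com/",
):
    print(url, "=>", strip_default_document(url))
```

Without the "$", the third URL would be rewritten too, which is why the anchor matters.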
