URL rewrite limitations with file system discovery

FDSE supports three ways to rewrite URLs:

  1. A document can contain an "fdse-index-as" META header. If present, any URL used to retrieve the document will be rewritten to the URL string in the META header. This is useful to force a consistent name on documents accessible via multiple URL names. See Support for FDSE-Index-As META header.

  2. The URL Rewrite "input filters" are used to rewrite URLs as they are extracted from HTML. These filters are useful for forcing consistent URL names, such as when a link may appear as http://xav.com/foo or http://www.xav.com/foo.

  3. The URL Rewrite "output filters" are used when the URL for indexing is different from the URL visitors should follow. Output filters are used for indexing private HTML-ized versions of binary files, or for indexing SSL content.
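
As a rough illustration of the kind of normalization an input filter performs (item 2 above), the following Python sketch rewrites the www.xav.com host to xav.com. The function name and the regular expression are hypothetical and chosen only for this example; FDSE's own filter rules are configured through its admin pages and are not written in Python.

```python
import re

# Hypothetical rule: treat "www.xav.com" and "xav.com" as the same host.
# The lookahead keeps unrelated hosts such as "www.xav.community" untouched.
_WWW_PREFIX = re.compile(r"^(https?://)www\.xav\.com(?=[/:?#]|$)", re.IGNORECASE)


def normalize_link(url):
    """Rewrite an extracted link so that one host name is used consistently."""
    return _WWW_PREFIX.sub(r"\1xav.com", url)


print(normalize_link("http://www.xav.com/foo"))
print(normalize_link("http://xav.com/foo"))
```

Both calls print the same URL, so a page reachable under either host name would be recorded only once.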

These rewrite methods were designed for use with web crawler discovery, which is used by FDSE's default realms. The rewrite methods work differently with file system discovery, which is used by "Runtime Realms" and "Website Realms - File System Discovery". Those differences are described here.

The FDSE-Index-As Header

This META header is parsed for documents discovered using the file system. For normal builds of the index, everything will work as expected.

However, incremental builds (those that re-index only the files that have changed, using the "revisit old" link) will not work properly. During an incremental build, the indexer builds a fresh list of all files in the file system and compares it with all URLs in the index file. Because a document with the META header is stored under the URL named in that header, the indexer finds that the file's own URL is missing from the index, and so it tries to re-index the file. The indexer also finds a URL in the index (the one from the META tag) that matches no file, and it tries to drop this URL. As a result of this confusion, the "revisit old" action will cause all documents to be re-indexed, rather than only those that have changed.
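
The mismatch can be sketched with two set comparisons. This is only a simplified model under stated assumptions, not FDSE's actual code: the function name, the URLs, and the use of plain sets are all hypothetical, and the sketch shows the effect for a single document carrying the header.

```python
def plan_revisit(files_on_disk, urls_in_index):
    """Compare the fresh file list with the index and return (to_index, to_drop)."""
    to_index = files_on_disk - urls_in_index  # found on disk, missing from index
    to_drop = urls_in_index - files_on_disk   # found in index, missing from disk
    return to_index, to_drop


# Hypothetical data: a.html carries an fdse-index-as header naming alias.html,
# so the index holds the alias instead of the file's own URL.
disk = {"http://xav.com/a.html", "http://xav.com/b.html"}
index = {"http://xav.com/alias.html", "http://xav.com/b.html"}

to_index, to_drop = plan_revisit(disk, index)
print(sorted(to_index))
print(sorted(to_drop))
```

In this model a.html is queued for re-indexing and alias.html is queued for removal on every run, even though nothing on disk has changed.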

To avoid this problem, either do not use "fdse-index-as" with file system realms, or do not use the "revisit old" action link. Note that the URL Rewrite output filters provide functionality similar to "fdse-index-as", without the problems.

Input Filters

Input filters are not applied to file system discovery realms at all, as they are only applied when links are extracted from HTML pages. File system indexing does not parse HTML files for links.

Output Filters

Output filters work equally well for web crawler indexing and file system indexing. These filters are managed by going to Admin Page => Filter Rules => URL Rewrite Rules.

For example, suppose the binary file http://xav.com/license.doc has a private HTML-ized version at http://xav.com/x/license_doc.html that is used for indexing. An output filter could be created to map http://xav.com/x/license_doc.html to http://xav.com/license.doc. This filter would be applied at the very last moment, as the search results are displayed to the visitor. The actual index file would always use the original private URL, and so there would be no problems with the alphabetic order of the index file.
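
The license_doc.html mapping described above can be sketched as a lookup applied only when a result is displayed. The dictionary and function names are hypothetical and serve only to illustrate the idea; in FDSE itself the rule would be created under URL Rewrite Rules rather than in code.

```python
# Hypothetical output rule: private indexed URL -> public URL shown to visitors.
PRIVATE_TO_PUBLIC = {
    "http://xav.com/x/license_doc.html": "http://xav.com/license.doc",
}


def display_url(indexed_url):
    """Return the URL to show in search results; the index entry is not changed."""
    return PRIVATE_TO_PUBLIC.get(indexed_url, indexed_url)


print(display_url("http://xav.com/x/license_doc.html"))
print(display_url("http://xav.com/other.html"))
```

Because the stored URL is never modified, the comparison between the file list and the index is unaffected by the rewrite.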

Output filters can also be used when indexing SSL web sites; see Indexing secure SSL / https web pages.