How to index all documents linked from a site

This help article describes how to create a realm which indexes all off-site documents linked from a particular web site. The realm will contain only the individual documents that are linked -- it will not index the entire web site of the linked document.

For example, consider a web site with two files as follows:

---- contents of http://xav.com/ --
<a href="about.html">about</a>
<a href="http://www.yahoo.com/">yahoo.com</a>
-- eof --

---- contents of http://xav.com/about.html --
<a href="/">index</a>
<a href="http://msdn.microsoft.com/scripting/">MSDN scripting</a>
<a href="http://www.whitehouse.gov/president/">President</a>
-- eof --

The "link realm" will contain these three documents:

http://www.yahoo.com/
http://msdn.microsoft.com/scripting/
http://www.whitehouse.gov/president/
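
To make the example concrete, here is a minimal Python sketch that derives the same three URLs from the two pages above. It only illustrates the idea and is not how FDSE itself works: the names PAGES, LinkCollector, and off_site_links are hypothetical, and "off-site" is assumed to mean "the absolute URL does not start with http://xav.com/".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE = "http://xav.com/"  # base URL of the example site

# The two example pages from above, keyed by URL (hypothetical in-memory copy).
PAGES = {
    "http://xav.com/": (
        '<a href="about.html">about</a>'
        '<a href="http://www.yahoo.com/">yahoo.com</a>'
    ),
    "http://xav.com/about.html": (
        '<a href="/">index</a>'
        '<a href="http://msdn.microsoft.com/scripting/">MSDN scripting</a>'
        '<a href="http://www.whitehouse.gov/president/">President</a>'
    ),
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def off_site_links(pages, site):
    """Return absolute URLs linked from `pages` that do not start with `site`."""
    found = []
    for page_url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links such as "about.html" or "/" first.
            absolute = urljoin(page_url, href)
            if not absolute.startswith(site) and absolute not in found:
                found.append(absolute)
    return found


LINK_REALM = off_site_links(PAGES, SITE)
```

The relative links ("about.html" and "/") resolve to pages on xav.com and are dropped, so LINK_REALM holds only the three off-site documents.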

Creating a new realm

Follow these instructions to set up a link realm:

  1. Install FDSE version 2.0.0.0061 or newer.

  2. Navigate to Admin Page => Manage Realms => Show Advanced Commands and set it to 1 (checked).

  3. Navigate to Admin Page => Manage Realms => Create New Realm. Create a realm with the following parameters:

    Name: Links

    Type: Website Realms - Crawler Discovery

    Base URL: http://www.mysite.tld/

    Pattern: .

    (That's a period "." for the Pattern.)

  4. Save the realm configuration but do not choose the "rebuild" link.

  5. Go to Admin Page => Filter Rules => Create New Rule. Create a rule with the following parameters:

    Parameter    Value
    Name:        Parse entire site, but do not index
    Enabled:     [x]
    Action:      (*) Follow links, but do not index document
    Analyze:     (*) URL
    Occurrences: 1
    Logic:       (*) Apply rule only if...
                 ( ) Always apply rule, unless...
    Strings:     http://www.mysite.tld/
    Patterns:    (leave blank)
    Scope:       [x] Apply only to these specific realms:
                 (*) Links

  6. Next, create another filter rule with the following parameters:

    Parameter    Value
    Name:        Index off-site links
    Enabled:     [x]
    Action:      (*) Index document, but do not follow links
    Analyze:     (*) URL
    Occurrences: 1
    Logic:       ( ) Apply rule only if...
                 (*) Always apply rule, unless...
    Strings:     http://www.mysite.tld/
    Patterns:    (leave blank)
    Scope:       [x] Apply only to these specific realms:
                 (*) Links

  7. After saving these two Filter Rules, return to your admin page and click "rebuild" next to the "Links" realm.

    Note that the first page in the "rebuild" action will return "Error" due to the "noindex,follow" rule, but don't worry -- just continue with the indexing process until it is complete.

This realm configuration is essentially a web crawler realm that is designed to search the entire Internet (that is the effect of using "." as the advanced Pattern). The first filter rule tells the crawler not to save local pages to the index, but to extract links from them. The second filter rule allows off-site pages to be indexed, but prevents further links from being extracted from them.
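
The interplay of the two filter rules can be sketched in Python as follows. This is a simplified model, not FDSE's actual code: the link graph WEB, the functions classify and crawl, and the assumption that the "Strings" value is matched as a substring of the URL are all hypothetical choices made for illustration.

```python
from collections import deque

BASE = "http://www.mysite.tld/"

# Hypothetical link graph standing in for real page fetches: URL -> linked URLs.
WEB = {
    "http://www.mysite.tld/": [
        "http://www.mysite.tld/about.html",
        "http://www.yahoo.com/",
    ],
    "http://www.mysite.tld/about.html": [
        "http://www.mysite.tld/",
        "http://msdn.microsoft.com/scripting/",
    ],
    "http://www.yahoo.com/": ["http://www.yahoo.com/news/"],
    "http://msdn.microsoft.com/scripting/": [],
}


def classify(url, base=BASE):
    """Return (index, follow) for a URL under the two filter rules."""
    if base in url:
        # Rule 1: "Apply rule only if" the URL contains the base string.
        return (False, True)  # follow links, but do not index document
    # Rule 2: "Always apply rule, unless" the URL contains the base string.
    return (True, False)  # index document, but do not follow links


def crawl(start=BASE, web=WEB):
    """Breadth-first crawl that applies classify() to every visited URL."""
    indexed, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        index, follow = classify(url)
        if index:
            indexed.append(url)
        if follow:
            for link in web.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return indexed


INDEXED = crawl()
```

In this model the local pages are visited but never stored, the two off-site pages are stored, and http://www.yahoo.com/news/ is never reached because links on off-site pages are not followed.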

Rebuilding an existing realm

This realm type is somewhat "expensive" to rebuild because your entire site must be analyzed each time to find out whether any links have changed.

This realm also will contain a "memory" of all linked pages. If new links are added to your site, you can just use the "rebuild" command to have them included. However, if some links are removed from your pages, and you want them to be removed from the "Links" realm as well, you need to Delete the realm and then create a new one with the same Name, Base URL, and Pattern. The two filter rules do not need to be recreated. Doing the full Delete and re-create will ensure that all old links are removed from memory, so that the new index only contains fresh ones.
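
The "memory" behavior described above can be pictured with a small Python sketch. It is only a model of the described effect: the rebuild function and the use of a set are hypothetical and say nothing about how FDSE stores its index.

```python
def rebuild(realm_memory, links_found_now):
    """Model of "rebuild": keep every remembered URL and add newly found ones."""
    return realm_memory | links_found_now


# First build: the site links to two off-site documents.
memory = rebuild(
    set(),
    {"http://www.yahoo.com/", "http://www.whitehouse.gov/president/"},
)

# The whitehouse.gov link is later removed from the site.
# A plain rebuild still remembers it.
STALE = rebuild(memory, {"http://www.yahoo.com/"})

# Deleting and re-creating the realm starts again from an empty memory.
FRESH = rebuild(set(), {"http://www.yahoo.com/"})
```

STALE still contains the removed link, while FRESH contains only the link that is currently on the site.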

Miscellaneous

In the special case where all off-site links appear on a single page, you can create a "File-Fed Realm" and use the single links page as the Base URL. Always use the File-Fed realm when possible, since it requires neither the advanced commands nor the special Filter Rules. It will automatically do the right thing during rebuilds, so a manual delete of the realm isn't required for each rebuild.

There is no way to build full website realms for all websites linked from a given site or file. FDSE can only index the individual documents that are linked.


    "How to index all documents linked from a site"
    http://www.xav.com/scripts/search/help/1175.html