Administration: Creating a realm configuration

General Recommendations

The following realm configuration should work for most people. It assumes that FDSE has indexed a small collection of web sites and documents about a single subject, and that the majority of queries will search all realms:

  1. Decide on the scope of your search engine, that is, which content it will cover. Keep in mind that FDSE can handle only about 10,000 documents.

  2. For each entire web site to be indexed, create a "Website - Crawler Discovery" realm.

  3. If there are also a few individual documents that should be searched, create a single "Open Realm" and add those miscellaneous URLs to it.

On the other hand, if FDSE is being installed to cover several separate topics, use this strategy. This assumes that most queries will be done against a specific realm in the dropdown select box, and that the "All" choice might even have been removed:

  1. Decide on the scope of your search engine, that is, which content it will cover. Keep in mind that FDSE can handle only about 10,000 documents per search. Thus, each topic area can safely hold this many documents, if the option to search all topics is removed.

  2. For each subject area, create an open realm.

  3. Go to "Admin Page" => "Manage Realms" and enable the setting "Allow Index Entire Site".

    (In FDSE version 2.0.0.0056 and newer, that setting has been renamed to "Show Advanced Commands".)

  4. Use the "Add New URL" form to add each site or document to the open realm dedicated to that subject area. For sites, check the checkbox labeled "Index Entire Site". (Do not use the "Add New Site" form, since that will automatically create a separate website realm instead of using the existing open realms.)

    Note: when rebuilding each open realm, the "Rebuild" function can be used to re-index all individual documents. To completely re-index all sites, including the addition of documents added since the last rebuild, the administrator must re-add the site to the open realm using the "Add New URL" form with "Index Entire Site" enabled.

Other types of usage may require special configurations of realms, but the two configurations above should work for most.
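
As a rough planning aid, the 10,000-document guideline from both strategies can be checked against estimated realm sizes. The sketch below is illustrative only: the function names, realm names, and document counts are hypothetical, and the limit is simply the approximate figure quoted above, not a value read from FDSE.

```python
# Approximate per-search capacity quoted in this help page (not an FDSE setting).
APPROX_DOCS_PER_SEARCH = 10_000


def largest_search_size(realm_sizes, all_choice_enabled):
    """Return the largest number of documents a single query could cover.

    realm_sizes: dict mapping a (hypothetical) realm name to its document count.
    all_choice_enabled: True if visitors can still search every realm at once.
    """
    if not realm_sizes:
        return 0
    if all_choice_enabled:
        # Searching "All" covers every realm in one query.
        return sum(realm_sizes.values())
    # Otherwise each query is limited to one realm, so the largest realm matters.
    return max(realm_sizes.values())


def within_guideline(realm_sizes, all_choice_enabled):
    """Return True if no single search exceeds the approximate capacity."""
    return largest_search_size(realm_sizes, all_choice_enabled) <= APPROX_DOCS_PER_SEARCH


# Hypothetical estimates for three topic realms.
sizes = {"cooking": 6000, "gardening": 4500, "travel": 8000}
print(within_guideline(sizes, all_choice_enabled=True))
print(within_guideline(sizes, all_choice_enabled=False))
```

With these made-up numbers, the three realms fit the guideline only when the "All" choice is removed, which matches the second strategy above.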


Types of Realms

FDSE uses the concept of "realms" to organize groups of documents. Each of the six realm types is designed for a different purpose.

  1. Open Realm

    The open realm is flexible and can contain a variety of documents. If you enable visitor-added URLs, then an open realm must be created to accept those URLs.

    This realm type is intended to include an assortment of documents from various web sites. If you want to include all documents on a given web site, then one of the Website Realms would be recommended instead.

    This type of realm is not available in Freeware mode, though all others are.

  2. File-Fed Realm

    The file-fed realm contains the set of all documents directly linked from a single starter document.

    This type of realm can be used to easily index all the pages listed on your links page, for example. It is also commonly used with computer-generated starter documents as a way to programmatically define which URLs will be in an index.

  3. Website Realm - Crawler Discovery

    This type of realm includes all documents on a single site. To find all the documents on the site, the crawler extracts links from pages. Thus this type of realm contains only documents which are linked, directly or indirectly, from the main page.

  4. Website Realm - File System Discovery

    This type of realm includes all documents on a single site. To find all documents, FDSE programmatically opens folders and scans the list of files in each folder. Thus, all documents are included, whether they are linked or not. This type of realm can only be used for web sites on the same physical server as the FDSE search engine.

  5. Runtime Realms

    Runtime realms are identical to "Website Realm - File System Discovery", except that for these, the index is rebuilt each time a visitor performs a search. Because of this, the search results are always up-to-date.

    This type of realm might be useful when applied to a small folder of documents which are updated frequently. Otherwise, this realm type is extremely inefficient and uses excessive resources.

  6. Filtered Realms

    This is an advanced type of realm for use by advanced users.

    See "How and when to use Filtered Realms" for more information.
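
The computer-generated starter document mentioned under the File-Fed Realm can be as simple as an HTML page with one link per URL. The following is a minimal sketch, assuming the URL list comes from some other program; the function name and the sample list are hypothetical and not part of FDSE.

```python
from html import escape


def build_starter_document(urls, title="Starter document"):
    """Return a minimal HTML page containing one link per URL."""
    links = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(u, quote=True))
        for u in urls
    )
    return (
        "<html><head><title>{}</title></head>\n"
        "<body><ul>\n{}\n</ul></body></html>\n"
    ).format(escape(title), links)


# Hypothetical list; in practice this might come from a database query.
urls = [
    "http://www.xav.com/",
    "http://xav.com/scripts/search/help/",
]
starter = build_starter_document(urls)
print(starter)
```

The output would be saved somewhere the crawler can fetch it, and that address used as the starter document for the file-fed realm.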


Definition of "web site" for purposes of Realms

FDSE considers a web site to be a folder and all included documents on a given web server. The following are examples of web sites:

http://www.xav.com/
http://www.nickname.net/~foobar/
http://xav.com/scripts/search/help/

In the list of realm types, we mention the "Runtime Realm" and how it spans an entire "web site". In practice, an administrator could have a runtime realm which indexed only a few frequently-changing files in a given subfolder. The administrator could then create a more efficient Website Realm to handle all the other files and folders on the same server.
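
The folder-based definition above amounts to a prefix test on the URL. The sketch below illustrates the idea only; the function name is hypothetical, and treating "www.xav.com" and "xav.com" as different hosts is an assumption of this example rather than documented FDSE behavior.

```python
from urllib.parse import urlsplit


def in_web_site(url, site):
    """Return True if url is on the same host as site and inside its folder."""
    u, s = urlsplit(url), urlsplit(site)
    # Assumption: scheme and host must match exactly (case-insensitive host).
    if (u.scheme, u.netloc.lower()) != (s.scheme, s.netloc.lower()):
        return False
    # Treat the site as a folder, so "/~foobar/" does not match "/~foobar2/".
    folder = s.path if s.path.endswith("/") else s.path + "/"
    return u.path.startswith(folder)


print(in_web_site("http://www.nickname.net/~foobar/pics/a.html",
                  "http://www.nickname.net/~foobar/"))
print(in_web_site("http://www.nickname.net/~other/index.html",
                  "http://www.nickname.net/~foobar/"))
```

Under this test, a document in a subfolder belongs both to the subfolder's web site and to every enclosing one, which is why a narrow runtime realm and a broader website realm can coexist on the same server.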


Comparison of Discovery/Access Methods

Web Crawler

Used with: Open, File-Fed, and Website/Crawler realms

Advantages:

The crawler accesses a web page just like a web browser - by making an HTTP connection, sending a request, and parsing the response. Thus, the results obtained with a web crawler tend to be more accurate.

The web crawler must be used on programmatically-generated sites, where each unique web page is designated by a query string like "default.asp?id=10" rather than "file_10.html".
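
Crawler discovery can be pictured as a breadth-first walk over links, starting from the main page. The sketch below is not FDSE's crawler: it uses a hypothetical in-memory link table in place of real HTTP requests, and the example.com addresses are invented. It shows that pages distinguished only by a query string are treated as separate documents, and that a page nothing links to is never found.

```python
from collections import deque


def crawl(start, links):
    """Breadth-first discovery over an in-memory link graph.

    links maps each URL to the list of URLs that its page links to.
    Returns the set of URLs reachable from start, including start itself.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: two query-string pages, one static page, one orphan.
links = {
    "http://example.com/": [
        "http://example.com/default.asp?id=10",
        "http://example.com/default.asp?id=11",
    ],
    "http://example.com/default.asp?id=10": ["http://example.com/about.html"],
    "http://example.com/orphan.html": ["http://example.com/"],
}
found = crawl("http://example.com/", links)
for url in sorted(found):
    print(url)
```

Here orphan.html links to the main page but is not linked from anywhere, so it does not appear in the result; only file system discovery would pick it up.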

Disadvantages:

Using the web crawler is more resource-intensive than simply opening a file. Rebuilding a realm using web crawler discovery will take 5-10 times longer than using the file system.

The web crawler cannot tell the difference between documents that have been updated, and those that haven't. Thus, when rebuilding realms, every document must be completely re-processed. This magnifies the already resource-intensive nature of these types of realms.

File System

Used with: Website/File System and Runtime realms

Advantages:

Discovering files using the file system is comprehensive. All files are included, not just those linked in HTML.

Navigating folders and file contents is fast and efficient.

The file system discovery method allows FDSE to query for the true last update times of files. Under certain circumstances, FDSE will short-cut past files that haven't been updated since the last rebuild of the realm. This improves the update speed.
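
The update shortcut can be illustrated by comparing each file's modification time with the time of the last rebuild. This is a minimal sketch of the general idea, not FDSE's own logic; the function name is hypothetical, and the conditions under which FDSE actually applies its shortcut are not modeled here.

```python
import os
import tempfile


def files_changed_since(root, last_rebuild):
    """Yield paths under root whose modification time is newer than last_rebuild.

    last_rebuild is a Unix timestamp (seconds since the epoch).
    """
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or cannot be examined; skip it here
            if mtime > last_rebuild:
                yield path


# Demonstration with two temporary files and artificial timestamps.
with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, "old.html")
    new = os.path.join(root, "new.html")
    for path in (old, new):
        with open(path, "w") as handle:
            handle.write("<html></html>")
    os.utime(old, (1000, 1000))  # modified before the pretend rebuild
    os.utime(new, (3000, 3000))  # modified after the pretend rebuild
    changed = [os.path.basename(p) for p in files_changed_since(root, 2000)]

print(changed)
```

Only the file touched after the pretend rebuild time is reported, so only that file would need to be re-processed. A crawler has no comparable timestamp to rely on, as noted in the web crawler section.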

Disadvantages:

This method is not recommended for dynamic output. In general, PHP, PL, CGI, ASP pages, etc., can be indexed this way, but only if the dynamic portion of the file is used for minor things like embedded advertisements or header/footer blocks. If the bulk of the text data itself is inserted dynamically, for example by being pulled from a database, then this method shouldn't be used because it will index only the source code of the file, not its output.

This method will index all documents in the folder hierarchy as long as they are accessible using the file system. If a folder has been protected against web access by using an Apache .htaccess file, for example, then the search engine may still index the file, even though visitors wouldn't be able to follow the link in the search results. This may even reveal sensitive data. Administrators should review the list of pages indexed with the file system, and remove any files or folders that shouldn't be accessible to visitors. (FDSE has been designed to make this fast and easy, but it can't do the work for you.)
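
One way to start such a review is to list the folders that contain a .htaccess file. The sketch below only produces candidates for manual inspection, under stated assumptions: the function name is hypothetical, a .htaccess file may contain nothing related to access control, and access can also be restricted in the main server configuration, which this check does not see.

```python
import os


def folders_with_htaccess(root):
    """Return the folders under root that contain a .htaccess file, sorted."""
    flagged = []
    for folder, _subfolders, filenames in os.walk(root):
        if ".htaccess" in filenames:
            flagged.append(folder)
    return sorted(flagged)
```

Calling folders_with_htaccess() on the document root of the site gives a list of folders to compare against the pages shown in the index.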

Each folder on Unix systems must have the "read" permission granted before the file system discovery method can enumerate its files. If the folders have only 711 permissions (execute but not read), then FDSE will omit their contents without providing any error message in the admin user interface.
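
Because such folders are skipped silently, it can help to look for them before rebuilding. The following is a minimal sketch, assuming a Unix system and assuming the search engine runs as a user that is neither the owner of the folders nor in their group, so that the "other" read bit is the one that matters. The function name is hypothetical; pass a different bit from the stat module if that assumption does not hold.

```python
import os
import stat


def folders_without_read(root, read_bit=stat.S_IROTH):
    """Return folders below root whose permission mode lacks read_bit, sorted.

    root itself is not checked, only the folders found beneath it.
    """
    missing = []
    for folder, subfolders, _filenames in os.walk(root):
        for name in subfolders:
            path = os.path.join(folder, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                continue  # cannot examine this folder; skip it in this sketch
            if not mode & read_bit:
                missing.append(path)
    return sorted(missing)
```

A folder with 711 permissions would be reported, while one with 755 would not. Note that the walk can only descend into folders that the user running this script can read.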

History: prior to FDSE version 2.0.0.0031, there were only two types of realms, "remote realms" and "local realms", which corresponded to "open realms" and "website realms - file system discovery", respectively.


    "Administration: Creating a realm configuration"
    http://www.xav.com/scripts/search/help/1097.html