How to make indexing faster
The amount of time it takes to index your documents depends on several factors. Here are some suggestions for speeding up the indexing process.
-
Use command-line indexing
The command-line indexing process is about 20 times faster than the web-based indexing process. It is also much more stable, since it sidesteps the CGI resource limits imposed on the web-based indexing process.
See How to use FDSE from the command line and How to automatically rebuild the index for examples.
The command-line interface is only available to those who have telnet or console access to their web server.
-
Use Internet Explorer with JavaScript
The web-based indexing process contains special JavaScript tags which are only recognized by Internet Explorer. These tags allow the web-based indexing process to restart automatically if it suffers a CGI timeout. CGI timeouts are a frequent problem when rebuilding large index files, and normally they cause the process to stop until a human can intervene and restart it. With Internet Explorer, the process can recover on its own and is therefore more likely to finish when left unattended.
Note that this only refers to the preferred browser configuration for the administrator who is rebuilding the index. Actual visitors to your search pages can use any browser. And any browser will still work for rebuilding the index, just not as well.
-
Use "file system discovery" rather than "web crawler discovery"
When indexing web sites which are located on the same physical web server as the FDSE script, you can use "file system discovery" instead of "web crawler discovery". Accessing files directly via the file system is faster and more efficient than going over the network.
See Admin Page => Manage Realms => Create New Realm to review all the different options available for each realm, including the discovery method.
-
Optimize settings
The General Setting "Crawler: Max Pages Per Batch" controls the maximum number of documents that will be processed before the live index file is updated. Updating the live index file is a time-consuming process due to the search-and-replace nature of the update, and also because the indexing process needs to wait for all search processes to finish reading before it is allowed to update. Thus, maximizing the documents per batch increases the efficiency of the overall process.
The General Setting "Timeout" is similar to "Max Pages Per Batch" but limits the clock time of the batch, rather than the document count. Experiment with setting each value very high.
Note that, between writes to the live index file, the indexed documents are held in memory. If the number of pages per batch is very high (more than a few hundred), server memory may be used up, and the process will either fail or start paging to disk, which is very slow. Thus, the total documents per batch should not be increased without limit.
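The trade-off described above can be sketched with a rough cost model. This is only an illustration, not FDSE code: the function name and all of the timing figures are hypothetical, and it assumes that each batch ends with exactly one live-index update of fixed cost. It does not model the memory limit, which is what caps the useful batch size in practice.

```python
import math

def estimated_index_seconds(total_docs, docs_per_batch, seconds_per_doc, seconds_per_update):
    """Rough model: time to parse every document plus one live-index update per batch."""
    batches = math.ceil(total_docs / docs_per_batch)
    return total_docs * seconds_per_doc + batches * seconds_per_update

# Hypothetical figures: 5,000 documents, 0.25 s to parse each, 20 s per live-index update.
for batch_size in (10, 50, 250):
    print(batch_size, estimated_index_seconds(5000, batch_size, 0.25, 20))
```

Under these assumed figures the parsing time stays constant while the update overhead shrinks as the batch grows, which is why the gain from ever-larger batches levels off.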
-
Optimizing Pages Per Batch - File System Discovery Realms
File System Discovery realms always write directly to a temp file while rebuilding, rather than to the live index file, and so they do not share the slow update problems found in crawler realms. Also, because they write directly, the memory consumption is much lower. The General Setting "Timeout" is used to throttle the indexing across multiple CGI processes to prevent a web server time-out. Setting the time-out to a high value will save some time, since there is an automatic sleep of 15 seconds between each process.
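As a rough illustration of why a higher time-out saves time, the sketch below estimates the total sleep added to a rebuild. It is not FDSE code: the function name is hypothetical, and it assumes that every process runs for its full time-out and that the 15-second sleep occurs only between consecutive processes.

```python
import math

SLEEP_BETWEEN_PROCESSES = 15  # seconds, the automatic sleep described above

def total_sleep_seconds(work_seconds, timeout_seconds):
    """Sleep time added when the work is split into processes of at most timeout_seconds each."""
    processes = math.ceil(work_seconds / timeout_seconds)
    # Sleeps happen only between processes, so there is one fewer sleep than processes.
    return max(processes - 1, 0) * SLEEP_BETWEEN_PROCESSES

# Hypothetical rebuild needing 600 seconds of actual indexing work.
print(total_sleep_seconds(600, 30))   # 20 processes
print(total_sleep_seconds(600, 120))  # 5 processes
```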
-
Use "Revisit Old" instead of "Rebuild"
The "Rebuild" command will re-index every document in the index.
The "Revisit Old" command is more selective.
For realms that use the file system, the "Revisit Old" command will only re-index documents that have been updated since the index was last built. Also, any new documents will be indexed, and documents that no longer exist will be removed from the index.
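The selection logic for file system realms might look something like the following sketch. It is not FDSE's actual implementation: the function name and arguments are hypothetical, and it assumes that "updated" is decided by comparing each file's modification time against the time of the last index build.

```python
import os

def plan_revisit(root, indexed_paths, last_build_time):
    """Return (to_index, to_remove) for a file-system realm rooted at root.

    indexed_paths: set of file paths currently in the index (hypothetical structure).
    last_build_time: Unix timestamp of the last index build.
    """
    to_index = []
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            seen.add(path)
            # Index files that are new, or that were modified after the last build.
            if path not in indexed_paths or os.path.getmtime(path) > last_build_time:
                to_index.append(path)
    # Anything in the index that was not found on disk no longer exists.
    to_remove = [path for path in indexed_paths if path not in seen]
    return sorted(to_index), sorted(to_remove)
```

Unchanged files are skipped without being read, which is where the time saving over a full "Rebuild" would come from.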
For realms that use the crawler ("website - web crawler", "open", and "file-fed" realms), the "Revisit Old" command will only re-index web pages that have not been visited in the last 30 days *. That command will also index any web page that is in the queue waiting to be indexed but has never been indexed before. So, if you are building a website realm for the first time and then the indexing process fails, you should continue using the "Revisit Old" command. That way you will still get all the new links as they are discovered but you won't overwrite the work you've accomplished so far.
* The actual number of days used to decide whether a document is "old" is governed by the General Setting "Crawler: Days Til Refresh" which has a default value of 30. When recovering from a failing rebuild, it might be helpful to set this value to 1.
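For crawler realms, the rule above can be sketched as follows. This is an illustration only, not FDSE code: the function name and the dictionary layout are hypothetical, and the `days_til_refresh` parameter simply stands in for the "Crawler: Days Til Refresh" setting.

```python
from datetime import datetime, timedelta

def pages_to_revisit(last_visited, now, days_til_refresh=30):
    """Select the URLs that a "Revisit Old" pass would index.

    last_visited maps URL -> datetime of the last visit, or None if the page
    is queued but has never been indexed.
    """
    cutoff = now - timedelta(days=days_til_refresh)
    return sorted(url for url, visited in last_visited.items()
                  if visited is None or visited < cutoff)

pages = {
    "http://www.example.com/a.html": datetime(2024, 1, 1),
    "http://www.example.com/b.html": datetime(2024, 2, 20),
    "http://www.example.com/c.html": None,  # discovered but never indexed
}
print(pages_to_revisit(pages, now=datetime(2024, 3, 1)))
```

Lowering `days_til_refresh` to 1 in this sketch widens the selection to include pages visited more than a day ago, which mirrors the recovery tip in the footnote.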
-
Selecting a Realm Type
Indexing with the File System Crawler is very fast compared to the Web Crawler. Use "Website Realms - File System Crawler" to maximize the speed of indexing local sites. The File System Crawler can also detect which files have changed when rebuilding the index, allowing it to index only updated files (this algorithm is used with the "Revisit Old" command). The Web Crawler must always re-index every file in the realm.
-
Reduce the total number of documents indexed
Any document that is not likely to be useful to visitors should be removed from the index. From the Admin Page, choose "Review" to see which documents are currently indexed. Documents which are not useful can be permanently removed by choosing "Delete" and then selecting option 2, "add to forbidden sites list".
The "Forbid Pages" filter rule and the robots.txt file are more efficient at blocking pages than the Robots META tag, since parsing the META tag requires the crawler to first access and parse the document.
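The efficiency difference comes from when the decision is made: a robots.txt rule can be checked against the URL alone, before any request is sent. The sketch below shows that kind of check using Python's standard `urllib.robotparser` module. It is not FDSE code; the crawler name "ExampleCrawler", the rules, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed once per site.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /archive/",
])

def should_fetch(url, agent="ExampleCrawler"):
    """Decide from the URL alone whether a page may be fetched; no download is needed."""
    return robots.can_fetch(agent, url)

print(should_fetch("http://www.example.com/index.html"))
print(should_fetch("http://www.example.com/archive/old.html"))
```

A Robots META tag, by contrast, can only be honored after the page has already been downloaded and parsed.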
-
Optimizing Document Size
The General Setting "Max Characters: File" allows you to determine the maximum number of bytes read from any document. Keeping this setting at a low value, like 64000 or 32000, will save time during the indexing process at the expense of some accuracy in searches, since text beyond the limit is never indexed.
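A minimal sketch of this kind of limit is shown below. It is not FDSE code: the function name is hypothetical, and it assumes the limit is applied by reading only the first `max_chars` bytes of each file.

```python
def read_limited(path, max_chars=64000):
    """Read at most max_chars bytes from the start of a file; the rest is ignored."""
    with open(path, "rb") as handle:
        return handle.read(max_chars)
```

In this sketch, a word that first appears after the limit never reaches the indexer, which is the loss of accuracy referred to above.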
-
Optimizing Realm Architecture
Realms are used to group web pages together for indexing purposes. When possible, it is best to group pages based on how often they need to be re-indexed. For example, if you have a single web site with 10,000 documents, of which 1,000 change daily, you could create two realms, one for each group. Then create a task that re-indexes only the smaller realm each day.
-
Trade-Offs Between Index and Search
In many cases, investing extra time while indexing can save time while searching. In every case where this trade-off is available, FDSE has taken it, since indexing is only done once every day or so, but searches are performed thousands of times each day. For example, having a large set of "Ignore Words" slows down the indexing process because each word from each document has to be parsed. However, the result is a much smaller, more quickly searched index file, so all resulting searches are faster and the overall CPU utilization of the web server is reduced.
The following features will use more resources during indexing to save time during searching:
- Ignore Words
- Filter Rules
- Character conversion settings like Accent Sensitive and English Language Searching
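The "Ignore Words" trade-off can be illustrated with a small sketch. It is not FDSE's actual algorithm: the function name, the tokenizing pattern, and the word list are all hypothetical. The point is only that extra filtering work at index time leaves fewer terms to store and search.

```python
import re

def index_terms(text, ignore_words):
    """Return the distinct lowercase words in text that are not in ignore_words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(set(words) - set(ignore_words))

print(index_terms("The cat and the hat", {"the", "and"}))
```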
Please contact Fluid Dynamics with any suggestions you have about how to best build your index files.
"How to make indexing faster"
http://www.xav.com/scripts/search/help/1093.html