How to make searches faster
Search time is proportional to the number of records to search, the size of each record, and the number of queries done against each record. Searches can be made faster by:
- Minimizing the total number of records
- Minimizing record size
- Minimizing the number of queries against each record
- Increasing the size of the result set
- Minimizing writes to the index
1. Minimizing the total number of records
This primarily involves removing duplicate and irrelevant entries. A good way to do this is to use the "Review" command, and then the "Delete" command on unneeded records. Duplicates often appear because of default documents and redundant hostnames: for example, "http://xav.com/", "http://xav.com/index.html", and "http://www.xav.com/" could all be indexed separately, even though they're the same file. FDSE will internally prevent literal duplicates, but only on a per-realm basis.
If the duplicates follow common patterns, you can write a Filter Rule to avoid them. For example, a filter rule called "Strip Default Names" could be set to Deny any URL that ends in "index.html|index.shtml|index.cgi|default.htm". As long as there are links to both "xav.com/" and "xav.com/index.html", this filter rule will only exclude pages that are already present in the index under another URL.
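As a rough illustration of what such a rule does, the sketch below applies the same pattern to a list of URLs. The function name and the sample URLs are made up for this example, and FDSE's own rule matching may work differently.

```python
import re

# Hypothetical pattern mirroring the "Strip Default Names" example above;
# FDSE's own rule syntax and matching logic may differ.
DEFAULT_NAMES = re.compile(r"/(index\.html|index\.shtml|index\.cgi|default\.htm)$")

def is_denied(url):
    """Return True when the URL ends in one of the default document names."""
    return DEFAULT_NAMES.search(url) is not None

urls = [
    "http://xav.com/",
    "http://xav.com/index.html",
    "http://xav.com/scripts/default.htm",
    "http://xav.com/scripts/search.html",
]

# Keep only the URLs the rule would not deny.
kept = [u for u in urls if not is_denied(u)]
print(kept)
```

Here "http://xav.com/index.html" is dropped while "http://xav.com/" is kept, so the page is still indexed once.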
2. Minimizing record size
A large list of "Ignore Words" is a good way to minimize record size. The list has grown in recent releases. You can check under "General Settings" to make sure you're using the recommended default.
In addition, you should look at the "Admin Page" => "General Settings" => "Max Characters: *" settings. You can reduce "Max Characters: Text" from 64000 to 4000, so that only the first 4000 bytes of text on a page are saved for searching. If your average page holds 8 KB of text, this will roughly double the speed of searches. In many cases, the most relevant keywords are used at the beginning of the document, so search accuracy won't suffer too much from this adjustment.
After making changes, you will have to rebuild the index.
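The effect of lowering "Max Characters: Text" can be estimated with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes search time scales directly with the amount of stored text, and the helper name is invented for the example.

```python
def stored_text_bytes(page_text_bytes, max_characters):
    """Bytes of page text kept in the index under a given "Max Characters: Text" limit."""
    return min(page_text_bytes, max_characters)

average_page = 8000  # roughly 8 KB of text per page, as in the example above

before = stored_text_bytes(average_page, 64000)  # whole page fits under the limit
after = stored_text_bytes(average_page, 4000)    # page is cut to the first 4000 bytes

# Ratio of text searched per record before and after the change.
print(before, after, before / after)  # 8000 4000 2.0
```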
3. Minimizing the number of queries against each record
FDSE generally performs one query per record per search term. In some cases it will perform fewer, in some cases more.
The search algorithm is optimized to stop executing queries if any forbidden term is present, or if any required term is not present. For each record, the algorithm first checks any/all forbidden terms, then any/all required terms, and finally any/all optional terms. Since most users don't use the "+term1 +term2 -term3 |term4" syntax, each term in a multi-term search is typically considered "required" or "optional" based on the Default Match setting of "Any Term" or "All Terms". Changing the default from "Any Term" to "All Terms" will cause searches to be more restrictive and return faster. This may or may not improve the usefulness of the result set.
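The early-exit order described above can be sketched as follows. This is a simplified model, not FDSE's actual code: it uses plain substring tests, and the function name and return shape are assumptions made for illustration.

```python
def match_record(text, forbidden=(), required=(), optional=()):
    """Return (matched, queries_run) for one record.

    Terms are checked in the order forbidden, required, optional, and
    checking stops as soon as the record is known not to match.
    """
    queries = 0
    for term in forbidden:
        queries += 1
        if term in text:
            return False, queries   # forbidden term present: stop now
    for term in required:
        queries += 1
        if term not in text:
            return False, queries   # required term missing: stop now
    hits = 0
    for term in optional:
        queries += 1
        if term in text:
            hits += 1
    # With no required terms, at least one optional term must be found.
    return (bool(required) or hits > 0), queries

record = "perl search engine"

# "Any Term": every term is optional, so both are always checked.
print(match_record(record, optional=("python", "perl")))   # (True, 2)

# "All Terms": every term is required, so the first miss ends the work.
print(match_record(record, required=("python", "perl")))   # (False, 1)
```

With "All Terms", a non-matching record can be rejected after a single query, which is where the speed gain comes from.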
The search terms pre-processor will remove any "Ignored Word" from the search string. A search for "what is the capitol of iowa?" will ignore "what is the of" and just search on the remaining two terms, "capitol" and "iowa". Thus the default Ignore Words set would reduce the queries per record by two thirds. Expanding the Ignore Words array is a good way to speed up searches.
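A minimal sketch of this pre-processing step is shown below. The ignore list is a small illustrative subset, not FDSE's real default list, and the tokenizing rule is an assumption.

```python
import re

# Illustrative subset only; the real default list is much longer.
IGNORE_WORDS = {"what", "is", "the", "of"}

def preprocess(query):
    """Lowercase the query, split it into words, and drop ignored words."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [w for w in words if w not in IGNORE_WORDS]

terms = preprocess("what is the capitol of iowa?")
print(terms)  # ['capitol', 'iowa']
```

Six words go in and two come out, which matches the two-thirds reduction in queries per record.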
If any of the "Multiplier: Description | Keyword | Title" settings is non-zero, FDSE will perform additional queries per record per search term to see whether the term appears in those special fields. If it does, FDSE increases the record's relevance ranking by the appropriate amount. While this increases the usefulness of the results, it makes the search process much more expensive, so leaving these values at 0 is good for performance.
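The trade-off can be pictured with a small scoring sketch. The record layout, the multiplier values, and the assumption of exactly one extra query per non-zero field are all hypothetical; FDSE's real ranking code may count and weight things differently.

```python
def score_term(term, record, multipliers):
    """Return (score, queries_run) for one term against one record."""
    queries = 1                       # the basic query against the page text
    score = 1 if term in record["text"] else 0
    for field, weight in multipliers.items():
        if weight == 0:
            continue                  # zero multiplier: the field is never queried
        queries += 1
        if term in record.get(field, ""):
            score += weight
    return score, queries

record = {
    "text": "fast perl search",
    "title": "perl search engine",
    "keywords": "perl, cgi",
    "description": "a search script",
}

print(score_term("perl", record, {"title": 0, "keywords": 0, "description": 0}))   # (1, 1)
print(score_term("perl", record, {"title": 10, "keywords": 5, "description": 3}))  # (16, 4)
```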
The setting "Show Examples: Enable" causes additional queries to be done that will display where in the document the search term was found. These context displays improve the usefulness of results, but the cost of additional queries will slow down the system. Disabling this feature will allow FDSE to optimize the search algorithm.
4. Increasing the size of the result set
Whenever a search is performed, FDSE queries all records, and develops a relevance-sorted list of results. Then, if the user wants results 1-10, FDSE provides that slice of the result set. If the user wants results 11-20, FDSE provides that slice. If a user views results 1-10 and then hits "Next" to see results 11-20, a brand new query is performed.
By increasing the "Hits per Page" from 10 to 25 or so, each user will have fewer "Next" hits, and the overall resource utilization will be lower. The cost of returning 25 results instead of 10 is just the cost of printing the extra records to screen, which is extremely cheap compared to the cost of a whole new query.
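The saving can be estimated by counting how many full searches a visitor triggers while paging. The sketch below assumes, as described above, that every page view reruns the whole query; the function names are invented for the example.

```python
import math

def page_slice(ranked_results, page, hits_per_page):
    """Return the slice of an already ranked result list shown on one page."""
    start = (page - 1) * hits_per_page
    return ranked_results[start:start + hits_per_page]

def searches_needed(results_viewed, hits_per_page):
    """Full searches run by a visitor who reads the first N results."""
    return math.ceil(results_viewed / hits_per_page)

ranked = list(range(1, 101))          # stand-in for 100 ranked hits

print(page_slice(ranked, 2, 10))      # results 11-20
print(searches_needed(25, 10))        # 3 searches at 10 hits per page
print(searches_needed(25, 25))        # 1 search at 25 hits per page
```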
5. Minimizing writes to the index
When FDSE updates the index file, it places an exclusive lock on the file to prevent any other process from reading it. All read processes enter a sleep loop and wait for the index update to complete. They will wait up to 30 seconds before timing out. Once the update is complete, the read processes proceed.
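The reader side of this behavior might look like the sketch below. It is only a model of the sleep loop: `is_locked` is a hypothetical callable standing in for whatever lock check FDSE really uses, and the one-second poll interval is an assumption.

```python
import time

def wait_for_index(is_locked, timeout=30.0, poll_interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the index is unlocked; return False if the timeout passes."""
    deadline = clock() + timeout
    while is_locked():
        if clock() >= deadline:
            return False              # gave up after waiting the full timeout
        sleep(poll_interval)
    return True                       # lock released, safe to read

# Simulated writer that holds the lock for the first two checks.
states = iter([True, True, False])
ok = wait_for_index(lambda: next(states), sleep=lambda seconds: None)
print(ok)  # True
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.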
Index update time is normally a second or less, but it scales linearly with the size of the index file. For index files in the tens of megabytes, the update time can be in the tens of seconds.
On busy websites with multiple simultaneous users and large index files, a large bottleneck can develop whenever an update occurs. Thus, minimizing the frequency of updates is critical. On busy sites, allowing anonymous visitors to add their own URL without waiting for approval will cause the index file to be under an almost constant write lock and searches will take a long time. A better solution is to enable "Require Approval" for all anonymous additions, and then periodically just approve all additions, perhaps once per hour or once per evening. When "Require Approval" is enabled, the new additions are appended to a temp file and do not lock the index.
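The difference between the two policies can be shown with a toy model that counts index write locks. The class and method names are invented for this illustration and do not correspond to FDSE's internals.

```python
class ToyIndex:
    """Toy model: counts how often the index file would be write-locked."""

    def __init__(self):
        self.records = []
        self.pending = []      # stands in for the temp file of unapproved URLs
        self.write_locks = 0

    def add_url(self, url, require_approval):
        if require_approval:
            self.pending.append(url)      # queued, index is not locked
        else:
            self._write([url])            # one lock per submission

    def approve_all(self):
        if self.pending:
            self._write(self.pending)     # one lock for the whole batch
            self.pending = []

    def _write(self, urls):
        self.write_locks += 1
        self.records.extend(urls)

submissions = ["http://example.com/page%d" % n for n in range(5)]

direct, queued = ToyIndex(), ToyIndex()
for url in submissions:
    direct.add_url(url, require_approval=False)
    queued.add_url(url, require_approval=True)
queued.approve_all()

print(direct.write_locks, queued.write_locks)  # 5 1
```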
See also: Building index files remotely
"How to make searches faster" http://www.xav.com/scripts/search/help/1032.html