How to ignore common search terms

The search engine supports "ignore" words as a way of keeping index files small and avoiding excessive hits when a user types in a common search term like "the". The list of ignored words is found under "Admin Page" => "General Settings" => "Ignore Words".

The ignore list works as follows: when indexing a file, the indexer loops over all ignored words and removes them from the text. Indexing time grows with the number of ignored words and with the size of the original file. However, indexing is an admin process, so it won't affect end users. Only the stripped-down text gets saved to the index file, which keeps it as small as possible.
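The indexing step described above can be sketched in Python. This is only an illustration, not the search engine's actual code: the function name strip_ignored_words, the lowercasing, and the whole-word matching rule are all assumptions made for the example.

```python
import re

def strip_ignored_words(text, ignore_words):
    """Remove each ignored word from the text (hypothetical helper).

    Assumes matching is case-insensitive and on whole words only.
    """
    stripped = text.lower()
    # Loop over the ignore list, as the indexer is described as doing.
    # One pass per ignored word, so cost grows with list and file size.
    for word in ignore_words:
        pattern = r"\b" + re.escape(word.lower()) + r"\b"
        stripped = re.sub(pattern, " ", stripped)
    # Collapse the gaps left behind by the removed words.
    return " ".join(stripped.split())

# Example ignore list; the real one comes from the admin settings.
record = strip_ignored_words("The cat is on a mat", ["the", "is", "a"])
print(record)
```

The same helper could be applied to the user's search terms at query time, so that both sides of the comparison are stripped in the same way.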

When a user searches, all ignored words are removed from the search terms as well. This process is very fast, since a set of search terms is typically no more than 20-30 characters long. Once all ignored words are removed from the search terms, the remaining terms are compared to the stripped-down records in the index files.

This will cause some false positives on phrase matching. For example, if the words "the", "is", and "a" are ignored, then the search phrase "bob the barber" will be translated to "bob barber", and will fire hits on documents that contain "bob the barber", "Mr. Bob Barber", and "bob is a barber".
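The false positives are easy to reproduce with a small sketch. The code below is an assumption-laden illustration rather than the engine's real matcher: the names IGNORE_WORDS, tokens, and phrase_matches are hypothetical, and the phrase match is modeled as a run of consecutive words after ignored words are dropped.

```python
import re

# Assumed ignore list, taken from the example in the text.
IGNORE_WORDS = {"the", "is", "a"}

def tokens(text):
    """Lowercase, split into words, and drop ignored words (hypothetical)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in IGNORE_WORDS]

def phrase_matches(phrase, document):
    """True if the stripped phrase occurs as consecutive words in the stripped document."""
    p, d = tokens(phrase), tokens(document)
    if not p:
        # A phrase made only of ignored words matches nothing in this sketch.
        return False
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

# All three documents from the example are hits for "bob the barber".
for doc in ["bob the barber", "Mr. Bob Barber", "bob is a barber"]:
    print(doc, "->", phrase_matches("bob the barber", doc))
```

Under these assumptions, a document such as "bob went to a barber" would not match, because "went" and "to" are not ignored and still separate the two words.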

