How to automatically rebuild the index

This article describes how to rebuild the search indexes (realms) automatically.

Automatic rebuilds require an external scheduling agent, such as Unix "cron" or the Windows Task Scheduler. FDSE is entirely event-driven and cannot perform scheduled work on its own.

Summary

FDSE ships with a command-line utility at search/searchmods/powerusr/cmd_admin.pl. This utility exposes a command-line interface for rebuilding all indexes. To schedule automatic rebuilds, call this utility with the appropriate parameters from your task scheduler.

Step-by-step instructions

Follow these steps to automatically rebuild the index on a schedule.

  1. System requirements

    Find out whether you have permission to run scheduled tasks on the web server where the search engine is installed.

    Ask your web host the following questions:

    • Do I have permission to run scheduled tasks? (All servers support scheduled tasks, using "crontab -e" on Unix and "Control Panel => Scheduled Tasks" on Windows. Your web host may or may not allow you to use these services, however.)

    • How are scheduled tasks managed? Where do I enter the command line for each task? How can I remove or edit existing command lines? (Most web hosts have disabled access to the standard scheduling tools, and have replaced them with proprietary web-based control panels.)

    • What sort of reporting will occur when my scheduled task runs successfully? What about when the scheduled task fails? (Unix "cron" sends an email with the output of each command, which is nice. The Windows Task Scheduler has no built-in reporting. Either could be customized by the web host though.)

    • Do you have resource limits in place for scheduled tasks? Will the processes be killed when they exceed a certain threshold?

    • What is the full file system path to my web root?

    You need to fully understand all aspects of the above questions before you can continue with FDSE automation.

  2. Check your extension

    Check the file "search/searchmods/powerusr/cmd_admin". It will be there as "cmd_admin.pl" or as "cmd_admin.cgi". If you have "cmd_admin.cgi", then use that filename everywhere in this document when we refer to "cmd_admin.pl". The file is the same in either case; some web hosts just require one extension over the other. "cmd_admin.pl" is the standard for use in documentation.

  3. Create test command file

    You need to create a single text file which contains all of the commands that will be run. The file should have the *.bat extension on Windows web servers, and the *.sh extension on other servers, for example "rebuild.bat" or "rebuild.sh".

    This file will contain your FDSE admin password in clear text. To be safe, save this file in a folder outside of your web root. If you cannot save files outside of your web root, then at least give the file a hard-to-guess filename, like "secret_1234_rebuild.bat".

    Windows:

    Here is what you should put in the command file if you have a Windows web server. You will need to customize the path and password to match your system:

    cd /d e:\webroot\www.xav.com\search\searchmods\powerusr
    perl cmd_admin.pl Password=password listrealms 1> log.txt 2>&1

    Unix, Linux:

    Here is what you should enter for a Unix or Linux web server:

    cd /webroot/www.xav.com/search/searchmods/powerusr
    perl cmd_admin.pl Password=password listrealms

    If your task scheduling software is not configured to email you the output, then you should redirect the output to a log using:

    cd /webroot/www.xav.com/search/searchmods/powerusr
    perl cmd_admin.pl Password=password listrealms 1> log.txt 2>&1

    On some systems, the file must begin with an interpreter ("shebang") line:

    #!/bin/sh
    cd /webroot/www.xav.com/search/searchmods/powerusr
    perl cmd_admin.pl Password=password listrealms

    On some systems, you may need a ./ before the script name:

    cd /webroot/www.xav.com/search/searchmods/powerusr
    perl ./cmd_admin.pl Password=password listrealms

    After you upload your rebuild file to the server, make it executable but only readable by you (chmod 700 "-rwx------"). This will allow the scheduling agent to run the file, while making it less likely that other users will be able to read the contents to discover your FDSE admin password.
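
    As a rough illustration of the permission change, the sketch below applies it to a throwaway temporary file and prints the resulting mode string. The temporary file is only a stand-in so the commands are safe to try anywhere; on a real server you would run chmod against your own command file instead.

```shell
# Stand-in for rebuild.sh so the sketch can be run without side effects.
CMD_FILE=$(mktemp)

# Owner may read, write and execute; group and others get no access.
chmod 700 "$CMD_FILE"

# The first ten characters of "ls -l" are the file type and permission bits.
PERMS=$(ls -l "$CMD_FILE" | cut -c1-10)
echo "$PERMS"

rm -f "$CMD_FILE"
```

    If the printed mode string is "-rwx------", only the file's owner can read or run it.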

    Note: if you have renamed the main script to something other than "search.pl" or "search.cgi", you will need to edit "cmd_admin.pl" and enter the new filename. The command-line utility functions by calling the main script. Also, do not allow both "search.pl" and "search.cgi" to exist from different versions. The utility may call the wrong one and fail.

  4. Run test command file

    Schedule your command file to be run using your server's scheduled task interface.

    When it finishes running, check the output of the command file. This will either arrive via an automated email from the server, or it will be written to the log.txt file in the powerusr folder.

    • If there is no output, double-check your file system paths within the command file. Double-check that you have entered the correct file system path to the command file within your task scheduling interface. If you continue to have problems, you will need to contact your web host.

    • If the output contains a full HTML page for the admin "login" action, then the password you provided on the command line is not correct.

      Passwords with metacharacters should be URL-encoded. For example, password "bob&sue" must be encoded as "bob%26sue" because the ampersand is a shell metacharacter.

    • If the command is successful, the output will look like this:

      Local folder is 'e:\xav.com\search\searchmods\powerusr'; attempting to chdir()
      Realm: www.pair.com
      Realm: www.xav.com

    If you get the success output, then great. We can now move to the next step of doing a rebuild.
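
    The checks described in this step can be sketched as two small shell helpers. Both function names are hypothetical and are not part of FDSE. The first percent-encodes a password and assumes the password contains only ASCII characters. The second sorts captured output into the three cases listed above; it assumes the login page can be recognized by an "<html" tag, which is a guess about the page markup rather than a documented marker.

```shell
# Percent-encode every character that is not an RFC 3986 unreserved character.
# Assumption: the input is plain ASCII.
urlencode() {
    input=$1
    output=""
    while [ -n "$input" ]; do
        rest=${input#?}            # everything after the first character
        char=${input%"$rest"}      # the first character itself
        case $char in
            [A-Za-z0-9._~-]) output="$output$char" ;;
            *) output="$output$(printf '%%%02X' "'$char")" ;;
        esac
        input=$rest
    done
    printf '%s\n' "$output"
}

# Classify captured output (log.txt or a saved email body) given its path.
classify_output() {
    if [ ! -s "$1" ]; then
        echo "empty"         # no output: check paths and scheduler settings
    elif grep -q '^ *Realm: ' "$1"; then
        echo "ok"            # realm list was printed: the command succeeded
    elif grep -qi '<html' "$1"; then
        echo "login-page"    # assumed marker for the admin login page
    else
        echo "unknown"
    fi
}

urlencode 'bob&sue'    # prints bob%26sue
```

    The encoded value is what goes after "Password=" in the command file. Always wrap the raw password in single quotes when passing it to urlencode so the shell does not interpret the metacharacters first.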

  5. Create rebuild command file

    Edit your original command file and find the line:

    perl cmd_admin.pl Password=password listrealms

    Replace this with:

    perl cmd_admin.pl Password=password rebuild All

  6. Run rebuild command file

    Use your scheduler to make the rebuild command file run every day or so. Your indexes will now be automatically kept up to date.

    You will still be able to use the web-based admin page to force a rebuild when needed.
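
    On a Unix host that allows direct use of "crontab -e", a daily schedule might look like the sketch below. The path to the command file is hypothetical, and the 03:15 start time is an arbitrary choice. Hosts that replace cron with a web-based control panel will ask for the same information in a different form.

```shell
# Hypothetical location of the command file, outside the web root.
REBUILD_SCRIPT="/home/example/private/rebuild.sh"

# Crontab fields: minute hour day-of-month month day-of-week command
# This entry runs the command file once a day at 03:15.
CRON_LINE="15 3 * * * $REBUILD_SCRIPT"

# Print the entry so it can be pasted into the editor opened by "crontab -e".
echo "$CRON_LINE"
```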

  7. Optimize rebuild commands with staggered calls

    If you are indexing a small site, then you can just use the "rebuild All" command and you will be fine.

    When indexing large data sets, or any time when your scheduled task may be killed, you should use a staggering strategy. This strategy replaces a single long-running process with dozens of smaller processes which are separated in time. Each staggered process has a small, manageable workload.

    The first part of this strategy is to use the DaysPast parameter. This parameter tells FDSE to only re-index documents older than the given number of days. An example of this syntax is:

    perl cmd_admin.pl Password=password rebuild All DaysPast=3

    Next, instead of calling your scheduled task daily, or every three days, you call it frequently, like every four hours.

    In this case, every four hours when FDSE runs, it will search for any URLs which haven't been re-indexed in the last 3 days. If they've all been recently indexed, it will immediately exit and wait for another 4 hours. It uses very few resources for this operation, and so it is safe to run it every few hours. Other times, FDSE will be called and will find URLs that need to be re-indexed. It will re-index them. It will then index any new URLs which it found while indexing. If it is killed before it can complete, that is okay. The next process that starts in four hours will be able to continue where it left off.

    Using this strategy, each scheduled task only needs to accomplish about 5.6% of the workload (4 hours out of 3 * 24 = 72 hours). The system will be able to auto-recover if some tasks fail early on. All content in the indexes will be 0 to 3 days old.

    You can adjust the 4-hour, 3-day intervals to tune the workload size and maximum content age. The DaysPast parameter supports decimals, for example DaysPast=0.5. (Note that unfortunately, decimal support was broken in FDSE builds 0064 through 0070; only integers worked in those builds.)
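
    When tuning these intervals, the per-run share of the workload is the scheduling interval divided by the DaysPast window. The helper below is a small sketch of that arithmetic; the function name is made up for illustration, and it assumes documents come due evenly over the window.

```shell
# $1 = hours between scheduled runs, $2 = DaysPast value (may be fractional)
workload_percent() {
    awk -v h="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * h / (d * 24) }'
}

workload_percent 4 3      # every 4 hours, DaysPast=3   -> prints 5.6
workload_percent 4 0.5    # every 4 hours, DaysPast=0.5 -> prints 33.3
```

    A shorter DaysPast window keeps content fresher but makes each run responsible for a larger share of the documents.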

Dealing with timeouts, kills and hung processes

The web server may, at its discretion, kill the scheduled task at any time. The web server will usually do this if the task uses too much time or too many resources. In addition, the indexing process itself runs using the web crawler, which makes synchronous network calls. These can hang if the crawler is connecting to an unresponsive server. Any of these problems will stop the rebuild action.

To handle this, the process saves results to disk with every 10th request. For example, if the rebuild process indexes 57 documents and is then killed, then the first 50 documents will be saved to the index. The remaining 7 would have been in-memory when the process was killed. They would be lost.
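
The same arithmetic applies to any document count. A minimal sketch, using the batch size of 10 stated above:

```shell
# Documents fetched before the process was killed (the example from the text).
fetched=57
# Results are written to disk on every 10th request.
batch=10

saved=$(( fetched - fetched % batch ))   # last completed multiple of 10
lost=$(( fetched % batch ))              # documents still in memory

echo "saved=$saved lost=$lost"           # prints saved=50 lost=7
```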

If you experience timeouts, kills, or hanging, it is possible to restart the indexing from where it left off, but only by using the staggered calls strategy described above.

Dealing with overlapped processes

The FDSE data files are write-locked, and so it is unlikely that there will be any data loss if two admin tasks operate at the same time. Examples of multiple simultaneous admin tasks would be one scheduled task running while you are using the web-based admin page, or two scheduled tasks running at the same time.

Due to the aggressive write-locking, however, an admin task may hang for a long time while waiting for the other admin task to finish its work. Also, all searches will hang while the data files are write-locked. Thus, it is best to plan tasks so that there will never be more than one process trying to update the data files at any given time.
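
One way to keep two scheduled runs from overlapping is to guard the command file with a lock of its own. The sketch below uses a lock directory created with mkdir, a generic shell technique that is not part of FDSE. The lock location and function names are hypothetical.

```shell
# Hypothetical lock location; any directory writable by the scheduled task works.
LOCK_DIR="${TMPDIR:-/tmp}/fdse_rebuild.lock"

# mkdir succeeds only if the directory does not already exist,
# so exactly one caller can obtain the lock.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

# Run the given command only if no other run currently holds the lock.
run_rebuild_once() {
    if acquire_lock; then
        status=0
        "$@" || status=$?
        release_lock
        return "$status"
    else
        echo "Another rebuild appears to be running; skipping this run." >&2
        return 1
    fi
}
```

In the command file, the rebuild line would then be written as "run_rebuild_once perl cmd_admin.pl Password=password rebuild All". Note that a task killed by the server never reaches release_lock, so the lock directory stays behind and blocks later runs. With the staggered strategy, where kills are expected, the stale directory must be removed by hand or by a separate cleanup rule before this guard is practical.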

History: the cmd_admin.pl script is available in FDSE version 2.0.0.0041 and newer.


    "How to automatically rebuild the index"
    http://www.xav.com/scripts/search/help/1087.html