Indexing Microsoft Word documents using runtime conversion with Antiword

This document describes how to index Microsoft Word documents by automatically extracting the text from them. Because Perl has no native routines for extracting text from Word files, FDSE must shell out to a separate helper application. FDSE uses the free Antiword application for this.

Steps to enable:

  1. Make sure you're running FDSE version 2.0.0.0064 or newer.

  2. Begin by going to FDSE Admin Page => General Settings. Your server operating system will be listed at the top of the page, in the sentence "Web server operating system is MSWin32". Typical operating systems are MSWin32, linux, and FreeBSD.

  3. Next, go to the Antiword web site http://www.winfield.demon.nl/. If your server operating system is MSWin32, download the Windows binaries.

    If your server OS is not Windows, then download the source code and transfer it to your server. You must then compile the software, which requires a shell account (telnet or ssh). If you do not have that level of access, you may contact your web host to ask them to compile the software for you, or provide shell access. Follow the antiword instructions for compiling. A list of commands used for compiling can be found at the end of this file.

  4. Upload the binaries to your web server.

    • If you are using Windows, you will have an "antiword" folder which includes "antiword.exe" and a bunch of text files such as "8859-1.txt".

      Create an "antiword" folder as a subfolder of "search", and move the "antiword.exe" executable there.

      Next, create an ".antiword" folder inside the "search/searchdata" folder. Note the leading dot in ".antiword". The Windows explorer may prevent you from creating a folder with a leading dot; if that happens, you must open a command prompt and type "mkdir .antiword". Once the folder has been created by the command prompt, you may return to the Windows explorer to use the folder.

      Move all of the text files from the original "antiword" folder up to "search/searchdata/.antiword".

      Your directory structure will look like:

      /search/
      /search/antiword/                              # new folder
      /search/antiword/antiword.exe                  # new file
      /search/searchdata/
      /search/searchdata/.antiword/                  # new folder
      /search/searchdata/.antiword/8859-1.txt        # new file
      /search/searchmods/
      /search/search.pl

    • On other platforms, you will have a "bin" folder containing the "antiword" executable and possibly a "kantiword" file. You will then have a separate folder named ".antiword" that contains data files like "8859-1.txt".

      Create an "antiword" folder as a subfolder of "search", and move the "antiword" executable and "kantiword" file there. Then, move the entire ".antiword" folder inside the "search/searchdata" folder. Your directory structure would then look like:

      /search/
      /search/antiword/                            # new folder
      /search/antiword/antiword                    # new file
      /search/antiword/kantiword                   # new file
      /search/searchdata/
      /search/searchdata/.antiword/                # new folder
      /search/searchdata/.antiword/8859-1.txt      # new file
      /search/searchmods/
      /search/search.pl

  5. Apply read-execute permissions (chmod 755) to the antiword utility. Make sure the data files like 8859-1.txt are readable (chmod 644).

  6. Next, you need to tell FDSE that Antiword is available. Edit the file /search/search.pl and find the lines:

    %private = (
    	'antiword utility folder'  => "",
    	'pdf utility folder' => "",

    Enter the full or relative path to the Antiword folder. The "search/searchdata" folder is always the current working directory for this process, so write relative paths starting from that folder. Include a trailing slash. Use the path separator of your operating system: a backslash on Windows, written as \\ because a backslash must be doubled inside a double-quoted Perl string, and / on all others.

    %private = (
    	'antiword utility folder'  => "..\\antiword\\", # Windows example
    	'pdf utility folder' => "",

    %private = (
    	'antiword utility folder'  => "../antiword/",   # non-Windows example
    	'pdf utility folder' => "",

  7. Next, return to the FDSE Admin Page in your browser. Choose General Settings, then Binary Converters - Setup and Test.

  8. The binary converters page will list all known converters and their enabled/disabled status. Antiword should be listed as enabled, because you've customized its variable. Click on "Syntax Test" to make sure all basic tests work. If they do not, adjust the settings, file paths and file permissions until they are working. Contact the Fluid Dynamics support forum if you need help.

  9. After the syntax test is finished, you need to test Antiword with a real Word document. The first step in indexing a real file is configuring FDSE to discover Word documents. Go to Admin Page => General Settings => Crawler: Ignore Links To and remove "doc" from the extension list. Then go to General Settings => Ext and add "doc" to that extension list. Finally, return to "Binary Converters - Setup and Test" and click the "cross-reference" link. That will verify that your general settings match the binary converters that are loaded.

  10. Next, create a realm named "Binary Conversion Test" which will include the Word documents that you want to test. Don't rebuild the realm - just create it.

  11. Return to the "Binary Converters - Setup and Test" page and reload it. There will now be a link labeled "index all files". Click it to index. When you begin indexing from this page, all Antiword actions will include debug output, so you can see what is going on. Confirm that text is properly extracted from your Word documents.

  12. Finally, once testing is done, you may delete the "Binary Conversion Test" realm and begin to index Word documents normally.
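
If the syntax test in step 8 fails, it can help to check the path by hand before changing anything else. The following is a minimal sketch for non-Windows servers, assuming the layout from step 4. The function name check_antiword_path is hypothetical and is not part of FDSE or Antiword. It changes into "search/searchdata", the working directory FDSE uses, and confirms that the value planned for 'antiword utility folder' points at an executable "antiword" file and that the ".antiword" data folder exists.

```shell
#!/bin/sh
# Hypothetical helper, not part of FDSE: sanity-check the relative path
# planned for 'antiword utility folder' before running the Syntax Test.
check_antiword_path() {
    search_dir=$1        # the FDSE "search" folder that contains search.pl
    utility_folder=$2    # planned setting value, e.g. ../antiword/ (with trailing slash)
    (
        # FDSE resolves relative paths from search/searchdata, so test from there.
        cd "$search_dir/searchdata" || exit 1
        if [ -x "${utility_folder}antiword" ] && [ -d .antiword ]; then
            echo "ok: ${utility_folder}antiword is executable and .antiword exists"
        else
            echo "problem: check the path, the trailing slash, and the permissions" >&2
            exit 1
        fi
    )
}
```

A call such as `check_antiword_path /usr/www/users/xav/cgi-bin/search ../antiword/` returns a non-zero status when something is out of place, which narrows the problem down to the file system rather than the FDSE settings.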

Antiword extracts text only; there are no HTML titles, META descriptions, or META keywords. FDSE will use the filename as the title in the search results, and the first few words of the content as the description.
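
To illustrate what that means for a search result, here is a rough sketch of how a title and description could be derived from a file name and a file of extracted text. This is not FDSE's actual code: describe_doc is a hypothetical helper, and the 25-word cutoff is an arbitrary value chosen for the example.

```shell
#!/bin/sh
# Hypothetical illustration only; FDSE's own logic and word limit may differ.
describe_doc() {
    doc_path=$1    # path of the Word document
    text_file=$2   # file holding the plain text extracted from that document

    # The title is just the file name without its folder.
    title=$(basename "$doc_path")

    # The description is the first 25 whitespace-separated words of the text.
    description=$(awk -v max=25 '
        { for (i = 1; i <= NF && n < max; i++) { out = out (n ? " " : "") $i; n++ } }
        END { print out }' "$text_file")

    printf 'Title: %s\nDescription: %s\n' "$title" "$description"
}
```

Because the title comes from the file name, giving Word documents descriptive names makes the search results easier to read.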

The installation of Antiword is not automatically covered under the FDSE free custom install policy. However, on request, I will try to install and test Antiword for all customers who have purchased a registration key, and for others on a case-by-case basis. Note that Antiword might not be available for all platforms.

Known Problems: things to keep in mind if you have trouble:

See "Maintain HTML copies of each written binary document" for another way to index Word documents.

See also "File pathname restrictions when using binary converters".

Steps to compile:

Here are the commands used to compile Antiword on FreeBSD. Commands are the text that follows each ">" prompt; all other lines are output. The long lynx command has been continued to two lines for readability:

[vepar]/usr/home/xav> mkdir antiword
[vepar]/usr/home/xav> cd antiword
[vepar]/usr/home/xav/antiword> lynx -source
	http://www.winfield.demon.nl/linux/antiword-0.33.tar.gz > aw.tar.gz
[vepar]/usr/home/xav/antiword> ls -al
total 502
drwx------   2 xav  users     512 Jul  3 09:38 .
drwx-----x  15 xav  users    1536 Jul  3 09:37 ..
-rw-------   1 xav  users  240684 Jul  3 09:38 aw.tar.gz
[vepar]/usr/home/xav/antiword> tar -xzf aw.tar.gz
[vepar]/usr/home/xav/antiword> ls -al
total 506
drwx------   3 xav  users     512 Jul  3 09:38 .
drwx-----x  15 xav  users    1536 Jul  3 09:37 ..
drwx------   8 xav  users    2048 Jul 10  2002 antiword-0.33
-rw-------   1 xav  users  240684 Jul  3 09:38 aw.tar.gz
[vepar]/usr/home/xav/antiword> cd anti*
[vepar]/usr/home/xav/antiword/antiword-0.33> make
[vepar]/usr/home/xav/antiword/antiword-0.33> make install
mkdir -p /usr/home/xav/bin
cp -pf antiword kantiword /usr/home/xav/bin
mkdir -p /usr/home/xav/.antiword
cp -pf Resources/* /usr/home/xav/.antiword
[vepar]/usr/home/xav/antiword/antiword-0.33> cd ~/bin
[vepar]/usr/home/xav/bin> chmod -R 755 .
[vepar]/usr/home/xav/bin> chmod -R 755 ~/.antiword
[vepar]/usr/home/xav/bin> ls -al
total 312
drwxr-xr-x   2 xav  users     512 Jul  3 09:40 .
drwx-----x  16 xav  users    1536 Jul  3 09:40 ..
-rwxr-xr-x   1 xav  users  145036 Jul  3 09:40 antiword
-rwxr-xr-x   1 xav  users     840 Jul  8  2001 kantiword
[vepar]/usr/home/xav/bin>
[vepar]/usr/home/xav/bin> antiword
        Name: antiword
        Purpose: Display MS-Word files
        Author: (C) 1998-2002 Adri van Os
        Version: 0.33  (05 Jul 2002)
        Status: GNU General Public License
        Usage: antiword [switches] wordfile1 [wordfile2 ...]
        Switches: [-t|-p papersize][-m mapping][-w #][-i #][-Ls]
                -t text output (default)
                -p <paper size name> PostScript output
                   like: a4, letter or legal
                -m <mapping> character mapping file
                -w <width> in characters of text output
                -i <level> image level (PostScript only)
                -L use landscape mode (PostScript only)
                -s Show hidden (by Word) text
[vepar]/usr/home/xav/bin>

The above commands have installed binaries to "~/bin", with data files saved to "~/.antiword". After compilation, the files are moved into position next to FDSE. Long commands have been continued to multiple lines for readability:

[vepar]/usr/home/xav> mkdir /usr/www/users/xav/cgi-bin/search/antiword
[vepar]/usr/home/xav> cp
		~/bin/*word
		/usr/www/users/xav/cgi-bin/search/antiword
[vepar]/usr/home/xav/antiword> cp -r
		~/.antiword
		/usr/www/users/xav/cgi-bin/search/searchdata
[vepar]/usr/home/xav/antiword> chmod -R 755
	/usr/www/users/xav/cgi-bin/search/antiword
[vepar]/usr/home/xav/antiword> chmod -R 755
	/usr/www/users/xav/cgi-bin/search/searchdata/.antiword
[vepar]/usr/home/xav/bin>
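
The copy and chmod commands above can also be wrapped in a small function so they are easy to repeat after an upgrade. This is a sketch only, assuming the non-Windows layout from step 4. The name install_antiword_layout is hypothetical and is not part of FDSE or Antiword. Unlike the session above, it applies 644 to the data files, following step 5.

```shell
#!/bin/sh
# Hypothetical helper, not part of FDSE or Antiword: place compiled Antiword
# files next to FDSE using the non-Windows layout described in step 4.
install_antiword_layout() {
    src_bin=$1      # folder holding "antiword" and possibly "kantiword", e.g. ~/bin
    src_data=$2     # the ".antiword" data folder, e.g. ~/.antiword
    search_dir=$3   # the FDSE "search" folder that contains search.pl

    mkdir -p "$search_dir/antiword" "$search_dir/searchdata/.antiword" || return 1

    # Executables go in search/antiword/ and need read-execute permissions.
    cp "$src_bin/antiword" "$search_dir/antiword/" || return 1
    if [ -f "$src_bin/kantiword" ]; then
        cp "$src_bin/kantiword" "$search_dir/antiword/" || return 1
    fi
    chmod 755 "$search_dir/antiword"/* || return 1

    # Data files such as 8859-1.txt go in search/searchdata/.antiword/
    # and only need to be readable.
    cp "$src_data"/* "$search_dir/searchdata/.antiword/" || return 1
    chmod 644 "$search_dir/searchdata/.antiword"/* || return 1
}
```

For the account used in the session above, the call would be `install_antiword_layout ~/bin ~/.antiword /usr/www/users/xav/cgi-bin/search`.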

    "Indexing Microsoft Word documents using runtime conversion with Antiword"
    http://www.xav.com/scripts/search/help/1182.html