Indexing Microsoft Word documents using runtime conversion with Antiword
This document describes how to index Microsoft Word documents by automatically extracting their text. Because Perl has no native routines for extracting text from Word files, FDSE must shell out to a separate helper application. FDSE uses the free Antiword application for this.
Steps to enable:
-
Make sure you're running FDSE version 2.0.0.0064 or newer.
-
Begin by going to FDSE Admin Page => General Settings. Your server operating system will be listed at the top of the page, in the sentence "Web server operating system is MSWin32". Typical operating systems are MSWin32, linux, and FreeBSD.
-
Next, go to the Antiword web site http://www.winfield.demon.nl/. If your server operating system is MSWin32, download the Windows binaries.
If your server OS is not Windows, then download the source code and transfer it to your server. You must then compile the software, which requires a shell account (telnet or ssh). If you do not have that level of access, you may contact your web host to ask them to compile the software for you, or provide shell access. Follow the antiword instructions for compiling. A list of commands used for compiling can be found at the end of this file.
-
Upload the binaries to your web server.
-
If you are using Windows, you will have an "antiword" folder which includes "antiword.exe" and a bunch of text files such as "8859-1.txt".
Create an "antiword" folder as a subfolder of "search", and move the "antiword.exe" executable there.
Next, create an ".antiword" folder inside the "search/searchdata" folder. Note the leading dot in ".antiword". The Windows explorer may prevent you from creating a folder with a leading dot; if that happens, you must open a command prompt and type "mkdir .antiword". Once the folder has been created by the command prompt, you may return to the Windows explorer to use the folder.
Move all of the text files from the original "antiword" folder up to "search/searchdata/.antiword".
Your directory structure will look like:
    /search/
    /search/antiword/                          # new folder
    /search/antiword/antiword.exe              # new file
    /search/searchdata/
    /search/searchdata/.antiword/              # new folder
    /search/searchdata/.antiword/8859-1.txt    # new file
    /search/searchmods/
    /search/search.pl
-
On other platforms, you will have a "bin" folder containing the "antiword" executable and possibly a "kantiword" file. You will then have a separate folder named ".antiword" that contains data files like "8859-1.txt".
Create an "antiword" folder as a subfolder of "search", and move the "antiword" executable and "kantiword" file there. Then, move the entire ".antiword" folder inside the "search/searchdata" folder. Your directory structure would then look like:
    /search/
    /search/antiword/                          # new folder
    /search/antiword/antiword                  # new file
    /search/antiword/kantiword                 # new file
    /search/searchdata/
    /search/searchdata/.antiword/              # new folder
    /search/searchdata/.antiword/8859-1.txt    # new file
    /search/searchmods/
    /search/search.pl
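The file placement described above for non-Windows platforms can be sketched as a small shell function. This is an illustration only: the function name place_antiword and its three arguments are hypothetical, it copies rather than moves the files so the originals stay in place, and it assumes the "bin" and ".antiword" folders have the contents described above.

```shell
# place_antiword BIN_DIR DATA_DIR SEARCH_DIR
# Hypothetical helper: copies the antiword executable (and kantiword, if
# present) into SEARCH_DIR/antiword, and the data files from DATA_DIR into
# SEARCH_DIR/searchdata/.antiword.
place_antiword() {
    bin_dir=$1; data_dir=$2; search_dir=$3
    mkdir -p "$search_dir/antiword" "$search_dir/searchdata/.antiword" || return 1
    cp -p "$bin_dir/antiword" "$search_dir/antiword/" || return 1
    # kantiword is optional, so copy it only when it exists
    if [ -f "$bin_dir/kantiword" ]; then
        cp -p "$bin_dir/kantiword" "$search_dir/antiword/" || return 1
    fi
    # data files such as 8859-1.txt
    cp -p "$data_dir"/* "$search_dir/searchdata/.antiword/" || return 1
}

# Example (paths are placeholders):
#   place_antiword ~/bin ~/.antiword /path/to/search
```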
-
Apply read-execute permissions (chmod 755) to the antiword utility. Make sure the data files like 8859-1.txt are readable (chmod 644).
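On a non-Windows server, the permission settings can be applied from a shell roughly as follows. This is a sketch only: the function name set_antiword_permissions is hypothetical, and it assumes the "search" folder layout described in this document.

```shell
# set_antiword_permissions SEARCH_DIR
# Hypothetical helper: makes the antiword utility executable (755) and the
# data files in searchdata/.antiword world-readable (644).
set_antiword_permissions() {
    search_dir=$1
    chmod 755 "$search_dir/antiword/antiword" || return 1
    # kantiword is optional, so only touch it when it exists
    if [ -f "$search_dir/antiword/kantiword" ]; then
        chmod 755 "$search_dir/antiword/kantiword" || return 1
    fi
    # the data folder itself must be traversable by the web server user
    chmod 755 "$search_dir/searchdata/.antiword" || return 1
    chmod 644 "$search_dir/searchdata/.antiword"/* || return 1
}

# Example (path is a placeholder):
#   set_antiword_permissions /path/to/search
```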
-
Next, you need to tell FDSE that Antiword is available. Edit the file /search/search.pl and find the lines:
    %private = (
        'antiword utility folder' => "",
        'pdf utility folder' => "",
Enter the full or relative path to the Antiword folder. The "search/searchdata" folder is always the current working directory for this process, so create relative paths based on that folder. Include a trailing slash. Use the slash convention of your operating system, which is \\ for Windows and / for all others.
    %private = (
        'antiword utility folder' => "..\\antiword\\", # Windows example
        'pdf utility folder' => "",

    %private = (
        'antiword utility folder' => "../antiword/", # non-Windows example
        'pdf utility folder' => "",
-
Next, return to the FDSE Admin Page in your browser. Choose General Settings, then Binary Converters - Setup and Test.
-
The binary converters page will list all known converters and their enabled/disabled status. Antiword should be listed as enabled, because you've customized its variable. Click on "Syntax Test" to make sure all basic tests work. If they do not, adjust the settings, file paths, and file permissions until they are working. Contact the Fluid Dynamics support forum if you need help.
-
After the syntax test is finished, you need to test Antiword with a real Word document. The first step in indexing a real file is configuring FDSE to discover Word documents. Go to Admin Page => General Settings => Crawler: Ignore Links To and remove "doc" from the extension list. Then go to General Settings => Ext and add "doc" to that extension list. Finally, return to "Binary Converters - Setup and Test" and click the "cross-reference" link. That will verify that your general settings match the binary converters that are loaded.
-
Next, create a realm named "Binary Conversion Test" which will include the Word documents that you want to test. Don't rebuild the realm - just create it.
-
Return to the "Binary Converters - Setup and Test" page and reload it. There will now be a link labeled "index all files". Click it to index. When you begin indexing from this page, all Antiword actions will include debug output, so you can see what is going on. Confirm that text is properly extracted from your Word documents.
-
Finally, once testing is done, you may delete the "Binary Conversion Test" realm and begin to index Word documents normally.
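If the Syntax Test fails on a non-Windows server, the path setting can be checked outside FDSE. Because the 'antiword utility folder' value is resolved relative to "search/searchdata", the sketch below starts from that folder and tests whether the value leads to an executable named "antiword". The function name check_utility_folder is hypothetical. The commented manual run sets HOME to the "searchdata" folder so that Antiword can find its ".antiword" data folder there; that is an assumption about how Antiword locates its data files, not a description of FDSE internals.

```shell
# check_utility_folder SEARCH_DIR UTILITY_FOLDER
# Hypothetical helper: succeeds when UTILITY_FOLDER (written exactly as in
# search.pl, e.g. "../antiword/") leads to an executable named "antiword"
# when resolved from SEARCH_DIR/searchdata.
check_utility_folder() {
    (
        cd "$1/searchdata" || exit 1
        # the setting and the program name are joined directly, which is
        # why the trailing slash matters
        [ -x "${2}antiword" ]
    )
}

# Example (paths are placeholders):
#   check_utility_folder /path/to/search "../antiword/" && echo "path ok"
#
# Manual conversion of one document, run from the searchdata folder:
#   cd /path/to/search/searchdata
#   HOME=$PWD ../antiword/antiword /path/to/sample.doc | head
```

A value without the trailing slash, such as "../antiword", fails this check because it is joined to the program name as "../antiwordantiword".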
Antiword extracts text only; there are no HTML titles, META descriptions, or META keywords. FDSE will use the filename as the title in the search results, and the first few words of the content as the description.
The installation of Antiword is not automatically covered under the FDSE free custom install policy. However, on request, I will try to install and test Antiword for all customers who have purchased a registration key, and for others on a case-by-case basis. Note that Antiword might not be available for all platforms.
Known Problems: things to keep in mind if you have trouble:
-
The "Max Characters: File" setting causes most documents to only be read through the first 64,000 characters. This is smaller than many Word documents, and sending a truncated Word document to Antiword will cause it to fail. FDSE works around this problem for the majority of cases by ignoring the "Max Characters: File" setting for files which have the ".doc" extension. However, if you are retrieving Word documents from the web and the document URL does not end in ".doc", then you may experience this problem. You can work around it by setting "Max Characters: File" to 0 to bypass truncation, or by setting it to a sufficiently large value.
-
FDSE cannot distinguish between a valid response from Antiword and an invalid response (like "foo.doc is not a Word Document"). If Antiword is unable to parse a document, it will still be indexed, using its filename as the title, and using "No description available" as the description. If you find records like this in your search results, you may re-index them using the Binary Conversion Test so that all error and status information is shown. That will provide a clue as to what the problem is. If you are unable to resolve the problem, you may delete the record from the index.
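Since FDSE cannot tell a failed conversion from a successful one, it can help to run the converter over a folder of documents from the command line and list the files that produce no usable text. The sketch below does this. The function name list_failed_conversions is hypothetical, the converter command is passed in as an argument, and treating a non-zero exit status or empty output as a failure is an assumption that you should confirm against your own Antiword build.

```shell
# list_failed_conversions CONVERTER DIR
# Hypothetical helper: runs CONVERTER on every *.doc file in DIR and prints
# the names of files for which it exits non-zero or prints nothing.
list_failed_conversions() {
    converter=$1; dir=$2
    for doc in "$dir"/*.doc; do
        [ -f "$doc" ] || continue      # no matches: the glob stays literal
        if ! text=$("$converter" "$doc" 2>/dev/null) || [ -z "$text" ]; then
            printf '%s\n' "$doc"
        fi
    done
}

# Example (paths are placeholders):
#   list_failed_conversions /path/to/search/antiword/antiword /path/to/docs
```

Files reported by this loop are candidates for re-indexing through the Binary Conversion Test, where the full error output is shown.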
-
The web crawler will attempt Word-to-text conversion on only those documents which return the Content-Type "application/msword". If the Word documents are not returning an accurate Content-Type header, then they will not be processed properly.
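To see which Content-Type a server reports for a document, you can request only the response headers and inspect them. The sketch below reads raw HTTP response headers on standard input and succeeds only when a Content-Type header has the value "application/msword". The function name is_msword_response is hypothetical, fetching the headers with curl is an assumption about the tools on your machine, and the case-insensitive comparison is a convenience of this sketch rather than a statement about how FDSE compares the header.

```shell
# is_msword_response
# Hypothetical helper: reads HTTP response headers on stdin and succeeds
# when a Content-Type header has the value application/msword. Parameters
# such as "; charset=..." are ignored.
is_msword_response() {
    tr -d '\r' |
    awk -F': *' 'tolower($1) == "content-type" {
            split($2, parts, ";")            # drop "; charset=..." parameters
            gsub(/[ \t]+$/, "", parts[1])    # trim trailing whitespace
            if (tolower(parts[1]) == "application/msword") found = 1
        }
        END { exit !found }'
}

# Example (URL is a placeholder):
#   curl -sI "http://www.example.com/files/report.doc" | is_msword_response \
#       && echo "crawler will attempt conversion" \
#       || echo "Content-Type is not application/msword"
```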
See "Maintain HTML copies of each written binary document" for another way to index Word documents.
See also "File pathname restrictions when using binary converters".
Steps to compile:
Here are the commands used to compile Antiword on FreeBSD. Each command follows a shell prompt ending in ">"; all other lines are output. The long lynx command has been continued to two lines for readability:
    [vepar]/usr/home/xav> mkdir antiword
    [vepar]/usr/home/xav> cd antiword
    [vepar]/usr/home/xav/antiword> lynx -source \
        http://www.winfield.demon.nl/linux/antiword-0.33.tar.gz > aw.tar.gz
    [vepar]/usr/home/xav/antiword> ls -al
    total 502
    drwx------   2 xav  users     512 Jul  3 09:38 .
    drwx-----x  15 xav  users    1536 Jul  3 09:37 ..
    -rw-------   1 xav  users  240684 Jul  3 09:38 aw.tar.gz
    [vepar]/usr/home/xav/antiword> tar -xzf aw.tar.gz
    [vepar]/usr/home/xav/antiword> ls -al
    total 506
    drwx------   3 xav  users     512 Jul  3 09:38 .
    drwx-----x  15 xav  users    1536 Jul  3 09:37 ..
    drwx------   8 xav  users    2048 Jul 10  2002 antiword-0.33
    -rw-------   1 xav  users  240684 Jul  3 09:38 aw.tar.gz
    [vepar]/usr/home/xav/antiword> cd anti*
    [vepar]/usr/home/xav/antiword/antiword-0.33> make
    [vepar]/usr/home/xav/antiword/antiword-0.33> make install
    mkdir -p /usr/home/xav/bin
    cp -pf antiword kantiword /usr/home/xav/bin
    mkdir -p /usr/home/xav/.antiword
    cp -pf Resources/* /usr/home/xav/.antiword
    [vepar]/usr/home/xav/antiword/antiword-0.33> cd ~/bin
    [vepar]/usr/home/xav/bin> chmod -R 755 .
    [vepar]/usr/home/xav/bin> chmod -R 755 ~/.antiword
    [vepar]/usr/home/xav/bin> ls -al
    total 312
    drwxr-xr-x   2 xav  users     512 Jul  3 09:40 .
    drwx-----x  16 xav  users    1536 Jul  3 09:40 ..
    -rwxr-xr-x   1 xav  users  145036 Jul  3 09:40 antiword
    -rwxr-xr-x   1 xav  users     840 Jul  8  2001 kantiword
    [vepar]/usr/home/xav/bin>
    [vepar]/usr/home/xav/bin> antiword
    Name: antiword
    Purpose: Display MS-Word files
    Author: (C) 1998-2002 Adri van Os
    Version: 0.33 (05 Jul 2002)
    Status: GNU General Public License
    Usage: antiword [switches] wordfile1 [wordfile2 ...]
    Switches: [-t|-p papersize][-m mapping][-w #][-i #][-Ls]
        -t text output (default)
        -p <paper size name> PostScript output
           like: a4, letter or legal
        -m <mapping> character mapping file
        -w <width> in characters of text output
        -i <level> image level (PostScript only)
        -L use landscape mode (PostScript only)
        -s Show hidden (by Word) text
    [vepar]/usr/home/xav/bin>
The above commands have installed binaries to "~/bin", with data files saved to "~/.antiword". After compilation, the files are moved into position next to FDSE. Long commands have been continued to multiple lines for readability:
    [vepar]/usr/home/xav> mkdir /usr/www/users/xav/cgi-bin/search/antiword
    [vepar]/usr/home/xav> cp ~/bin/*word /usr/www/users/xav/cgi-bin/search/antiword
    [vepar]/usr/home/xav/antiword> cp -r ~/.antiword \
        /usr/www/users/xav/cgi-bin/search/searchdata
    [vepar]/usr/home/xav/antiword> chmod -R 755 \
        /usr/www/users/xav/cgi-bin/search/antiword
    [vepar]/usr/home/xav/antiword> chmod -R 755 \
        /usr/www/users/xav/cgi-bin/search/searchdata/.antiword
    [vepar]/usr/home/xav/bin>
"Indexing Microsoft Word documents using runtime conversion with Antiword"
http://www.xav.com/scripts/search/help/1182.html