Indexing PDF files using runtime conversion with XPDF
This document describes how to index PDF files by automatically extracting the text from them. Because there aren't routines for extracting text from PDF that are native to Perl, FDSE must shell out to a separate helper application. FDSE uses the free XPDF application for this.
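To make "shell out" concrete, here is a minimal sketch of running pdftotext from Perl. It is an illustration only, not FDSE's own code: the helper name pdf_to_text and the script name are hypothetical. It assumes a pdftotext build that accepts "-" as the output file name to write text to standard output, and a Perl that supports list-form pipe opens (older Perls on Windows may not).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of FDSE: run pdftotext on one file and
# return (wait_status, extracted_text). The "-" argument asks pdftotext
# to write the text to standard output instead of a .txt file.
sub pdf_to_text {
    my ($pdftotext, $pdf_file) = @_;
    # List-form pipe open avoids passing the file name through a shell.
    open(my $pipe, '-|', $pdftotext, $pdf_file, '-')
        or return (-1, '');
    local $/;                 # slurp mode: read all output at once
    my $text = <$pipe>;
    close $pipe;              # sets $? to the child's wait status
    return ($?, defined $text ? $text : '');
}

my ($bin, $file) = @ARGV;
if (defined $bin && defined $file) {
    my ($status, $text) = pdf_to_text($bin, $file);
    if ($status == -1) {
        print "could not start $bin\n";
    }
    elsif ($status & 127) {
        # The low bits hold a signal number when the child crashed.
        printf "converter died from signal %d\n", $status & 127;
    }
    else {
        printf "exit code %d, %d characters extracted\n",
            $status >> 8, length $text;
    }
}
else {
    print "usage: perl pdf_to_text.pl /path/to/pdftotext file.pdf\n";
}
```

Checking the wait status separately from the captured text is a design choice of this sketch; whether your XPDF version reports failures through a nonzero exit code is worth confirming against its own documentation.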
Steps to enable:
-
Make sure you're running FDSE version 2.0.0.0064 or newer.
-
Begin by going to FDSE Admin Page => General Settings. Your server operating system will be listed at the top of the page, in the sentence "Web server operating system is MSWin32". Typical operating systems are MSWin32, linux, and FreeBSD.
-
Next, go to the XPDF web site http://www.foolabs.com/xpdf/. Go to the downloads page and find a binary distribution that matches your server operating system.
(If your operating system is not listed, you can transfer the source code to your web server and compile it there, building your own binary. The steps involved are beyond the scope of this help file, though the XPDF distribution itself includes instructions on how to compile.)
-
Download the binaries from the XPDF web site to your local computer, and then upload them to your web server. You can place them anywhere, but a good location is in an "xpdf" subfolder of the main "search" folder. Your directory structure would then look like:
    /search/
    /search/searchdata/
    /search/searchmods/
    /search/search.pl
    /search/xpdf/            # new folder
    /search/xpdf/pdfinfo     # new file
    /search/xpdf/pdftotext   # new file
-
Apply read-execute permissions (chmod 755) to the pdfinfo and pdftotext utilities.
-
Next, you need to tell FDSE that XPDF is available. Edit the file /search/search.pl and find the lines:
    %private = (
        'antiword utility folder' => "",
        'pdf utility folder' => "",
Enter the full or relative path to the XPDF folder. The "search/searchdata" folder is always the current working directory for this process, so create relative paths based on that folder. Include a trailing slash. Use the slash convention of your operating system, which is \ for Windows (written as \\ inside the Perl string) and / for all others.
    %private = (
        'antiword utility folder' => "",
        'pdf utility folder' => "..\\xpdf\\", # Windows example
    %private = (
        'antiword utility folder' => "",
        'pdf utility folder' => "../xpdf/", # non-Windows example
-
Next, return to the FDSE Admin Page in your browser. Choose General Settings, then Binary Converters - Setup and Test.
-
The binary converters page will list all known converters and their enabled/disabled status. XPDF should be listed as enabled, because you've customized its variable. Click on "Syntax Test" to make sure all basic tests work. If they do not, adjust the settings, file paths and file permissions until they are working. Contact the Fluid Dynamics support forum if you need help.
-
After the syntax test is finished, you need to test XPDF with a real PDF file. The first step in indexing a real file is configuring FDSE to discover PDF files. Go to Admin Page => General Settings => Crawler: Ignore Links To and remove "pdf" from the extension list. Then go to General Settings => Ext and add "pdf" to the extension list. Finally, return to "Binary Converters - Setup and Test" and click the "cross-reference" link. That will verify that your general settings match the binary converters that are loaded.
-
Next, create a realm named "Binary Conversion Test" which will include the PDF files that you want to test. Don't build the realm index file - just create the realm.
-
Return to the "Binary Converters - Setup and Test" page and reload it. There will now be a link labeled "index all files". Click it to index. When you begin indexing from this page, all XPDF actions include debug output, so you can see what is going on. Confirm that text is properly extracted from your PDF files.
-
Finally, once testing is done, you may delete the "Binary Conversion Test" realm and begin to index PDF files normally.
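If the Syntax Test in the steps above fails, a small standalone script can help narrow down whether the folder setting, the file placement, or the permissions are at fault. The following is a minimal sketch, not part of FDSE: the helper name check_pdf_folder is hypothetical, and the ".exe" suffix check is an assumption about Windows builds. Run it from the search/searchdata folder so that relative paths resolve the same way they do for FDSE.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of problems found with a
# 'pdf utility folder' value, or an empty list if none were found.
# The value is used exactly as written in search.pl, so relative
# paths are resolved against the current working directory.
sub check_pdf_folder {
    my ($folder) = @_;
    return ('setting is empty') if $folder eq '';
    my @problems;
    push @problems, 'missing trailing slash'
        unless $folder =~ m{[\\/]\z};
    for my $tool ('pdfinfo', 'pdftotext') {
        # Assumption: Windows binaries carry an .exe suffix.
        my ($found) = grep { -f $_ } ("$folder$tool", "$folder$tool.exe");
        if (!defined $found) {
            push @problems, "$tool not found in '$folder'";
        }
        elsif (!-x $found) {
            push @problems, "$found is not executable (try chmod 755)";
        }
    }
    return @problems;
}

# Pass the folder value as the first argument, or fall back to the
# non-Windows example value from the steps above.
my $folder = @ARGV ? $ARGV[0] : '../xpdf/';
my @problems = check_pdf_folder($folder);
if (@problems) {
    print "PROBLEM: $_\n" for @problems;
}
else {
    print "OK: pdfinfo and pdftotext found in $folder\n";
}
```

For example, "perl check_pdf_folder.pl ../xpdf/" (the script name is arbitrary) prints one line per problem found.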
When using XPDF, all PDF headers are converted to HTML META tags. The PDF "keywords" attribute will be mapped to the "keywords" HTML META tag. The PDF "title" header, if present, will be mapped to the HTML <title>. If the PDF title is missing, as it often is, then FDSE will apply its rules for parsing HTML files without titles, usually using the file name itself as the title.
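The mapping described above can be pictured with a short sketch. This is not FDSE's actual code: the function names are hypothetical, and it assumes pdfinfo prints one "Name: value" pair per line (for example "Title:" and "Keywords:"). It emits an HTML title when one is present and a META tag for every other non-empty field.

```perl
use strict;
use warnings;

# Minimal HTML escaping for text placed in tags and attribute values.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Parse "Name:   value" lines into a hash keyed by lower-cased name.
sub parse_pdfinfo {
    my ($output) = @_;
    my %info;
    for my $line (split /\r?\n/, $output) {
        next unless $line =~ /^([^:]+):\s*(.*?)\s*$/;
        $info{lc $1} = $2;
    }
    return %info;
}

# Build the <title> and <meta> lines for one document.
sub pdfinfo_to_html_head {
    my ($output) = @_;
    my %info = parse_pdfinfo($output);
    my $html = '';
    if (defined $info{title} && $info{title} ne '') {
        $html .= '<title>' . html_escape($info{title}) . "</title>\n";
    }
    for my $name (sort keys %info) {
        next if $name eq 'title' || $info{$name} eq '';
        $html .= sprintf qq{<meta name="%s" content="%s">\n},
            html_escape($name), html_escape($info{$name});
    }
    return $html;
}

# Made-up sample in the assumed pdfinfo output format.
my $sample = "Title:          Annual Report\n"
           . "Keywords:       budget, Q&A\n"
           . "Pages:          12\n";
print pdfinfo_to_html_head($sample);
```

When the title field is missing or empty, the sketch simply emits no title element, which corresponds to the case where FDSE falls back to its rules for untitled HTML files.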
The installation of XPDF is not automatically covered under the FDSE free custom install policy. However, on request, I will try to install and test XPDF for all customers who have purchased a registration key, and for others on a case-by-case basis. Note that XPDF might not be available for all platforms.
Known Problems: things to keep in mind if you have trouble:
-
Parsing a PDF file is resource-intensive and slow. A 3 MB test file took 31 seconds to parse. Indexing 100 such files would take about 52 minutes.
-
XPDF may crash with a memory error if it is passed an invalid PDF file. This is mostly just an annoyance, but on Windows 2000 it will cause pop-up error messages to accumulate on the console.
-
The "Max Characters: File" setting causes most documents to only be read through the first 64,000 characters. This is smaller than most PDF files, and sending a truncated PDF file to XPDF will cause it to crash. FDSE works around this problem for the majority of cases by ignoring the "Max Characters: File" setting for files which have the ".pdf" extension. However, if you are retrieving PDF files from the web and the document URL does not end in ".pdf", then you may experience this problem. You can work around it by setting "Max Characters: File" to 0 to bypass truncation, or by setting it to a sufficiently large value.
-
FDSE cannot distinguish between a valid response from pdftotext and an invalid response (like "unable to parse PDF file"). If XPDF is unable to parse a file, it will still be indexed, using its filename as the title, and using "..." as the description. If you find records like this in your search results, you may re-index them using the Binary Conversion Test so that all error and status information is shown. That will provide a clue as to what the problem is. If you are unable to resolve the problem, you may delete the record from the index.
-
The web crawler will attempt PDF-to-text conversion on only those documents which return the Content-Type "application/pdf". If the PDF files are not returning an accurate Content-Type header, then they will not be processed properly.
-
PDF files can contain a mix of inlined images and computer-readable formatted text. FDSE is only able to "read" the formatted text, and that is with the help of the XPDF toolkit (which strips formatting and may mangle some words in non-Latin languages). Neither FDSE nor the XPDF toolkit can read text that is stored inside the inlined images. Thus, image-based PDF files, particularly faxes that have been saved to PDF format, cannot be meaningfully searched because they contain only inline image content, and no computer-readable formatted text.
-
XPDF will not convert PDF files that are password-locked or otherwise secured against text extraction.
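Several of the problems above come down to XPDF being handed something that is not a complete PDF file. A heuristic pre-flight check can flag the obvious cases before indexing. The sketch below is an illustration only, not part of FDSE or XPDF, and the function name pdf_looks_broken is hypothetical. It relies on two conventions of the PDF file format: a "%PDF-" header near the start of the file and a "%%EOF" marker near the end. Passing the check does not guarantee that pdftotext will succeed, and the check cannot detect password-locked or image-only files.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_END);

# Hypothetical helper: return a short reason string when the file looks
# unusable (not a PDF, or truncated), or '' when it passes both checks.
sub pdf_looks_broken {
    my ($path) = @_;
    open(my $fh, '<', $path) or return "cannot open: $!";
    binmode $fh;
    my $size = -s $fh;

    # The header should appear within the first 1024 bytes.
    my $got = read($fh, my $head, 1024);
    return 'cannot read file' unless defined $got;
    return 'no %PDF- header near start' unless $head =~ /%PDF-\d/;

    # The end-of-file marker should appear within the last 1024 bytes.
    my $tail_len = $size < 1024 ? $size : 1024;
    seek($fh, -$tail_len, SEEK_END) or return "cannot seek: $!";
    $got = read($fh, my $tail, $tail_len);
    return 'cannot read file' unless defined $got;
    close $fh;
    return 'no %%EOF marker near end (possibly truncated)'
        unless $tail =~ /%%EOF/;
    return '';
}

# Check every file named on the command line.
for my $path (@ARGV) {
    my $reason = pdf_looks_broken($path);
    my $status = $reason eq '' ? 'ok' : "suspect ($reason)";
    print "$path: $status\n";
}
```

The 1024-byte windows are a tolerance chosen for this sketch, since some otherwise valid files carry a few extra bytes before the header or after the end marker.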
See "Maintain HTML copies of each written binary document" for another way to index PDF content.
See also "File pathname restrictions when using binary converters".
http://www.xav.com/scripts/search/help/1181.html