
Indexing PDF files using runtime conversion with XPDF

This document describes how to index PDF files by automatically extracting their text. Perl has no native routines for extracting text from PDF, so FDSE must shell out to a separate helper application. FDSE uses the free XPDF package for this, specifically its pdfinfo and pdftotext utilities.

Steps to enable:

  1. Make sure you're running FDSE version or newer.

  2. Begin by going to FDSE Admin Page => General Settings. Your server operating system will be listed at the top of the page, in the sentence "Web server operating system is MSWin32". Typical operating systems are MSWin32, linux, and FreeBSD.

  3. Next, go to the XPDF web site http://www.foolabs.com/xpdf/. Go to the downloads page and find a binary distribution that matches your server operating system.

    (If your operating system is not listed, you can transfer the source code to your web server and compile it there, building your own binary. The steps involved are beyond the scope of this help file, though the XPDF distribution itself includes instructions on how to compile.)

  4. Transfer the binaries from the XPDF web site down to your local computer, and then upload the binaries to your web server. You can place them anywhere, but a good location is in an "xpdf" subfolder of the main "search" folder. Your directory structure would then look like:

    /search/xpdf/                # new folder
    /search/xpdf/pdfinfo         # new file
    /search/xpdf/pdftotext       # new file

  5. Apply read-execute permissions (chmod 755) to the pdfinfo and pdftotext utilities.

  6. Next, you need to tell FDSE that XPDF is available. Edit the file /search/search.pl and find the lines:

    %private = (
    	'antiword utility folder'  => "",
    	'pdf utility folder' => "",

    Enter the full or relative path to the XPDF folder between the quotes on the 'pdf utility folder' line. The "search/searchdata" folder is always the current working directory for this process, so build relative paths from that folder. Include a trailing slash. Use the path separator of your operating system: a backslash on Windows, written as \\ inside the double-quoted Perl string, and / on all others.

    %private = (
    	'antiword utility folder'  => "",
    	'pdf utility folder' => "..\\xpdf\\", # Windows example

    %private = (
    	'antiword utility folder'  => "",
    	'pdf utility folder' => "../xpdf/",   # non-Windows example

  7. Next, return to the FDSE Admin Page in your browser. Choose General Settings, then Binary Converters - Setup and Test.

  8. The binary converters page will list all known converters and their enabled/disabled status. XPDF should be listed as enabled, because you've customized its variable. Click on "Syntax Test" to make sure all basic tests work. If they do not, adjust the settings, file paths, and file permissions until they do. Contact the Fluid Dynamics support forum if you need help.

  9. After the syntax test is finished, you need to test XPDF with a real PDF file. The first step in indexing a real file is configuring FDSE to discover PDF files. Go to Admin Page => General Settings => Crawler: Ignore Links To and remove "pdf" from the extension list. Then go to General Settings => Ext and add "pdf" to the extension list. Finally, return to "Binary Converters - Setup and Test" and click the "cross-reference" link. That will verify that your general settings match the binary converters that are loaded.

  10. Next, create a realm named "Binary Conversion Test" which will include the PDF files that you want to test. Don't build the realm index file - just create the realm.

  11. Return to the "Binary Converters - Setup and Test" page and reload it. There will now be a link labeled "index all files". Click it to index. When you begin indexing from this page, all XPDF actions include debug output, so you can see what is going on. Confirm that text is properly extracted from your PDF files.

  12. Finally, once testing is done, you may delete the "Binary Conversion Test" realm and begin to index PDF files normally.
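
Steps 4 through 6 can be double-checked from a shell before running the FDSE syntax test. The following is a minimal sketch, assuming a POSIX shell on a non-Windows server. The function names check_xpdf_dir and preview_pdf_text are hypothetical helpers written for this page, not part of FDSE or XPDF.

```shell
#!/bin/sh
# Hypothetical helpers for checking an XPDF install by hand.
# Neither function is part of FDSE or XPDF.

# Report whether pdfinfo and pdftotext exist and are executable in a folder.
# $1 is the same path you would enter for 'pdf utility folder'.
check_xpdf_dir() {
    dir=${1%/}          # tolerate the trailing slash that FDSE expects
    rc=0
    for tool in pdfinfo pdftotext; do
        if [ ! -f "$dir/$tool" ]; then
            echo "missing: $dir/$tool"
            rc=1
        elif [ ! -x "$dir/$tool" ]; then
            echo "not executable: $dir/$tool (try chmod 755)"
            rc=1
        else
            echo "ok: $dir/$tool"
        fi
    done
    return "$rc"
}

# Print the first lines of text that pdftotext extracts from one PDF.
# Passing "-" as the output file sends the text to standard output.
preview_pdf_text() {
    dir=${1%/}
    "$dir/pdftotext" "$2" - | head -n 20
}
```

Because FDSE resolves relative paths from "search/searchdata", run the check from that folder, for example `cd search/searchdata` followed by `check_xpdf_dir ../xpdf/`. A call such as `preview_pdf_text ../xpdf/ sample.pdf` (where sample.pdf is a placeholder for one of your own files) then shows whether text comes out at all.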

When using XPDF, all PDF headers are converted to HTML META tags. The PDF "keywords" attribute will be mapped to the "keywords" HTML META tag. The PDF "title" header, if present, will be mapped to the HTML <title>. If the PDF title is missing, as it often is, then FDSE will apply its rules for parsing HTML files without titles, usually using the file name itself as the title.
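
The title and keywords mapping described above can be illustrated with a small filter over pdfinfo-style output. This is a sketch only: FDSE performs the conversion internally, and the function name pdfinfo_to_html_head, the use of awk, and the exact tag layout are assumptions made for illustration. It handles only the two fields named above and ignores every other line.

```shell
#!/bin/sh
# Hypothetical filter, not FDSE code: reads "Key:   value" lines in the style
# printed by pdfinfo and writes an HTML <title> and a keywords META tag.
pdfinfo_to_html_head() {
    awk '
        # Escape characters that are special in HTML text and attribute values.
        function esc(s) {
            gsub(/&/, "\\&amp;", s)
            gsub(/</, "\\&lt;", s)
            gsub(/>/, "\\&gt;", s)
            gsub(/"/, "\\&quot;", s)
            return s
        }
        {
            pos = index($0, ":")          # split at the first colon only
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            sub(/^[ \t]+/, "", val)       # drop the padding after the colon
            if (val == "") next           # an empty header produces no tag
            if (key == "Title")
                printf "<title>%s</title>\n", esc(val)
            else if (key == "Keywords")
                printf "<meta name=\"keywords\" content=\"%s\">\n", esc(val)
        }
    '
}
```

A typical use would be `../xpdf/pdfinfo sample.pdf | pdfinfo_to_html_head`, with sample.pdf standing in for a real file. When the input has no Title line, the filter prints no <title>, which corresponds to the case where FDSE falls back to its rules for untitled files.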

The installation of XPDF is not automatically covered under the FDSE free custom install policy. However, on request, I will try to install and test XPDF for all customers who have purchased a registration key, and for others on a case-by-case basis. Note that XPDF might not be available for all platforms.

Known Problems: things to keep in mind if you have trouble:

See "Maintain HTML copies of each written binary document" for another way to index PDF content.

See also "File pathname restrictions when using binary converters".
