Searching PDF files

This help file is no longer current and has been replaced by newer content. The previous content is listed below for those who need to reference it.


There are three ways to search PDF files with this search engine.

  1. Maintain Parallel HTML and PDF Files - Recommended

    On your site, provide an equivalent HTML file for each original PDF. Always link to both. In addition, in the header of the HTML file, link to the original PDF.

    When FDSE indexes your site, it will select the HTML version and index all keywords associated with it. When visitors bring up this file in the search results, they will see the HTML version. From that HTML version, in the header, they will see your link to the PDF original and so they will be able to view or download it if they desire.

    This solution is recommended because your files will be accessible to the widest audience, including visitors who are not able to handle PDF files. In addition, this solution will make your files visible to most search engines. No changes to the FDSE code or settings are needed.

  2. Index PDF Files with Xpdf

    You can install the free Xpdf toolkit and tell FDSE to use it. The search engine will then automatically perform PDF-to-text conversion on each file it encounters, and index the resulting text. Search results for that text will point directly to the original PDF file.

    In this scenario, you must install the Xpdf toolkit and customize the FDSE code and settings. Details are provided below.

  3. Index PDF Files without Xpdf

    You can create your own private text-only versions of each PDF file, and you can make FDSE aware of these equivalent forms. FDSE will then index the text-only files that it understands, but search results for that text will point directly to the original PDF file.

    In this scenario, you must perform your own PDF-to-text conversion, and you need to make a few customizations to the output text. Details are provided below.


2. Details - Indexing PDF Files with Xpdf

The Fluid Dynamics Search Engine can search PDF files when used with a helper utility, the Xpdf package from www.foolabs.com/xpdf. This is a package of free C++ programs that run on most operating systems.

Follow these steps to integrate Xpdf and FDSE:

  1. Make sure you have FDSE version 2.0.0.0046 or newer.

  2. Download the xpdf version appropriate to your web server's operating system. Transfer the executable files (specifically pdfinfo and pdftotext) to a folder on your web server. (Your web server's operating system is listed on the FDSE General Settings page if you need it.)

    If your search engine is running on Unix, you will need to chmod the pdfinfo and pdftotext executables to "all read/execute" or "755". Files on Windows servers do not need the chmod command.

  3. Open the main FDSE script, search.pl or search.cgi. Scroll down about 50 lines, and find the line labeled:

    %private = (
    'pdf utility folder' => "",

  4. Enter the absolute path to the xpdf folder. Because FDSE shells out to the programs in this folder, you must include the trailing slash, and you must use the slash convention appropriate to your web server's operating system, e.g. "x:\\xpdf\\" on Windows and "/x/xpdf/" on Unix. (The double backslash \\ is used because \ is the escape character in Perl strings and must itself be escaped.) Examples:

    %private = (
    'pdf utility folder' => "x:\\xpdf\\", # Windows

    %private = (
    'pdf utility folder' => "/x/xpdf/", # Unix

  5. After editing the search script, upload it back to the web server and test your changes by making some normal search requests. If you see a Perl execution error, confirm that your changes use the correct syntax, with matched quotation marks and so on.

  6. From the FDSE Admin Page, edit the General Setting "Ext" by adding the "pdf" file extension to the list. Next, edit the General Setting "Crawler: Ignore Links To" by removing the "pdf" file extension from that list. Confirm that the "AllowBinaryFiles" General Setting is checked.

  7. If you have FDSE version 2.0.0.0056 or newer, go to Admin Page => General Settings => View all system information => XPDF Folder: Test. This will take you to a set of routines which test the interoperability. If you have sockets enabled, you can do a live test by indexing an IRS document in PDF format. This test page has been designed to detect all commonly-reported problems.

    To test the system using any version of FDSE, simply try to index a PDF file. If all the text appears properly, then things are probably working. If there are problems, you can try to index a file with the "debug=1" flag. For example:

    search.pl?Mode=Admin&Action=AddURL&URL=http://xav.com/scripts/search/pl2000.pdf&debug=1
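Before editing the search script, it can help to sanity-check the folder value from steps 2 through 4. The following is a minimal sketch in Python, not part of FDSE: `check_pdf_utility_folder` and `make_executable` are hypothetical helper names, and the sketch assumes the tools are named pdfinfo and pdftotext (with an .exe suffix on Windows) and that the folder path is simply prefixed to the program name.

```python
import os

# Hypothetical helper, not part of FDSE: checks a candidate value for
# 'pdf utility folder' before it is pasted into the search script.
REQUIRED_TOOLS = ("pdfinfo", "pdftotext")  # executables named in step 2


def check_pdf_utility_folder(folder, windows=False):
    """Return a list of problems found; an empty list means the folder looks OK."""
    problems = []
    sep = "\\" if windows else "/"
    if not folder.endswith(sep):
        problems.append("path must end with a trailing separator (" + sep + ")")
    # Assumption: the folder string is prefixed directly to the program name.
    base = folder if folder.endswith(sep) else folder + sep
    for tool in REQUIRED_TOOLS:
        name = tool + ".exe" if windows else tool
        path = base + name
        if not os.path.isfile(path):
            problems.append("missing executable: " + name)
        elif not windows and not os.access(path, os.X_OK):
            problems.append("not executable (try chmod 755): " + name)
    return problems


def make_executable(path):
    """Same effect as 'chmod 755': rwx for the owner, r-x for group and others."""
    os.chmod(path, 0o755)
```

For example, `check_pdf_utility_folder("/x/xpdf/")` returns an empty list only when both programs exist in that folder and are executable. On Unix, calling `make_executable` on each program clears the "not executable" messages.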

FDSE converts all PDF headers to META tags. The PDF "keywords" attribute will be mapped to the "keywords" HTML META tag. The PDF "title" header, if present, will be mapped to the HTML <title>. If the PDF title is missing, as it often is, then FDSE will apply its rules for parsing HTML files without titles, usually using the file name itself as the title.
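The header mapping described above can be illustrated with a short Python sketch. It is not FDSE's actual code (that lives in the Perl subroutine convert_pdf_to_text). It assumes pdfinfo prints one "Key: value" pair per line, and the function names and the `title_field` parameter are hypothetical.

```python
import html
import os


def parse_pdfinfo(output):
    """Turn 'Key:   value' lines of pdfinfo output into a lowercase-keyed dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def html_head(fields, filename, title_field="title"):
    """Build a <title> and keywords META tag from parsed PDF headers."""
    # Fall back to the file name when the chosen header is missing or empty.
    title = fields.get(title_field) or os.path.basename(filename)
    parts = ["<title>" + html.escape(title) + "</title>"]
    if "keywords" in fields:
        parts.append('<meta name="keywords" content="%s">'
                     % html.escape(fields["keywords"]))
    return "\n".join(parts)
```

Passing `title_field="subject"` shows the variation for sites whose PDF files keep their description in a header other than "title".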

The installation of xpdf is not automatically covered under the FDSE free custom install policy. However, on request, I will try to install and test xpdf for all customers who have purchased a registration key, and for others on a case-by-case basis. Note that xpdf might not be available for all platforms.

Known Problems: things to keep in mind if you have trouble:

Coding: handling of PDF files is controlled by subroutine convert_pdf_to_text which is found in the "searchmods/common_parse_page.pl" library. It is called from subroutines webrequest and pagedata_from_file.

If all of your PDF files tend to have their descriptions stored in the "subject" PDF header, rather than the "title" header, you may want to edit convert_pdf_to_text to pull the HTML title from the "subject" header instead.

History: support for Xpdf integration was added with FDSE version 2.0.0.0046.

In FDSE versions 2.0.0.0046 through 0052, the "pdf utility folder" information was stored in the %const hash. With the 0053 release, it was moved to the new %private hash. Users of those older versions should continue to modify the %const hash.
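The version thresholds mentioned in this help file can be summarized in a small Python sketch. It assumes every version string has the form 2.0.0.NNNN, so only the final build number is compared. The function names and dictionary keys are hypothetical.

```python
def build_number(version):
    """Return the final dotted component as an int, e.g. '2.0.0.0054' -> 54."""
    return int(version.split(".")[-1])


def pdf_features(version):
    """Report which PDF-related features a given FDSE build provides."""
    build = build_number(version)
    if build < 46:
        settings_hash = None        # no Xpdf integration before build 0046
    elif build <= 52:
        settings_hash = "%const"    # builds 0046 through 0052
    else:
        settings_hash = "%private"  # build 0053 and newer
    return {
        "xpdf_supported": build >= 46,
        "settings_hash": settings_hash,
        "output_filters": build >= 54,   # needed for indexing without Xpdf
        "xpdf_test_page": build >= 56,   # Admin Page "XPDF Folder: Test"
    }
```

For instance, `pdf_features("2.0.0.0050")` reports Xpdf support with the setting stored in %const, but no output filters and no test page.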

Special thanks are due to Derek B. Noonburg for creating xpdf and distributing it for free; and to Andrew Mossberg for telling me about the product, after I had given up all hope of ever parsing PDF.


3. Details - Indexing PDF Files without Xpdf

Take these steps to search PDF files based on private text-only versions of those files:

  1. First, create your own text version of each PDF file on your site. There are a variety of utilities for converting PDF content to text. You may be able to cut-and-paste the content directly into a text file.

    Don't worry about converting to HTML - just dump all of the text into a file, with or without HTML tags.

  2. Edit the resulting text/HTML file by adding a title tag:

    <title>My Title of My Document</title>
    ... cut-n-paste text from original PDF ...
    ... more cut-n-paste text ...
    ... more cut-n-paste text ...

  3. Next, upload this file to your web server. You can place the file anywhere, but as a standard, you may want to place it in the same folder as the original PDF, with an identical filename but different extension, like MyDocument_pdf.htm for PDF file MyDocument.pdf.

  4. Next, take steps to ensure that this HTML file will be found when you build your search index, and that the original PDF will not be found.

    For a realm type of "Website Realm - File System Crawler", simply place the HTML file within the main folder to ensure it will be found. Make sure that the General Setting "Ext" includes the "htm" extension and does not include the "pdf" extension.

    For other website realms, you may want to use a seed file which links to the HTML file. This technique is described in the help topic "Indexing pages which aren't linked from other pages". That will ensure that the HTML file is found. To ensure that the original PDF file is not found, make sure that the "pdf" extension is included in the General Setting "Crawler: Ignore Links To".

  5. Next, you must create an output filter to map your private HTML files over to the PDF version of the URL. To do this:

    1. Confirm that you are running at least FDSE version 2.0.0.0054, the first to support output filters.

    2. Go to Admin Page => Filter Rules => (scroll down) => URL Rewrite Rules.

    3. Scroll down to the "Output Filters" section at the bottom of the page.

      In one of the open slots labeled "new rewrite rule", create a new rule. If you use a consistent naming convention, you may create just a single rule that applies to all documents, using these parameters:

      Enabled: [x] (checked)
      Verbose: [_] (not checked)
      Pattern: _pdf.htm
      Replace: .pdf
      Comment: force PDF URL for all

      Then click "Save Data".

      If you do not use a consistent naming convention, you will need a separate output filter rule for each URL:

      Enabled: [x] (checked)
      Verbose: [_] (not checked)
      Pattern: http://mysite.tld/folder/MyDocument_pdf.htm
      Replace: http://mysite.tld/folder/MyDocument.pdf
      Comment: force PDF URL for http://mysite.tld/folder/MyDocument.pdf

      Customize the URLs in the Pattern, Replace, and Comment fields to match your site.

  6. Finally, rebuild your search index.

    The indexer will encounter the HTML file and index all of the keywords that it finds within. Because of the output filter, the search result links will point to the PDF file instead of the HTML file.
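The naming convention and the single output filter rule from the steps above can be sketched in Python. This is an illustration only, not FDSE's implementation. It assumes the rule's Pattern is applied as a plain substring replacement, which this help file does not confirm, and all three function names are hypothetical.

```python
import html


def companion_name(pdf_name):
    """MyDocument.pdf -> MyDocument_pdf.htm (the convention suggested in step 3)."""
    if not pdf_name.lower().endswith(".pdf"):
        raise ValueError("expected a .pdf file name: " + pdf_name)
    return pdf_name[:-4] + "_pdf.htm"


def companion_html(title, text):
    """Prepend the <title> tag from step 2; text is treated as plain text."""
    return "<title>" + html.escape(title) + "</title>\n" + html.escape(text) + "\n"


def apply_output_filter(url, pattern="_pdf.htm", replace=".pdf"):
    """Mimic the single rewrite rule from step 5 (assumed substring match)."""
    return url.replace(pattern, replace)
```

With these defaults, a search result URL ending in MyDocument_pdf.htm is rewritten to end in MyDocument.pdf, while URLs that do not contain the pattern pass through unchanged.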

Updated 2003-06-02


    "Searching PDF files"
    http://www.xav.com/scripts/search/help/1092.html