Searching PDF files
This help file is no longer current; it has been replaced by the content in:
The previous content of this help file is listed below for those who need to reference it.
There are three ways to search PDF files with this search engine.
-
Maintain Parallel HTML and PDF Files - Recommended
On your site, provide an equivalent HTML file for each original PDF. Always link to both. In addition, in the header of the HTML file, link to the original PDF.
When FDSE indexes your site, it will select the HTML version and index all keywords associated with it. When visitors bring up this file in the search results, they will see the HTML version. From that HTML version, in the header, they will see your link to the PDF original and so they will be able to view or download it if they desire.
This solution is recommended because your files will be accessible to the widest audience, including visitors who are not able to handle PDF files. In addition, this solution will make your files visible to most search engines. No FDSE code or settings changes will need to be made.
-
Index PDF Files with Xpdf
You can install the free Xpdf toolkit and tell FDSE to use it. The search engine will then automatically perform PDF-to-text conversion on each file it encounters, and index the resulting text. Search results for that text will point directly to the original PDF file.
In this scenario, you must install the Xpdf toolkit and customize the FDSE code and settings. Details are provided below.
-
Index PDF Files without Xpdf
You can create your own private text-only versions of each PDF file, and you can make FDSE aware of these equivalent forms. FDSE will then index the text-only files that it understands, but search results for that text will point directly to the original PDF file.
In this scenario, you must perform your own PDF-to-text conversion, and you need to make a few customizations to the output text. Details are provided below.
2. Details - Indexing PDF Files with Xpdf
The Fluid Dynamics Search Engine can search PDF files when used with a helper utility, the Xpdf package from www.foolabs.com/xpdf. This is a package of free C++ programs that run on most operating systems.
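As a rough illustration of what the helper utility does, the sketch below runs pdftotext on one file and captures the text. It is written in Python rather than FDSE's own Perl and is not FDSE's code. The function names, the timeout, and the output decoding are assumptions for illustration. The only Xpdf behavior relied on is that pdftotext takes the PDF file as an argument and writes to standard output when the text file is given as "-".

```python
import subprocess


def pdftotext_command(folder, pdf_path):
    """Build the command line; folder is assumed to end with a path separator."""
    # "-" as the output file asks pdftotext to write the text to stdout.
    return [folder + "pdftotext", pdf_path, "-"]


def pdf_to_text(folder, pdf_path, timeout=120):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    result = subprocess.run(
        pdftotext_command(folder, pdf_path),
        capture_output=True,
        timeout=timeout,  # parsing large PDF files can be slow
        check=True,       # raise CalledProcessError on a non-zero exit status
    )
    # The output encoding depends on the Xpdf build and options, so undecodable
    # bytes are replaced instead of raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

Because the folder and the program name are simply concatenated, a missing trailing slash would produce a path like "/x/xpdfpdftotext", which shows why the trailing slash matters when a program shells out to a folder.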
Follow these steps to integrate Xpdf and FDSE:
-
Make sure you have FDSE version 2.0.0.0046 or newer.
-
Download the xpdf version appropriate to your web server's operating system. Transfer the executable files (specifically pdfinfo and pdftotext) to a folder on your web server. (Your web server's operating system is listed on the FDSE General Settings page if you need it.)
If your search engine is running on Unix, you will need to chmod the pdfinfo and pdftotext executables to "all read/execute" or "755". Files on Windows servers do not need the chmod command.
-
Open the main FDSE script, search.pl or search.cgi. Scroll down about 50 lines, and find the line labeled:
%private = ( 'pdf utility folder' => "",
-
Enter the absolute path to the xpdf folder. Because FDSE will shell to this folder, you must include the trailing slash, and you must use the slash convention appropriate to your web server's operating system, i.e. "x:\\xpdf\\" on Windows and "/x/xpdf/" on Unix. (The double backslash \\ is used because \ is a control character in Perl and must be escaped.) Examples:
%private = ( 'pdf utility folder' => "x:\\xpdf\\", # Windows
%private = ( 'pdf utility folder' => "/x/xpdf/", # Unix
-
After editing the search script, return it to the web server and test your changes by making some normal search requests. If you see a Perl execution error, confirm that your changes use the correct syntax with matched quotations, etc.
-
From the FDSE Admin Page, edit the General Setting "Ext" by adding the "pdf" file extension to the list. Next, edit the General Setting "Crawler: Ignore Links To" by removing the "pdf" file extension from that list. Confirm that the "AllowBinaryFiles" General Setting is checked.
-
If you have FDSE version 2.0.0.0056 or newer, go to Admin Page => General Settings => View all system information => XPDF Folder: Test. This will take you to a set of routines which test the interoperability. If you have sockets enabled, you can do a live test by indexing an IRS document in PDF format. This test page has been designed to detect all commonly-reported problems.
To test the system using any version of FDSE, simply try to index a PDF file. If all the text appears properly, then things are probably working. If there are problems, you can try to index a file with the "debug=1" flag. For example:
search.pl?Mode=Admin&Action=AddURL&URL=http://xav.com/scripts/search/pl2000.pdf&debug=1
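Before testing through FDSE, the installation steps above can be checked by hand. The following is a minimal sketch in Python, not part of FDSE, that looks for the most likely mistakes: a missing trailing slash, missing executables, or missing execute permission on Unix. The function name and the ".exe" suffix on Windows are assumptions.

```python
import os

REQUIRED_TOOLS = ("pdfinfo", "pdftotext")  # the two Xpdf programs named above


def check_xpdf_folder(folder, is_windows=(os.name == "nt")):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    sep = "\\" if is_windows else "/"
    if not folder.endswith(sep):
        problems.append(f"path must end with a trailing {sep!r}")
    for tool in REQUIRED_TOOLS:
        # Assumption: the Windows builds are named pdfinfo.exe and pdftotext.exe.
        name = tool + ".exe" if is_windows else tool
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            problems.append(f"{name} not found in {folder}")
        elif not is_windows and not os.access(path, os.X_OK):
            problems.append(f"{name} is not executable; try chmod 755")
    return problems


if __name__ == "__main__":
    # Example path from the text above; adjust it to your own server.
    for problem in check_xpdf_folder("/x/xpdf/"):
        print("problem:", problem)
```

An empty result only means these particular checks passed. The built-in test page and the "debug=1" flag remain the better tests of the whole chain.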
FDSE converts all PDF headers to META tags. The PDF "keywords" attribute will be mapped to the "keywords" HTML META tag. The PDF "title" header, if present, will be mapped to the HTML <title>. If the PDF title is missing, as it often is, then FDSE will apply its rules for parsing HTML files without titles, usually using the file name itself as the title.
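The mapping can be pictured with the short Python sketch below. It is an illustration only, not FDSE's Perl code. It assumes the headers arrive as "Key: value" lines in the style printed by pdfinfo, and it simplifies the missing-title rule to "use the file name". The function names and the title_field parameter are hypothetical.

```python
import html
import posixpath
from urllib.parse import urlparse


def parse_pdfinfo(output):
    """Turn 'Key:   value' lines into a dict with lower-cased keys."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and value.strip():
            info[key.strip().lower()] = value.strip()
    return info


def html_head(info, url, title_field="title"):
    """Build <title> and META lines from the parsed PDF headers."""
    # Simplified fallback: use the file name when the chosen header is missing.
    fallback = posixpath.basename(urlparse(url).path)
    title = info.get(title_field) or fallback
    lines = [f"<title>{html.escape(title)}</title>"]
    for name, value in info.items():
        if name != title_field:
            lines.append(
                f'<meta name="{html.escape(name)}" content="{html.escape(value)}">'
            )
    return lines
```

Passing title_field="subject" takes the title from the "subject" header instead, which suits collections whose descriptions are stored there.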
The installation of xpdf is not automatically covered under the FDSE free custom install policy. However, on request, I will try to install and test xpdf for all customers who have purchased a registration key, and for others on a case-by-case basis. Note that xpdf might not be available for all platforms.
Known Problems: things to keep in mind if you have trouble:
-
Parsing a PDF file is resource-intensive and slow. A 3 MB test file took 31 seconds to parse. Indexing 100 such files would take about an hour.
-
xpdf may crash with a memory error if it is passed an invalid PDF file. This is mostly just an annoyance, but on Windows 2000 it will cause pop-up error messages to accumulate on the console.
-
The "Max Characters: File" setting causes most documents to only be read through the first 64,000 characters. This is smaller than most PDF files, and sending a truncated PDF file to xpdf will cause it to crash. FDSE works around this problem for the majority of cases by ignoring the "Max Characters: File" setting for files which have the ".pdf" extension. However, if you are retrieving PDF files from the web and the document URL does not end in ".pdf", then you may experience this problem. You can work around it by setting "Max Characters: File" to 0 to bypass truncation, or by setting it to a sufficiently large value.
-
FDSE cannot distinguish between a valid response from pdftotext and an invalid response (like "unable to parse PDF file"). In most cases, the General Setting "Minimum Page Size" will cause FDSE to ignore pages which return short error messages, but there remains an outside chance that inaccurate information will be indexed as valid data.
-
The web crawler will attempt PDF-to-text conversion on only those documents which return the Content-Type "application/pdf". If the PDF files are not returning an accurate Content-Type header, then they will not be processed properly.
-
PDF files can contain a mix of inlined images and computer-readable formatted text. FDSE is only able to "read" the formatted text, and that is with the help of the xpdf toolkit (which strips formatting and may mangle some words in non-Latin languages). Neither FDSE nor the xpdf toolkit can read text that is stored inside the inlined images. Thus, image-based PDF files, particularly faxes that have been saved to PDF format, cannot be meaningfully searched because they contain only inline image content, and no computer-readable formatted text.
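Several of the rules in this list reduce to simple checks. The Python sketch below restates them for illustration; it is not FDSE's code. The function names are hypothetical, the Content-Type comparison ignoring case and parameters is an assumption, and "Minimum Page Size" is assumed to be measured in characters.

```python
from urllib.parse import urlparse

MAX_CHARS_FILE = 64_000  # the "Max Characters: File" value quoted above


def should_convert(content_type):
    """Only responses labeled application/pdf are sent for conversion."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"


def read_limit(url, max_chars=MAX_CHARS_FILE):
    """Return how many characters to read, or None for no truncation."""
    if max_chars == 0 or urlparse(url).path.lower().endswith(".pdf"):
        return None
    return max_chars


def too_short_to_index(text, min_page_size):
    """Short converter error messages fall below the minimum page size."""
    return len(text) < min_page_size


def estimate_minutes(file_count, seconds_per_file=31):
    """Indexing time based on the 31-second measurement for a 3 MB file."""
    return file_count * seconds_per_file / 60
```

For example, estimate_minutes(100) works out to 3100 seconds, or roughly 52 minutes, which is the "about an hour" figure given above. A URL such as http://example.com/get?id=7 keeps the 64,000-character limit even if it returns a PDF, which is the case that needs the "Max Characters: File" workaround.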
Coding: handling of PDF files is controlled by subroutine convert_pdf_to_text which is found in the "searchmods/common_parse_page.pl" library. It is called from subroutines webrequest and pagedata_from_file.
If all of your PDF files tend to have their descriptions stored in the "subject" PDF header, rather than the "title" header, you may want to edit convert_pdf_to_text to pull the HTML title from the "subject" header instead.
History: support for Xpdf integration was added with FDSE version 2.0.0.0046.
In FDSE versions 2.0.0.0046 through 0052, the "pdf utility folder" information was stored in the %const hash. With the 0053 release, it was moved to the new %private hash. Users of those older versions should continue to modify the %const hash.
Special thanks are due to Derek B. Noonburg for creating xpdf and distributing it for free; and to Andrew Mossberg for telling me about the product, after I had given up all hope of ever parsing PDF.
3. Details - Indexing PDF Files without Xpdf
Take these steps to search PDF files based on private text-only versions of those files:
-
First, create your own text version of each PDF file on your site. There are a variety of utilities for converting PDF content to text. You may be able to cut-and-paste the content directly into a text file.
Don't worry about converting to HTML - just dump all of the text into a file, with or without HTML tags.
-
Edit the resulting text/HTML file by adding a title tag:
<title>My Title of My Document</title>
... cut-n-paste text from original PDF ...
... more cut-n-paste text ...
... more cut-n-paste text ...
-
Next, upload this file to your web server. You can place the file anywhere, but as a standard, you may want to place it in the same folder as the original PDF, with an identical filename but different extension, like MyDocument_pdf.htm for PDF file MyDocument.pdf.
-
Next, take steps to ensure that this HTML file will be found when you build your search index, and that the original PDF will not be found.
For a realm type of "Website Realm - File System Crawler", simply place the HTML file within the main folder to ensure it will be found. Make sure that the General Setting "Ext" includes the "htm" extension and does not include the "pdf" extension.
For other website realms, you may want to use a seed file which links to the HTML file. This technique is described in the help topic "Indexing pages which aren't linked from other pages". That will ensure that the HTML file is found. To ensure that the original PDF file is not found, make sure that the "pdf" extension is included in the General Setting "Crawler: Ignore Links To".
-
Next, you must create an output filter to map your private HTML files over to the PDF version of the URL. To do this:
Confirm that you are running at least FDSE version 2.0.0.0054, the first to support output filters.
Go to Admin Page => Filter Rules => (scroll down) => URL Rewrite Rules.
-
Scroll down to the "Output Filters" section at the bottom of the page.
In one of the open slots labeled "new rewrite rule", create a new rule. If you use a consistent naming convention, you may create just a single rule that applies to all documents, using these parameters:
Enabled: [x] (checked)
Verbose: [_] (not checked)
Pattern: _pdf.htm
Replace: .pdf
Comment: force PDF URL for all
Then click "Save Data".
If you do not use a consistent naming convention, you will need a separate output filter rule for each URL:
Enabled: [x] (checked)
Verbose: [_] (not checked)
Pattern: http://mysite.tld/folder/MyDocument_pdf.htm
Replace: http://mysite.tld/folder/MyDocument.pdf
Comment: force PDF URL for http://mysite.tld/folder/MyDocument.pdf
Customize the URLs in the Pattern, Replace, and Comment fields to match your site.
-
Finally, rebuild your search index.
The indexer will encounter the HTML file and index all of the keywords that it finds within. Because of the output filter, the search result links will point to the PDF file, instead of the original HTML.
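The naming convention and the single rewrite rule can be sketched in Python as follows. This is an illustration only and does not show how FDSE applies its filters: the pattern is treated as a literal substring, which is an assumption, and the function names are hypothetical. Escaping the pasted text is one choice among several, since the file may be written with or without HTML tags.

```python
import html
import posixpath


def standin_name(pdf_name):
    """MyDocument.pdf -> MyDocument_pdf.htm, the convention suggested above."""
    stem, ext = posixpath.splitext(pdf_name)
    if ext.lower() != ".pdf":
        raise ValueError(f"not a PDF file name: {pdf_name}")
    return f"{stem}_pdf.htm"


def standin_body(title, text):
    """Wrap the extracted text with the title tag used by the indexer."""
    return f"<title>{html.escape(title)}</title>\n{html.escape(text)}\n"


def apply_output_filter(url, pattern="_pdf.htm", replace=".pdf"):
    """Literal substring replacement, mirroring the Pattern and Replace fields."""
    return url.replace(pattern, replace)
```

With a consistent convention, every stand-in URL maps back to its PDF through the one rule, while URLs that do not contain "_pdf.htm" pass through unchanged.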
Updated 2003-06-02
http://www.xav.com/scripts/search/help/1092.html