Maintain HTML copies of each written binary document
Overview
Maintain an HTML copy of every written binary document, such as a PDF or Word file. Store each copy alongside its binary, and link to both formats. The benefits of this approach are:
Many web users are unwilling or unable to view binary files. Maintaining HTML copies is a great way to support these users, while still publishing in rich binary formats for the users who prefer them.
Many search engine crawlers cannot read binary formats. Adding an HTML version of each file increases the number of search engines that can index your content.
You can configure FDSE to link to both HTML and binary content in the search results.
You have full control over the searchable title, description, keywords, and text.
This help file describes just one way to search binary files; see Searching binary files for others.
Steps to enable:
1. Begin with a consistent naming convention whereby every binary document has an HTML equivalent with exactly the same name plus an additional ".html" extension.
A directory listing would look like this:
    Board_for_Social_Responsibility_2001.pdf
    Board_for_Social_Responsibility_2001.pdf.html
    Tips_for_Exercise_and_Motivation.doc
    Tips_for_Exercise_and_Motivation.doc.html
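The naming convention in step 1 is easy to check automatically. The following is a minimal Python sketch, not part of FDSE, that lists binaries without an HTML copy next to them. The function name and the set of extensions are assumptions made for illustration; adjust them to the formats you publish.

```python
from pathlib import Path

# Assumed set of binary extensions; extend it for other formats you publish.
BINARY_EXTENSIONS = {".pdf", ".doc"}


def find_missing_html_copies(directory):
    """Return the names of binary files that have no '<name>.html' sibling."""
    folder = Path(directory)
    missing = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in BINARY_EXTENSIONS:
            # The HTML copy keeps the full binary name and appends ".html".
            html_copy = path.with_name(path.name + ".html")
            if not html_copy.is_file():
                missing.append(path.name)
    return missing
```

Running this against a document folder before each re-index shows which HTML copies still need to be written.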
2. Next, use a consistent linking policy. Every time you link to a binary document, include a link to the HTML equivalent. For example:
    <a href="Board_for_Social_Responsibility_2001.pdf">
    Board_for_Social_Responsibility_2001.pdf
    </a>
    (<a href="Board_for_Social_Responsibility_2001.pdf.html">HTML format</a>)
An example of such links can be seen here.
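If your pages are generated by a script, the linking policy in step 2 can be applied in one place. The sketch below assumes a hypothetical helper named `binary_link_pair`; it is an illustration rather than anything FDSE provides, and it assumes file names that need no URL encoding.

```python
from html import escape


def binary_link_pair(binary_name):
    """Build the binary link followed by its "(HTML format)" companion link."""
    # Escape the name so it is safe inside attribute values and link text.
    name = escape(binary_name, quote=True)
    return (
        f'<a href="{name}">{name}</a> '
        f'(<a href="{name}.html">HTML format</a>)'
    )
```

Calling `binary_link_pair("Tips_for_Exercise_and_Motivation.doc")` yields both links in the same pattern as the example above, so no binary link can be published without its HTML companion.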
3. At the top of every HTML copy, insert the following items:
- A link to the original binary file
- Short instructions on how to view that kind of binary file
- A descriptive title, META description, and META keywords (recommended)
- If you want FDSE to search these HTML documents but want all other search engines to skip them, add the following META headers:
    <meta name="robots" content="noindex" />
    <meta name="fdse-robots" content="index" />
These META headers tell all search engines to ignore the file, but then override that command for FDSE.
- If you plan to complete steps 5-8 below, add META headers to set the file size and last-modified date of the binary:
    <meta name="fdse-content-length" content="21841" />
    <meta name="fdse-last-modified" content="Mon, 30 Jun 2003 20:55:00 GMT" />
The content-length header is an integer giving the size of the binary file in bytes. See Calculating last-modified time for date formats to use with the last-modified header.
An example of an HTML copy with these insertions can be seen here.
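The size and date values for the two fdse-* headers can be read from the binary file itself instead of being typed by hand. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, and the date is written in the same GMT style as the example above, which is only one of the formats that Calculating last-modified time may describe.

```python
import os
from email.utils import formatdate


def fdse_size_and_date_meta(binary_path):
    """Return the two META headers describing the binary's size and date."""
    info = os.stat(binary_path)
    # usegmt=True produces the form "Mon, 30 Jun 2003 20:55:00 GMT".
    last_modified = formatdate(info.st_mtime, usegmt=True)
    return (
        f'<meta name="fdse-content-length" content="{info.st_size}" />\n'
        f'<meta name="fdse-last-modified" content="{last_modified}" />'
    )
```

Paste the returned lines into the head of the matching HTML copy, and regenerate them whenever the binary is replaced so that the search results do not show a stale size or date.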
4. With the first three steps completed, your site is already fully accessible.
Visitors who are not able to view binaries will always have an option to view the document in HTML format, because each link to a binary is accompanied by a link to an HTML equivalent (step #2).
Visitors who stumble upon the HTML version of the file (through a search result, for example, or a link that someone sent them) will always have the option of requesting the original binary, because the HTML version always starts with a link to the original (step #3).
The next steps are optional. They allow you to link to both the binary and HTML formats from the FDSE search result listings.
5. Make sure you are running FDSE version 2.0.0.0064 or newer.
6. Edit the source code file:
/search/searchmods/common.pl
Find the following lines of code:
    $pagedata{'file_type_icon'} = &get_file_type_icon_by_url( $pagedata{'url'} );
    return &PrintTemplate( 1, 'line_listing.txt', $::Rules{'language'}, \%pagedata, 0, \%::const);
7. Insert the following custom code, so that those lines read:
    $pagedata{'html_format_link'} = '';

    # handle .doc.html files:
    if ($pagedata{'url'} =~ m!\.doc\.html$!i) {
        $pagedata{'html_format_link'} = qq! - <a href="$pagedata{'url'}">View as HTML</a>!;
        $pagedata{'url'} =~ s!(\.doc)\.html$!$1!i;
    }

    # handle .pdf.html files:
    if ($pagedata{'url'} =~ m!\.pdf\.html$!i) {
        $pagedata{'html_format_link'} = qq! - <a href="$pagedata{'url'}">View as HTML</a>!;
        $pagedata{'url'} =~ s!(\.pdf)\.html$!$1!i;
    }

    # ... repeat as needed for other .binary.html types ...

    $pagedata{'file_type_icon'} = &get_file_type_icon_by_url( $pagedata{'url'} );
    return &PrintTemplate( 1, 'line_listing.txt', $::Rules{'language'}, \%pagedata, 0, \%::const);
(Note that in older versions, $::Rules and %::const are written as $Rules and %const. Keep using whatever names are used by the code you are editing.)
8. After making this change, the main search results will link directly to the binary PDF or DOC file, even though the HTML file was searched. A secondary link to the HTML-formatted file is included in the template variable %html_format_link%.
To expose that secondary link, edit the template file:
/search/searchdata/templates/line_listing.txt
This is the template that creates each search result line listing. By default, the HTML looks like this:
    <dl>
      <dt><b>%Rank%. <a href="%Redirector%%URL%">%Title%</a></b> %admin_options%</dt>
      <dd class="sr">
        %Description%<br />
        <b>URL:</b> %url% - %Size% - %Day% %Month% %Year%
        %context_line%
      </dd>
    </dl>
You can insert the variable %html_format_link% wherever you like. One solution is:
    <dl>
      <dt><b>%Rank%. <a href="%Redirector%%URL%">%Title%</a></b> %admin_options% %html_format_link%</dt>
      <dd class="sr">
        %Description%<br />
        <b>URL:</b> %url% - %Size% - %Day% %Month% %Year%
        %context_line%
      </dd>
    </dl>
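To see what the custom code from step 7 does to a single search result, the following Python sketch mirrors its URL rewriting outside FDSE. It is an illustration only, not code to install: the function name and the example URL are made up, and the suffix list simply repeats the two types handled in the Perl code.

```python
import re

# Same suffixes as the Perl code in step 7; extend for other .binary.html types.
BINARY_SUFFIXES = ("doc", "pdf")

_PATTERN = re.compile(
    r"(\.(?:%s))\.html$" % "|".join(BINARY_SUFFIXES), re.IGNORECASE
)


def split_result_url(url):
    """Return (main_url, html_format_link) for one search result URL."""
    if _PATTERN.search(url):
        # The secondary link keeps pointing at the HTML copy that was indexed.
        html_format_link = f' - <a href="{url}">View as HTML</a>'
        # The main link drops the trailing ".html" and points at the binary.
        return _PATTERN.sub(r"\1", url), html_format_link
    # Ordinary pages are left unchanged and get no secondary link.
    return url, ""


main_url, extra = split_result_url("http://www.example.com/docs/report.pdf.html")
print(main_url)
print(extra)
```

A result for an ordinary page such as index.html passes through untouched, which is why %html_format_link% is empty for most listings.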
Here is an example of search results listings after steps 5-8 have been completed.

Here is a similar example, with the additional customizations from Displaying file-type icons in search results.

"Maintain HTML copies of each written binary document"
http://www.xav.com/scripts/search/help/1179.html