
Maintain HTML files about each non-written binary document

Overview

To search non-written binary documents, such as images, audio, and video, one solution is to maintain HTML files about each binary. The benefits of this approach are:

This approach is similar to what you'll see if you search download.com for game demos, or if you search Yahoo! Movies for movie trailers. In each case, clicking on a link in the search results will take you to a text-rich HTML page about the binary file. That HTML page will then have a direct link to the binary itself, as well as help on how to download and view that file type.

This help file describes just one way to search binary files; see "Searching binary files" for others.

Steps to enable:

  1. Begin with a consistent naming convention whereby every binary document has an HTML file whose name is exactly the same, but with an additional ".html" extension.

    A directory listing would look like this:

    Mad_Max_2_Trailer.mpg
    Mad_Max_2_Trailer.mpg.html
    Oh_Canada.mp3
    Oh_Canada.mp3.html
  2. Next, use a consistent linking policy. Every time you link to a binary document, include a link to the HTML equivalent. For example:

    <a href="Mad_Max_2_Trailer.mpg">
    	Mad_Max_2_Trailer.mpg
    </a>
    (<a href="Mad_Max_2_Trailer.mpg.html">HTML info</a>)

    An example of such links can be seen here.

  3. At the top of every HTML copy, insert the following items:

    • A link to the binary file

    • Short instructions on how to view that kind of binary file

    • A descriptive title, META description, and META keywords (recommended)

    • If you want FDSE alone to index these HTML documents, and every other search engine to skip them, add the following META headers:

      <meta name="robots" content="noindex" />
      <meta name="fdse-robots" content="index" />

      These META headers tell all search engines to ignore the file, but then override that command for FDSE.

    • If you plan to complete steps 5-7 below, add META headers to set the file size and last-modified date of the binary:

      <meta name="fdse-content-length" content="21841" />
      <meta name="fdse-last-modified" content="Mon, 30 Jun 2003 20:55:00 GMT" />

      The content-length header is an integer giving the size of the binary file in bytes. See "Calculating last-modified time" for the date formats accepted by the last-modified header.

    An example of an HTML copy with these insertions can be seen here.

  4. With the first three steps completed, your site is now ready to be searched. The FDSE search engine will be able to discover all HTML files about your binaries (because they are interlinked from all of your other pages). The FDSE crawler will be able to extract meaningful titles, descriptions, and keywords, because that information is stored in an open, computer-readable HTML format. And finally, FDSE will be able to modify the search results to point directly to the binary files, because a consistent naming convention has been used.

    Proceed with the next steps to make the search results point directly to the binaries.

  5. Edit the source code file:

    /search/searchmods/common.pl

    Find the following lines of code:

    $pagedata{'file_type_icon'} = &get_file_type_icon_by_url( $pagedata{'url'} );
    
    return &PrintTemplate( 1, 'line_listing.txt', $::Rules{'language'}, \%pagedata, 0, \%::const);
  6. Insert the following custom code, so that those lines read:

    $pagedata{'html_format_link'} = '';
    
    # handle .mp3.html files:
    if ($pagedata{'url'} =~ m!\.mp3\.html$!i) {
    	$pagedata{'html_format_link'} = qq! - <a href="$pagedata{'url'}">HTML Info</a>!;
    	$pagedata{'url'} =~ s!(\.mp3)\.html$!$1!i;
    	}
    
    # handle .mpg.html files:
    if ($pagedata{'url'} =~ m!\.mpg\.html$!i) {
    	$pagedata{'html_format_link'} = qq! - <a href="$pagedata{'url'}">HTML Info</a>!;
    	$pagedata{'url'} =~ s!(\.mpg)\.html$!$1!i;
    	}
    
    # ... repeat as needed for other .binary.html types ...
    
    $pagedata{'file_type_icon'} = &get_file_type_icon_by_url( $pagedata{'url'} );
    
    return &PrintTemplate( 1, 'line_listing.txt', $::Rules{'language'}, \%pagedata, 0, \%::const);

    (Note that in older versions, $::Rules and %::const are written as $Rules and %const. Keep using whatever names are used by the code you are editing.)

  7. After making this change, the main search results will link directly to the binary MP3 or MPG file, even though the HTML file was searched. A secondary link to the HTML information file is included in the template variable %html_format_link%.

    To expose that secondary link, edit the template file:

    /search/searchdata/templates/line_listing.txt

    This is the template that creates each search result line listing. By default, the HTML looks like this:

    <dl>
    	<dt><b>%Rank%. <a href="%Redirector%%URL%">%Title%</a></b> %admin_options%</dt>
    	<dd class="sr">
    		%Description%<br />
    		<b>URL:</b> %url% - %Size% - %Day% %Month% %Year%
    		%context_line%
    	</dd>
    </dl>

    You can just insert the variable %html_format_link% wherever you like. One solution is:

    <dl>
    	<dt><b>%Rank%. <a href="%Redirector%%URL%">%Title%</a></b>
    		%admin_options% %html_format_link%</dt>
    	<dd class="sr">
    		%Description%<br />
    		<b>URL:</b> %url% - %Size% - %Day% %Month% %Year%
    		%context_line%
    	</dd>
    </dl>
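
The per-extension blocks in step 6 all follow the same pattern, so they can be folded into one helper. The following is only a sketch and is not part of FDSE: the helper name split_companion_url and the @binary_ext list are hypothetical, and the extensions shown are just the two used in this help file.

```perl
use strict;
use warnings;

# Hypothetical list of binary extensions that have ".html" companions.
my @binary_ext = qw(mp3 mpg);

# Hypothetical helper: given the URL of an indexed page, return the URL
# to show in the search results plus the secondary "HTML Info" link.
# For pages that are not companions, the URL is returned unchanged and
# the link is an empty string.
sub split_companion_url {
    my ($url) = @_;
    my $alternation = join '|', map { quotemeta($_) } @binary_ext;
    if ($url =~ m!^(.*\.(?:$alternation))\.html$!i) {
        return ($1, qq! - <a href="$url">HTML Info</a>!);
    }
    return ($url, '');
}

my ($url, $link) = split_companion_url('http://example.com/Oh_Canada.mp3.html');
print "$url\n$link\n";
```

Assuming the helper were pasted into common.pl, the blocks from step 6 could be reduced to a single assignment, ($pagedata{'url'}, $pagedata{'html_format_link'}) = split_companion_url($pagedata{'url'}); placed before the call to get_file_type_icon_by_url. Supporting another binary type then only means adding its extension to the list.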

Here is an example of search result listings after steps 5-7 have been completed. Clicking on the bold title takes the visitor to the binary file (even though the HTML file was the one indexed by the search crawler). Clicking on the "HTML Info" link takes the visitor to the HTML file.

Example of search results with HTML-equivalent links added.

Here is a similar example, with the additional customizations from "Displaying file-type icons in search results".

Example of search results with HTML-equivalent links and file type icons added.
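
The two META headers from step 3 that carry the binary's size and date can be generated rather than typed by hand. The script below is a minimal sketch, not an FDSE tool: the function names http_date and binary_meta_headers are hypothetical, and the date layout simply copies the example shown in step 3. Check "Calculating last-modified time" for the formats FDSE actually accepts.

```perl
use strict;
use warnings;

my @days   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @months = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time like the step 3 example:
# "Mon, 30 Jun 2003 20:55:00 GMT". Names are hard-coded in English so
# the output does not depend on the system locale.
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf '%s, %02d %s %04d %02d:%02d:%02d GMT',
        $days[$wday], $mday, $months[$mon], $year + 1900, $hour, $min, $sec;
}

# Hypothetical helper: build both META headers for one binary file,
# using its size in bytes and its modification time from stat().
sub binary_meta_headers {
    my ($path) = @_;
    my @st = stat($path) or die "Cannot stat $path: $!";
    my ($size, $mtime) = @st[7, 9];
    return qq{<meta name="fdse-content-length" content="$size" />\n}
         . qq{<meta name="fdse-last-modified" content="}
         . http_date($mtime) . qq{" />\n};
}

# Print the headers for every binary named on the command line.
print binary_meta_headers($_) for @ARGV;
```

If the script were saved under a name of your choosing and run with Oh_Canada.mp3 as its argument, it would print the two lines to paste into Oh_Canada.mp3.html. The values describe the binary as it exists on disk at that moment, so they need to be regenerated whenever the binary is replaced.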


    "Maintain HTML files about each non-written binary document"
    http://www.xav.com/scripts/search/help/1180.html