Perform manual extraction of text from binary files

This help article describes how to make FDSE index binary documents when you have neither a user-friendly HTML copy of the binary nor an automatic converter for it.

Overview

This approach involves creating a private HTML file that contains all of the information about the binary to be indexed, such as its title, description, keywords, text, last-modified date, and size. You create the file, and then configure FDSE to discover it. Finally, you configure FDSE to rewrite this file's URL in the search results so that it points to the original binary, instead of the private HTML file that was indexed.

Steps to enable:

  1. First, create a private HTML file. For best results, give the file the same name as the binary, with ".private.html" appended to the full file name. Here is an example:

    -rwxr-xr-x  1 xav  users  24785176 Jul  3 22:01 MSAoE.exe
    -rwxr-xr-x  1 xav  users       929 Jul  3 22:36 MSAoE.exe.private.html
  2. In the HTML file, include the following items:

    • A meaningful title, META description, and META keywords.

    • An "fdse-content-length" META header, whose content is an integer for the byte size of the binary file.

    • An "fdse-last-modified" META header, whose content is the last-modified time of the binary file, formatted as an HTTP date string. See Calculating last-modified time for date formats.

    • A pair of robots exclusion META tags. One general tag forbids all search engines from indexing the private file. A separate FDSE-only tag overrides this rule for FDSE.

    • Finally, include any searchable text in the body of the document.

    Here is an example of META information and searchable text for the Age of Empires trial download:

    <html>
    <head>
    
    	<title>Microsoft Age of Empires 1.0 - Trial Version</title>
    	<meta name="description" content=
    		"Trial of first Age of Empires version.  Includes hours of game play
    		and six scenarios."
    	/>
    	<meta name="keywords" content=
    		"Microsoft Age of Empires 1.0 AoE trial demo freeware shareware"
    	/>
    
    	<meta name="fdse-content-length" content="24785176" />
    	<meta name="fdse-last-modified" content="Thu, 02 Oct 1997 00:00:00 GMT" />
    
    	<meta name="robots" content="noindex" />
    	<meta name="fdse-robots" content="index" />
    
    </head>
    <body>
    
    <!-- Enter additional searchable text here -->
    
    Single player campaign "Reign of the Hittites" includes scenarios Growing Pains,
    Opening Moves, Fall of the Mitanni, Battle of Kadesh.  Includes multi-player
    network game.
    
    Strategy game similar to Civilization, Command and Conquer Red Alert series
    
    <a href="MSAoE.exe">MSAoE.exe</a>
    
    </body>
    </html>
  3. Next, you need to configure FDSE so that it can discover the *.private.html files that you've created.

    If you are using the file system crawler (used with Runtime Realms and Website Realms - File System Discovery), then you don't need to do anything new. Just make sure that the "html" extension is included in the "Ext" General Setting.

    If you are using the web crawler -- which is used with most realms -- then use the "seed.html" technique described at Indexing pages which aren't linked from other pages.

  4. Next, rebuild your realm. Confirm that your *.private.html pages are found. At this point, the search results will link directly to those private HTML files.

  5. Finally, go to Admin Page => Filter Rules => (scroll down) => URL Rewrite Rules. Create an Output Filter with the following properties:

    Enabled: [x] (checked)
    Verbose: [_] (not checked)
    Pattern: \.private\.html$
    Replace:   (nothing - blank string)
    Comment: Map *.private.html links to the binary URL
  6. After creating this rule, perform test searches. All of the search results based on *.private.html files will link to the binary version.
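
Steps 1 and 2 can be scripted when there are many binaries to describe. The following is a minimal Python sketch, not part of FDSE: the function names build_private_html and write_private_html are hypothetical, and the sketch assumes that the binary's size and modification time on disk are the values you want in the "fdse-content-length" and "fdse-last-modified" headers.

```python
import html
import os
from email.utils import formatdate


def build_private_html(binary_path, title, description, keywords, body_text):
    """Return the text of a private HTML stub that describes binary_path."""
    size = os.path.getsize(binary_path)  # byte size for fdse-content-length
    # usegmt=True produces an HTTP date string ending in "GMT"
    last_modified = formatdate(os.path.getmtime(binary_path), usegmt=True)
    name = html.escape(os.path.basename(binary_path), quote=True)
    return (
        "<html>\n<head>\n"
        f"\t<title>{html.escape(title)}</title>\n"
        f'\t<meta name="description" content="{html.escape(description, quote=True)}" />\n'
        f'\t<meta name="keywords" content="{html.escape(keywords, quote=True)}" />\n'
        f'\t<meta name="fdse-content-length" content="{size}" />\n'
        f'\t<meta name="fdse-last-modified" content="{last_modified}" />\n'
        # Hide the stub from other search engines, but let FDSE index it
        '\t<meta name="robots" content="noindex" />\n'
        '\t<meta name="fdse-robots" content="index" />\n'
        "</head>\n<body>\n"
        f"{html.escape(body_text)}\n"
        f'<a href="{name}">{name}</a>\n'
        "</body>\n</html>\n"
    )


def write_private_html(binary_path, title, description, keywords, body_text):
    """Write <binary>.private.html next to the binary and return its path."""
    private_path = binary_path + ".private.html"
    text = build_private_html(binary_path, title, description, keywords, body_text)
    with open(private_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return private_path
```

For example, calling write_private_html("MSAoE.exe", "Microsoft Age of Empires 1.0 - Trial Version", "Trial of first Age of Empires version.", "Microsoft Age of Empires 1.0 AoE", "Single player campaign ...") would write MSAoE.exe.private.html in the same folder. If the file's timestamp on disk does not reflect the real release date, edit the "fdse-last-modified" value by hand afterwards.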

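For the web crawler case in step 3, the seed page can also be generated. The sketch below is an assumption about what such a page looks like: it treats "seed.html" as a plain HTML page that links to each private file, and it assumes the seed page is served from the same folder as the *.private.html files so that relative links resolve. The function name build_seed_page is hypothetical; see the referenced help article for the technique itself.

```python
import html
import os
from urllib.parse import quote


def build_seed_page(folder):
    """Return HTML that links to every *.private.html file in folder."""
    names = sorted(n for n in os.listdir(folder) if n.endswith(".private.html"))
    # quote() percent-encodes characters that are not safe in a URL path
    links = "".join(
        f'<a href="{quote(n)}">{html.escape(n)}</a><br />\n' for n in names
    )
    return (
        "<html>\n<head>\n<title>Seed page for private HTML files</title>\n"
        "</head>\n<body>\n" + links + "</body>\n</html>\n"
    )
```

Writing the returned text to seed.html in that folder gives the crawler one page from which every private file is reachable.
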
See Maintain HTML files about each non-written binary document for a similar indexing technique, in which the HTML files are not kept private.

Note that if your private HTML files cannot be stored in the same folder as the binaries, you can still use this technique, but you will need a more complex URL Rewrite Rule.
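
Both kinds of rule can be tried outside FDSE before you save them. The Python sketch below applies the pattern from step 5 to a sample URL, then shows one possible shape for the more complex rule. The folder names /private/ and /downloads/, the example.com URLs, and the function names are hypothetical, and the \1 backreference is Python syntax; the syntax that the FDSE Replace field accepts for captured groups is not covered here.

```python
import re

# Rule from step 5: strip the ".private.html" suffix
SIMPLE_PATTERN = r"\.private\.html$"


def rewrite_simple(url):
    """Map a private HTML URL to the binary URL in the same folder."""
    return re.sub(SIMPLE_PATTERN, "", url)


# Hypothetical layout: stubs live under /private/, binaries under /downloads/
MOVED_PATTERN = r"/private/([^/]+)\.private\.html$"
MOVED_REPLACE = r"/downloads/\1"


def rewrite_moved(url):
    """Map a stub URL under /private/ to the binary under /downloads/."""
    return re.sub(MOVED_PATTERN, MOVED_REPLACE, url)


print(rewrite_simple("http://www.example.com/files/MSAoE.exe.private.html"))
print(rewrite_moved("http://www.example.com/private/MSAoE.exe.private.html"))
```

URLs that do not end in ".private.html" pass through both functions unchanged, which matches the intent that ordinary search results are not affected by the rule.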


    "Perform manual extraction of text from binary files"
    http://www.xav.com/scripts/search/help/1184.html