Format of the text index files

Each line in the index file maps to a single record, and each record maps to a single document. Each record consists of a 16-byte numeric header, followed by several delimited text fields, and terminated with a newline. The fields are:

  1. 16-byte numeric header
  2. document last modified time, expressed as number of seconds since 1970
  3. document last index time, as number of seconds since 1970
  4. URL, u
  5. title, t
  6. description, d
  7. searchable* URL, uM
  8. searchable title, uT
  9. searchable description, uD
  10. searchable keywords, uK
  11. document text, h
  12. searchable links, l

A "searchable" field contains text which has been run through the CompressStrip subroutine. For example, if a document title is "Sally Sue-Smith", it will be recorded literally in that form in the title field but the searchable title will be stored as "sally sue smith" (lowercase with punctuation stripped) to allow for exact literal matching against search keywords.
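The normalization described above can be approximated in Python as follows. This is only a sketch: the function name compress_strip_sketch is hypothetical, and the real CompressStrip subroutine may treat accented characters and other edge cases differently. It assumes that every run of non-alphanumeric ASCII characters becomes a single space.

```python
import re


def compress_strip_sketch(text: str) -> str:
    """Rough stand-in for CompressStrip (assumed behavior, not the real code)."""
    lowered = text.lower()
    # Replace each run of characters other than a-z and 0-9 with one space.
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Drop any leading or trailing space left over from stripped punctuation.
    return spaced.strip()


print(compress_strip_sketch("Sally Sue-Smith"))
print(compress_strip_sketch("http://www.whitehouse.gov/"))
```

Under these assumptions the two calls give "sally sue smith" and "http www whitehouse gov", which matches the searchable title and searchable URL forms shown in this document.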

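The 16-byte numeric header contains this info:

  1. 2-digit promotion multiplier (01 is the standard, unpromoted value)
  2. 8-digit last modified date: 2-digit day, 2-digit 0-based month, 4-digit year
  3. 6-digit file size in bytes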
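A minimal Python sketch of decoding the header, assuming the 2 + 8 + 6 digit split that the example record in this document uses (promotion multiplier, then day, 0-based month and year, then file size in bytes). The function name decode_header and the dictionary keys are illustrative and do not come from the search engine's code.

```python
from datetime import date


def decode_header(header: str) -> dict:
    """Split a 16-digit header into its three assumed parts."""
    if len(header) != 16 or not (header.isascii() and header.isdigit()):
        raise ValueError("header must be exactly 16 ASCII digits")
    promote = int(header[0:2])      # 01 = standard multiplier
    day = int(header[2:4])
    month = int(header[4:6]) + 1    # stored 0-based, so add 1 for the calendar month
    year = int(header[6:10])
    size_bytes = int(header[10:16])
    return {
        "promote": promote,
        "modified_date": date(year, month, day),
        "size_bytes": size_bytes,
    }


info = decode_header("0102102001028491")
print(info["promote"], info["modified_date"].isoformat(), info["size_bytes"])
```

For the header of the example record, this yields a multiplier of 1, a date of 2001-11-02 and a size of 28491 bytes.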

Example Text Record

Here is an example record for the whitehouse.gov main page. The record has been word-wrapped for readability, but in the file will be on a single line:

0102102001028491 1004742542 1004745182 u= http://www.whitehouse.gov/ t= Welcome to the White House d= Whitehouse.gov is the official web site for the White House uM= http www whitehouse gov uT= welcome to the white house uD= whitehouse gov is the official web site for the white house uK= energy tax news w bush policies h= skip to content text only skip to search president news and policies vice president history and tours {...} l=

Explanation:

01
	- means this URL has not been promoted; standard multiplier
02102001
	- means the last modified date is November 02, 2001
	- 02 is the day, 10 is November (months are 0-based), 2001 is the year
028491
	- means the file size is 28,491 bytes, or about 28 KB
1004742542
	- exact last modified time - Fri Nov  2 15:09:02 2001
1004745182
	- exact last index time - Fri Nov  2 15:53:02 2001
u= http://www.whitehouse.gov/
	- URL
t= Welcome to the White House
	- literal title
d= Whitehouse.gov is the official web site for the...
	- literal description
uM= http www whitehouse gov
	- searchable URL
uT= welcome to the white house
	- searchable title
uD= whitehouse gov is the official web site for the
	- searchable description
uK= energy tax news w bush policies
	- searchable META keywords
h= skip to content text only skip to search
	- actual text of document
l=
	- searchable links; empty because "Index Links" General Setting is
		disabled by default
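
The breakdown above can be reproduced with a short Python sketch that splits one record line into its parts. It is not the parse_text_record subroutine; the name parse_record_sketch is hypothetical. It assumes that the field markers (u=, t=, d=, uM=, uT=, uD=, uK=, h=, l=) always appear in this order, each preceded by whitespace, and that no field value itself contains a later marker.

```python
import re

FIELD_ORDER = ["u", "t", "d", "uM", "uT", "uD", "uK", "h", "l"]

# Build: header, two timestamps, then each "name= value" in the fixed order.
_pattern = r"^(\d{16})\s+(\d+)\s+(\d+)"
for _name in FIELD_ORDER:
    _pattern += rf"\s+{_name}=\s*(.*?)"
_pattern += r"\s*$"
RECORD_RE = re.compile(_pattern)


def parse_record_sketch(line: str) -> dict:
    """Parse one index line into header, timestamps and named fields."""
    match = RECORD_RE.match(line)
    if match is None:
        raise ValueError("line does not look like a text index record")
    record = {
        "header": match.group(1),
        "last_modified": int(match.group(2)),   # seconds since 1970
        "last_indexed": int(match.group(3)),    # seconds since 1970
    }
    # Groups 4 onward hold the field values in FIELD_ORDER.
    for offset, name in enumerate(FIELD_ORDER):
        record[name] = match.group(4 + offset)
    return record


# The example record, with the document text (h=) shortened.
line = (
    "0102102001028491 1004742542 1004745182 "
    "u= http://www.whitehouse.gov/ "
    "t= Welcome to the White House "
    "d= Whitehouse.gov is the official web site for the White House "
    "uM= http www whitehouse gov "
    "uT= welcome to the white house "
    "uD= whitehouse gov is the official web site for the white house "
    "uK= energy tax news w bush policies "
    "h= skip to content text only skip to search "
    "l="
)
record = parse_record_sketch(line)
print(record["u"], "|", record["uT"], "|", repr(record["l"]))
```

The empty l= field comes back as an empty string, which mirrors the case where link indexing is disabled.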

Advanced: Custom Coding

All record lines are generated by the text_record_from_hash subroutine. Incoming text lines are generally parsed by the parse_text_record subroutine. The search algorithms, however, apply regular expressions directly to unparsed text lines, so they also depend on the record format, as do various other admin subroutines. If you need to change the file format, search the code for "u=" to find all the places that parse text records.
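
To illustrate why code that matches against unparsed lines depends on the format, here is a hedged Python sketch that tests whether a word occurs inside one named field of a raw record line. It is not the regular expression used by the search engine; the function name field_contains and the marker list are assumptions based on the field names in this document.

```python
import re

MARKERS = ("u", "t", "d", "uM", "uT", "uD", "uK", "h", "l")


def field_contains(line: str, field: str, term: str) -> bool:
    """Return True if term appears as a whole word inside the given field."""
    # Any field marker preceded by whitespace ends the current field.
    next_marker = r"\s(?:" + "|".join(MARKERS) + r")="
    pattern = (
        rf"\s{re.escape(field)}=\s"      # start of the requested field
        rf"(?:(?!{next_marker}).)*?"     # stay inside that field
        rf"\b{re.escape(term)}\b"        # the term as a whole word
    )
    return re.search(pattern, line) is not None


raw = (
    "0102102001028491 1004742542 1004745182 u= http://www.whitehouse.gov/ "
    "t= Welcome to the White House d= Official web site "
    "uM= http www whitehouse gov uT= welcome to the white house "
    "uD= official web site uK= energy tax news h= skip to content l="
)
print(field_contains(raw, "uT", "white"))
print(field_contains(raw, "uT", "energy"))
```

Because the pattern hard-codes the marker names and their surrounding spaces, renaming or reordering fields would silently break a match like this, which is the kind of dependency the paragraph above warns about.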


    "Format of the text index files"
    http://www.xav.com/scripts/search/help/1056.html