Format of the text index files
Each line in the index file maps to a single record, and each record maps to a single document. Each record consists of a 16-byte numeric header, followed by several delimited text fields, and terminated with a newline. The fields are:
- 16-byte numeric header
- document last modified time, expressed as the number of seconds since January 1, 1970 (Unix time)
- document last index time, expressed as the number of seconds since January 1, 1970 (Unix time)
- URL, u
- title, t
- description, d
- searchable URL, uM
- searchable title, uT
- searchable description, uD
- searchable keywords, uK
- document text, h
- searchable links, l
A "searchable" field contains text which has been run through the CompressStrip subroutine. For example, if a document title is "Sally Sue-Smith", it will be recorded literally in that form in the title field but the searchable title will be stored as "sally sue smith" (lowercase with punctuation stripped) to allow for exact literal matching against search keywords.
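The transformation described above can be approximated with a short sketch. This is not the actual CompressStrip subroutine; the function name `compress_strip_sketch` is hypothetical, and the rule used here (lowercase, then replace every run of characters other than ASCII letters and digits with one space) is an assumption inferred from the examples in this document. The real routine may treat accented characters or other cases differently.

```python
import re

def compress_strip_sketch(text):
    """Approximate a 'searchable' field: lowercase, punctuation replaced by spaces.

    Assumption: only ASCII letters and digits are kept; the real
    CompressStrip subroutine may behave differently.
    """
    lowered = text.lower()
    # Collapse each run of non-alphanumeric characters into a single space.
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()

print(compress_strip_sketch("Sally Sue-Smith"))
print(compress_strip_sketch("http://www.whitehouse.gov/"))
```

Applied to the title and URL used in this document, the sketch yields "sally sue smith" and "http www whitehouse gov", which agree with the searchable forms shown here.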
The 16-byte numeric header contains this info:
- 01-02; the promote multiplier, 01 for a standard document. The multiplier ranges from 01 to 99.
- 03-10; alternate last modified time, in the format DDMMYYYY; used for display purposes. The MM field is 0-based: it ranges from 00 to 11 for January to December.
- 11-16; the document size, in bytes. Ranges from 000001 to 999999. The value wraps for files of 1,000,000 bytes or more, so a 1,600,400 byte file is recorded as 600,400 bytes (a fix is in progress).
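As an illustration of the layout above, here is a minimal sketch that slices a 16-character header into its three parts. The names `parse_header_sketch` and `recorded_size_sketch` are hypothetical and do not come from the script itself. Treating the size wrap as a remainder after division by 1,000,000 is an assumption based on the 1,600,400 byte example.

```python
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def parse_header_sketch(header):
    """Split a 16-digit header into multiplier, display date, and size."""
    if len(header) != 16 or not header.isdigit():
        raise ValueError("expected exactly 16 digits")
    month_index = int(header[4:6])  # 0-based: 00 is January, 11 is December
    return {
        "promote_multiplier": int(header[0:2]),  # positions 01-02
        "day": int(header[2:4]),                 # positions 03-04 (DD)
        "month_index": month_index,              # positions 05-06 (MM)
        "month_name": MONTH_NAMES[month_index],
        "year": int(header[6:10]),               # positions 07-10 (YYYY)
        "size_bytes": int(header[10:16]),        # positions 11-16
    }

def recorded_size_sketch(actual_bytes):
    """Assumed wrap behaviour: only the last six decimal digits are kept."""
    return actual_bytes % 1_000_000

print(parse_header_sketch("0102102001028491"))
print(recorded_size_sketch(1_600_400))
```

Because the size field keeps only six digits, a reader of the header cannot tell a 600,400 byte file from a 1,600,400 byte one.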
Example Text Record
Here is an example record for the whitehouse.gov main page. The record has been word-wrapped for readability, but in the file it is stored on a single line:
0102102001028491 1004742542 1004745182 u= http://www.whitehouse.gov/ t= Welcome to the White House d= Whitehouse.gov is the official web site for the White House uM= http www whitehouse gov uT= welcome to the white house uD= whitehouse gov is the official web site for the white house uK= energy tax news w bush policies h= skip to content text only skip to search president news and policies vice president history and tours {...} l=
Explanation:
- 01 - means this URL has not been promoted; standard multiplier
- 02102001 - means the last modified time is November 02, 2001 - 02 is day, 10 is November for 0-based month, 2001 is year
- 028491 - means the file size is 28,491 bytes or 28 kb
- 1004742542 - exact last modified time - Fri Nov 2 15:09:02 2001
- 1004745182 - exact last index time - Fri Nov 2 15:53:02 2001
- u= http://www.whitehouse.gov/ - URL
- t= Welcome to the White House - literal title
- d= Whitehouse.gov is the official web site for the... - literal description
- uM= http www whitehouse gov - searchable URL
- uT= welcome to the white house - searchable title
- uD= whitehouse gov is the official web site for the - searchable description
- uK= energy tax news w bush policies - searchable META keywords
- h= skip to content text only skip to search - actual text of document
- l= - searchable links; empty because "Index Links" General Setting is disabled by default
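For readers who want to inspect an index file outside the script, the following sketch splits one record line into named fields. It is not the parse_text_record subroutine. The function name `parse_record_sketch`, the regular expression, and the assumption that each field marker is preceded by whitespace are all illustrative. A literal title or description that happens to contain text such as " d= " would confuse it.

```python
import re
from datetime import datetime, timezone

FIELDS = ["u", "t", "d", "uM", "uT", "uD", "uK", "h", "l"]

# Assumed layout: header, two epoch times, then "name= value" fields in fixed order.
_BODY = r"\s+".join(rf"{name}=\s*(?P<{name}>.*?)" for name in FIELDS)
RECORD_RE = re.compile(
    r"^(?P<header>\d{16})\s+(?P<modified>\d+)\s+(?P<indexed>\d+)\s+" + _BODY + r"\s*$"
)

def parse_record_sketch(line):
    """Return a dict of the fields in one record line, or raise ValueError."""
    match = RECORD_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("line does not look like a text index record")
    record = match.groupdict()
    record["modified"] = int(record["modified"])
    record["indexed"] = int(record["indexed"])
    # Epoch seconds converted to UTC; the times in the explanation above are local.
    record["modified_utc"] = datetime.fromtimestamp(record["modified"], tz=timezone.utc)
    return record

SAMPLE_LINE = (
    "0102102001028491 1004742542 1004745182 u= http://www.whitehouse.gov/ "
    "t= Welcome to the White House "
    "d= Whitehouse.gov is the official web site for the White House "
    "uM= http www whitehouse gov uT= welcome to the white house "
    "uD= whitehouse gov is the official web site for the white house "
    "uK= energy tax news w bush policies "
    "h= skip to content text only skip to search president news and policies "
    "vice president history and tours {...} l=\n"
)

record = parse_record_sketch(SAMPLE_LINE)
print(record["u"], "|", record["t"], "|", record["modified_utc"].isoformat())
```

The timestamp is converted to UTC here, so the printed hour differs from the local time shown in the explanation of the example record.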
Advanced: Custom Coding
All record lines are generated by the text_record_from_hash subroutine. Incoming text lines are generally parsed by the parse_text_record subroutine, although the search algorithms apply regular expressions directly to unparsed text lines and so also depend on the record format. Various other admin subroutines depend on the text format as well. If you need to change the file format, search the code for "u=" to find all the places that parse text.
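To show why matching against unparsed lines ties code to the record format, here is a small sketch of that style of matching. It is not one of the script's own expressions. The function name `searchable_title_contains_sketch` is hypothetical, and the pattern assumes the uT= field is immediately followed by the uD= field and contains no "=" character, which holds for the searchable fields shown in this document.

```python
import re

def searchable_title_contains_sketch(raw_line, phrase):
    """Return True if `phrase` occurs as whole words inside the uT= field.

    `phrase` is expected to be already lowercased and stripped of punctuation.
    """
    pattern = re.compile(
        r" uT= (?:[^=]* )?" + re.escape(phrase) + r" (?:[^=]* )?uD="
    )
    return pattern.search(raw_line) is not None

line = (
    "0102102001028491 1004742542 1004745182 u= http://www.whitehouse.gov/ "
    "t= Welcome to the White House d= Whitehouse.gov is the official web site "
    "uM= http www whitehouse gov uT= welcome to the white house "
    "uD= whitehouse gov is the official web site uK= energy tax news "
    "h= skip to content l="
)

print(searchable_title_contains_sketch(line, "white house"))  # phrase in the title
print(searchable_title_contains_sketch(line, "energy"))       # only in the keywords
```

A pattern like this stops working if the field markers or their order change, so any format change has to be carried through to every such expression.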
http://www.xav.com/scripts/search/help/1056.html