Character conversion settings

The Fluid Dynamics Search Engine (FDSE) has two Character Conversion settings. Both are meant to increase the chance that a visitor's search terms match the documents being searched.

The setting Character Conversion: Case Insensitive will cause all searches to be done without regard to case. In practice, this means that all indexed documents and all search terms are converted to lowercase internally. Comparisons are then made on these lowercase internal forms.
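
The idea of comparing lowercase internal forms can be sketched as follows. This is a minimal illustration, not FDSE's actual code: the helper names to_internal and matches are hypothetical, and a plain substring test stands in for the real index lookup.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce text to a lowercase internal form.
sub to_internal {
    my ($text) = @_;
    return lc $text;
}

# Hypothetical matcher: returns 1 if the term occurs in the document
# text once both sides are lowercased, otherwise 0.
sub matches {
    my ($document, $term) = @_;
    return index(to_internal($document), to_internal($term)) >= 0 ? 1 : 0;
}

print matches('Fluid Dynamics Search Engine', 'DYNAMICS') ? "match\n" : "no match\n";
```

Because both the document and the search term pass through the same conversion, "DYNAMICS", "Dynamics" and "dynamics" all match the same text.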

The setting Character Conversion: Accent Insensitive will cause all searches to be done without regard to accents. When this option is enabled, a special accent-reduction algorithm is used to create accent-insensitive internal formats.
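
A rough sketch of what an accent-reduction step might look like is shown below. It is an assumption-laden illustration rather than FDSE's algorithm: the table %accent_map and the function reduce_accents are hypothetical, and only the entry for byte 220 follows the mapping shown elsewhere on this page. The other entries are made up for the example.

```perl
use strict;
use warnings;

# Illustrative subset only; keys are ISO Latin-1 byte numbers.
my %accent_map = (
    220 => 'ue',  # capital U with diaeresis (as in the Custom Mappings example)
    224 => 'a',   # a with grave accent (assumed mapping)
    233 => 'e',   # e with acute accent (assumed mapping)
    241 => 'n',   # n with tilde (assumed mapping)
);

# Hypothetical reducer: replace each mapped character with its
# unaccented form and pass every other character through unchanged.
sub reduce_accents {
    my ($text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        my $code = ord $char;
        $out .= exists $accent_map{$code} ? $accent_map{$code} : $char;
    }
    return $out;
}

print reduce_accents('caf' . chr(233)), "\n";   # prints "cafe"
```

Applying the same reduction to documents and to search terms lets a search for "cafe" find a page that spells the word with an accented e.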

To see how certain characters will be translated, go to "Admin Page" => "User Interface" => "Language and Locale Settings" => "Character Conversion Map". From there you can adjust your character conversion settings, and review the mappings available for all possible settings.

Note: after updating character conversion settings, all index files must be rebuilt. Otherwise the existing index will still hold terms in the old internal format, and many searches will fail to return results.

See also:

W3 standard on HTML 4 entities
www.w3.org/TR/html4/charset.html

Microsoft DHTML reference on MSIE supported entities
msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset1.asp

FDSE always uses the ISO Latin-1 character set, corresponding to the first 256 entries of the Unicode character repertoire. On browsers with a different default character set, the "entities" and "character" columns may display different characters. To search documents with a different character set, the best approach is to set both Case Insensitive and Accent Insensitive to 0 (unchecked).

Custom Mappings

To modify the character mappings, edit the file "searchmods/common.pl". Find the subroutine create_conversion_code. Inside this subroutine are two hashes, %base_charset and %extended_charset. For most character sets, only the extended charset needs to be modified. Example:

my %extended_charset = (
	...
	220 => [ 'ue', 'Ue', v252, 0, 'Capital U, diaeresis / umlaut'],

The hash key is the character's byte number, 220 (Ü) in the above example. The first two values in the list are what the character is converted to under accent insensitivity: first for case-insensitive searching ('ue'), then for case-sensitive searching ('Ue'). The next two values are the conversions under accent sensitivity, again case-insensitive first (ü) and case-sensitive second. The final value is a text description of the character, displayed on the admin page.

In the example above, the format v252 is used to represent the character with byte number 252, ü. In Perl this is a v-string literal, equivalent to chr(252). This format is used because some text editors will mangle raw extended characters.

Other than strings and the vXXX format, two special values are available. The value "-1" means that this character is a non-word character and should be stripped. The value "0" means that this character should be retained with no changes.
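
The way one of these entries might be interpreted can be sketched as follows. This is only an illustration: convert_char is a hypothetical function, not part of FDSE, and the slot order it assumes (case-insensitive before case-sensitive within each pair) is inferred from the example entry above.

```perl
use strict;
use warnings;

# One entry copied from the example above; v252 is the same as chr(252).
my %extended_charset = (
    220 => [ 'ue', 'Ue', v252, 0, 'Capital U, diaeresis / umlaut' ],
);

# Hypothetical resolver. Assumed slot order: slots 0 and 1 apply under
# accent insensitivity, slots 2 and 3 under accent sensitivity; within
# each pair the case-insensitive value comes first.
sub convert_char {
    my ($byte, $accent_insensitive, $case_insensitive) = @_;
    my $entry = $extended_charset{$byte};
    return chr($byte) unless $entry;           # no mapping: keep as-is
    my $slot  = ($accent_insensitive ? 0 : 2) + ($case_insensitive ? 0 : 1);
    my $value = $entry->[$slot];
    return ''         if $value eq '-1';       # non-word character: strip
    return chr($byte) if $value eq '0';        # retain with no changes
    return $value;                             # string or vXXX replacement
}

print convert_char(220, 1, 1), "\n";           # prints "ue"
print ord(convert_char(220, 0, 1)), "\n";      # prints "252"
```

With both settings unchecked, convert_char(220, 0, 0) hits the "0" slot and returns the original character unchanged.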


Custom HTML Entities

To modify the HTML entity mappings, edit the file "searchmods/common.pl". Find the subroutine create_conversion_code. Inside this subroutine is the hash %named_entities, which stores all "entity name" <=> "character number" mappings.

Example:

my %named_entities = (
	'#338' => 140,
	'#339' => 156,
	'#352' => 138,
	'#353' => 154,
	'AElig' => 198,
	'Aacute' => 193,
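
A table like this could be applied to HTML text roughly as shown below. This is a sketch under stated assumptions: decode_entities is a hypothetical function, the hash holds only three entries copied from the excerpt above, and unknown entities are simply left in place.

```perl
use strict;
use warnings;

# Entries copied from the excerpt above; the real hash is longer.
my %named_entities = (
    '#338'   => 140,
    'AElig'  => 198,
    'Aacute' => 193,
);

# Hypothetical decoder: replace each known entity with the character
# whose number is stored in the table; leave unknown entities untouched.
sub decode_entities {
    my ($html) = @_;
    $html =~ s/&(#?\w+);/exists $named_entities{$1} ? chr($named_entities{$1}) : "&$1;"/ge;
    return $html;
}

print ord(decode_entities('&AElig;')), "\n";   # prints "198"
print decode_entities('&unknown;'), "\n";      # prints "&unknown;"
```

Once an entity has been reduced to a character number, that number can be looked up in the character mappings described under Custom Mappings.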

History: the character conversion approach described here was added in FDSE version 2.0.0.0045. Prior to that version, the settings Accent Sensitive and English Language Searching were used to control case and accent handling, but they did a poor job of it.


    "Character conversion settings"
    http://www.xav.com/scripts/search/help/1095.html