How to use FDSE to search Arabic and Hebrew text

This help article describes how to configure FDSE to best search Arabic and Hebrew text.

Note that FDSE only officially supports Latin character sets. I am not familiar with either Arabic or Hebrew. The information presented here is based on the experiences of those who've translated the product into Arabic. Additional comments and corrections to this information are welcome. I may not be able to answer questions because I don't understand the languages or some of the finer points of right-to-left (RTL) text. - Zoltan

Follow these steps to support Arabic and Hebrew text:

  1. Make sure you are using FDSE version 2.0.0.0056 or newer. These versions include support for the %dir% variable, which controls text direction.

    If you are upgrading from an earlier version of FDSE, you may want to manually update your template files as well, since they have been updated to include support for the %dir% and %content_type% variables.

  2. Within the strings.txt file, the third line contains the content-type header, and the fourth line contains the text direction. For Arabic, these values should be:

    text/html; charset=windows-1256
    rtl

    For Hebrew, the values should be:

    text/html; charset=iso-8859-8
    rtl

  3. Arabic and Hebrew include diacritics (also known as "Arab vowels"): marks placed above or below letters, which typically represent vowel sounds or other modifiers. They are used primarily in children's, religious, and official texts, though some mainstream sites are beginning to use them. People typing a search term will generally not type the diacritics, so use the character mapping to delete them from the indexed text.

    Open the library file "search/searchmods/common.pl" and find the subroutine create_conversion_code. Here is an example of how a character is normally handled:

    240 => [ 'o', 'o', 0, 0, 'Small eth, Icelandic'],

    Replace this with:

    240 => [ '', '', '', '', 'Vowel'],

    Using the empty string '' deletes the diacritic character completely. Using the '-1' value instead would replace it with a space, which breaks the word apart and leads to incorrect results on the results page.

    For Arabic with the windows-1256 character set, I believe the vowels are character codes 240 through 250, except 244, 247, and 249.

    For more help with create_conversion_code, see Character conversion settings.
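As a rough illustration of step 3, here is a standalone Python sketch. It is not part of FDSE, which is written in Perl. It shows why deleting the vowel bytes keeps a word searchable while replacing them with spaces breaks it apart, and it prints one mapping entry per vowel code in the format shown in step 3. The names VOWEL_BYTES, strip_vowels, vowels_to_spaces, and mapping_lines are hypothetical, and the sketch assumes Python's "cp1256" codec matches the windows-1256 character set used by your pages.

```python
# Byte values the article gives for Arabic vowel marks in windows-1256:
# 240-250, skipping 244, 247 and 249.
VOWEL_BYTES = [b for b in range(240, 251) if b not in (244, 247, 249)]


def strip_vowels(raw):
    """Delete vowel-mark bytes, which is the effect of the '' mapping."""
    return bytes(b for b in raw if b not in VOWEL_BYTES)


def vowels_to_spaces(raw):
    """Replace vowel-mark bytes with spaces, the effect the article warns about."""
    return bytes(0x20 if b in VOWEL_BYTES else b for b in raw)


def mapping_lines():
    """Format one entry per vowel code in the style used in step 3."""
    return ["%d => [ '', '', '', '', 'Vowel']," % b for b in VOWEL_BYTES]


if __name__ == "__main__":
    # The word "kitab" (book) written with two vowel marks, encoded as windows-1256.
    word = "\u0643\u0650\u062a\u064e\u0627\u0628".encode("cp1256")
    print(strip_vowels(word).decode("cp1256"))      # word stays in one piece
    print(vowels_to_spaces(word).decode("cp1256"))  # word is split by spaces
    print("\n".join(mapping_lines()))
```

The printed entries are only a convenience for copying. Check each one against the existing entries in create_conversion_code before editing common.pl.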


    "How to use FDSE to search Arabic and Hebrew text"
    http://www.xav.com/scripts/search/help/1119.html