File pathname restrictions when using binary converters
When using binary converters (XPDF and Antiword), FDSE needs to shell out to the separate converter application and pass it a filename as a parameter. Problems arise if the filename to be passed contains shell metacharacters, such as the space, semicolon, ampersand, backslash or quote.
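As a rough illustration of the problem, the sketch below uses Python's standard shlex module to split two command lines into words the way a POSIX shell would. The command lines are hypothetical, and "antiword" simply stands in for any converter. Note that shlex.split only models word splitting; a real shell would also treat the bare ampersand as a control operator.

```python
import shlex

# Hypothetical command lines, for illustration only.
unquoted = 'antiword Terms & Conditions/policy.doc'
quoted = 'antiword "Terms & Conditions/policy.doc"'

# Unquoted, the single pathname falls apart into three separate words.
print(shlex.split(unquoted))
# ['antiword', 'Terms', '&', 'Conditions/policy.doc']

# Quoted, the pathname reaches the converter as one argument.
print(shlex.split(quoted))
# ['antiword', 'Terms & Conditions/policy.doc']
```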
Passing of filenames is handled differently based on the type of realm and the filename:
- web crawler discovery -- used with the default Website Realms with Crawler Discovery, and for Open Realms and File-Fed Realms
When the FDSE web crawler needs to convert a binary file, it always uses a "safe" temp file. The temp file is created specifically for the purpose of shelling to the converter application, and it uses a numeric filename. There are no file pathname restrictions in this case.
- file system discovery -- used with Website Realms with File System Discovery, and for Runtime Realms
When the FDSE file system crawler needs to convert a binary file, it usually doesn't create a temp file, since the original file is already available.
If the pathname of the binary file consists entirely of alphanumerics plus the space, colon, period, underscore, and hyphen (the "trusted charset"), then it is considered a "safe" filename. The original pathname will be passed to the converter. The pathname will be double-quoted if it contains spaces.
However, if the file pathname contains characters outside of the trusted charset, then a copy of the file will be created, using a safe temp filename. That temp file will be passed to the converter instead.
For example, if you have some Word documents in a folder named "Terms & Conditions", FDSE will recognize the ampersand as an untrusted character likely to cause problems when shelling. It will make a temp copy of every Word document in that folder while indexing.
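The file system discovery decision can be sketched roughly as follows. This is a minimal Python illustration of the rule described above, not FDSE's actual code. The names is_trusted and converter_argument, the temp_dir and counter parameters, and the ".tmp" suffix are all hypothetical. The sketch also assumes ASCII alphanumerics and allows the "/" path separator, neither of which is stated in the description of the trusted charset.

```python
import re
import shutil
from pathlib import Path

# Trusted charset as described above: alphanumerics plus space, colon,
# period, underscore, and hyphen. ASSUMPTION: ASCII only, and "/" is also
# allowed so that a full pathname can qualify.
TRUSTED = re.compile(r"[A-Za-z0-9 :._/-]+")


def is_trusted(pathname: str) -> bool:
    """Return True if every character of pathname is in the trusted charset."""
    return TRUSTED.fullmatch(pathname) is not None


def converter_argument(pathname: str, temp_dir: str, counter: int) -> str:
    """Return the string to hand to the converter for this file.

    Trusted pathnames are passed through, double-quoted if they contain a
    space. Anything else is first copied to a numeric temp filename.
    """
    if is_trusted(pathname):
        return f'"{pathname}"' if " " in pathname else pathname
    # Hypothetical numeric temp name, e.g. "17.tmp".
    safe_copy = Path(temp_dir) / f"{counter}.tmp"
    shutil.copyfile(pathname, safe_copy)
    return str(safe_copy)


print(is_trusted("docs/Annual Report 2004.doc"))         # True
print(is_trusted("docs/Terms & Conditions/policy.doc"))  # False
```

With these assumptions, "docs/Annual Report 2004.doc" would be passed to the converter in double quotes, while anything under "Terms & Conditions" would be copied to a numeric temp file first.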
The moral of this story: if you are using binary conversion with file system discovery, and your binary files contain metacharacters in their pathnames, then FDSE is going to do a bunch of extra work, making a temp copy of each such file. In that case, you should use a simpler naming convention for binary files.
History: this logic was added in FDSE version 2.0.0.0065. Prior to that version, untrusted characters in pathnames were not escaped and caused binary indexing errors.
"File pathname restrictions when using binary converters"
http://www.xav.com/scripts/search/help/1189.html