Reasons a URL may not be included in the index
Below is a list of common reasons that a document may not be indexed:
- The site's "/robots.txt" file prevents access to the document.
- The document has a "fdse-robot" or "robot" META tag with a value of "none" or "noindex".
- The document is being filtered by a Filter Rule with action "deny", "require approval", or "follow,noindex".
- The document file size is less than the "Minimum Page Size" setting.
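FDSE itself is written in Perl, but the first two checks above can be sketched in Python as a rough illustration. The "FDSE" user-agent string and the assumption that the META tag's name attribute precedes its content attribute are illustrative only, not FDSE's actual behavior:

```python
import re
import urllib.robotparser


def robots_txt_blocks(robots_lines, user_agent, url):
    """True if the site's robots.txt rules forbid fetching the URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return not rp.can_fetch(user_agent, url)


def meta_blocks_indexing(html):
    """True if a "robots" or "fdse-robot" META tag contains "none" or "noindex"."""
    # Sketch only: assumes name= comes before content=; real-world
    # attribute order varies, and a production parser should not use regex.
    pattern = re.compile(
        r'<meta\s+name=["\'](?:fdse-)?robots?["\']\s+content=["\']([^"\']*)["\']',
        re.IGNORECASE)
    for content in pattern.findall(html):
        values = {v.strip().lower() for v in content.split(",")}
        if values & {"none", "noindex"}:
            return True
    return False
```

For example, a page served under a "Disallow: /private/" rule, or one containing <meta name="robots" content="noindex,follow">, would be skipped by both checks.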
- FDSE decided that the document was a binary file, not a text-based file.
When using web crawler indexing (the default), FDSE analyzes the document's Content-Type header and requires that it match "text". Common text Content-Type headers are "text/html" and "text/plain". Binary content is served with headers such as "image/gif" and "application/msword".
When using file system indexing (only done with two realm types: Website Realms with Filesystem Discovery and Runtime Realms), FDSE uses the Perl -T file test operator to test whether each file is binary. You may use the "AllowBinaryFiles" General Setting to override this test. File system indexing is also limited to files whose extensions are listed in the "Ext" General Setting.
In certain cases, binary files will be indexed. See Searching binary files for more information.
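The two detection modes above can be sketched roughly in Python. The NUL-byte and printable-ratio heuristic below only approximates Perl's -T test, and the 30% threshold is an assumption, not FDSE's actual rule:

```python
def is_text_content_type(content_type):
    # Crawler mode: the Content-Type header must contain "text",
    # e.g. "text/html" or "text/plain"; "image/gif" fails.
    return "text" in content_type.lower()


def looks_binary(data):
    # File-system mode: a rough stand-in for Perl's -T file test.
    # Treat the file as binary if it contains NUL bytes or if too
    # many leading bytes fall outside the printable ASCII range.
    if b"\x00" in data:
        return True
    sample = data[:512]
    if not sample:
        return False
    odd = sum(b < 9 or 13 < b < 32 or b > 126 for b in sample)
    return odd / len(sample) > 0.30
```

Under this sketch, a GIF image (which begins with "GIF89a" followed by binary data) is rejected in either mode, while a plain HTML page passes both tests.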
- When indexing an entire web site using the web crawler (which is the default technique used with the "Add New Site" form), FDSE will only discover pages that are linked from the main page. Pages that are not linked, directly or indirectly, from the main page will not be indexed. Even if links are present, one of the following problems may prevent indexing:
The crawler only extracts the first 64000 bytes of large pages. Links that only appear towards the bottom of large pages will not be seen. (The size limit can be configured at Admin Page => General Settings => Max Characters: File and Max Characters: Text.)
Links must be in standard HTML. The crawler scans for "A HREF", "AREA HREF", "IFRAME", and "FRAME" tags. The crawler cannot extract links from JavaScript navigation, drop-down select menus, or forms. For more information on this, see Crawler does not find links within client-side content (such as Javascript menus).
Fully qualified links must exactly match the original Base URL. For example, if your Base URL is http://www.xav.com/scripts/, then the crawler will not extract links to http://xav.com/scripts/test/ (different literal hostname). It will not extract links to http://www.xav.com/test/ (different folder). It will not extract links to http://www.xav.com/Scripts/test/ (different case in folder name).
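The crawler's link-extraction rules above can be sketched in Python. The regex, the assumption that FRAME and IFRAME links live in SRC attributes, and the literal string-prefix Base URL check are illustrations of the behavior described, not FDSE's actual code:

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.xav.com/scripts/"  # example Base URL from the text

# Scan only A HREF, AREA HREF, IFRAME SRC, and FRAME SRC tags.
# JavaScript menus, <select> drop-downs, and forms are never scanned.
LINK_RE = re.compile(
    r'<(?:a|area)\s[^>]*href=["\']([^"\']+)["\']'
    r'|<i?frame\s[^>]*src=["\']([^"\']+)["\']',
    re.IGNORECASE)


def extract_links(html, page_url):
    links = []
    for href, src in LINK_RE.findall(html):
        url = urljoin(page_url, href or src)
        # Fully qualified links must literally start with the Base URL:
        # hostname, folder path, and letter case must all match exactly.
        if url.startswith(BASE_URL):
            links.append(url)
    return links
```

With this sketch, a relative link like "test/index.html" is kept, while "http://xav.com/scripts/test/" (different literal hostname) and "http://www.xav.com/Scripts/test/" (different case) are dropped, mirroring the examples above.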
There are many possible reasons for a URL not to be included in the index. The easiest way to see which indexing problems exist, if any, is to try to add the URL using "Admin Page" => "Add New URL". If there is a problem, a detailed error message will be returned.
"Reasons a URL may not be included in the index"
http://www.xav.com/scripts/search/help/1011.html