Current Known Problems (with latest stable version)
One reported and verified problem with Apache v2 + Windows 2000 Advanced Server whereby the FDSE crawler could not crawl pages on the local server. Root cause unknown. An easy code-level work-around is available. A web-based UI work-around may be created if this problem comes up for multiple people.
If the script encounters errors before it has a chance to load the strings file -- for example, if it has an error loading the strings.txt file itself -- then the error message will not be formatted correctly, because all error templates are stored in the strings file. Will be fixed with the new
The "max index file size" setting is not applied to file-discovery realms while they're being built, but the restriction is used on them after they're created (for deleting records, editing records, etc.).
The "Filtered Realms" realm type may be removed, since the new variable-pattern "Website Realms" do the same thing, only better.
The following features will be removed: runtime realms (possibly); various reverse-compatibility routines.
A migration tool will be provided to ease the transition when the time comes.
In a future release, all date formats may be forced to the format 2002-08-02 01:02:03. Currently many formats are allowed, including string-based ones, but the string-based formats may not localize properly for different languages.
Make it easier to restrict the public search HTML into a fixed-width table; currently requires customizing the HTML of both the header and footer template files, and is difficult to do in a WYSIWYG editor.
Investigate whether we can rename a file while it has read processes inside it. If so, update the LockFile package to take this into account and stop blocking reads while preparing for an update.
Use an alternate sorting algorithm on the @HITS array such that @HITS never has more than $RangeUpper elements. This will prevent the memory spikes and slow-downs associated with very large result sets, at the expense of some extra CPU up-front. Investigate whether having alternate algorithms would be advisable for subsequent searches where the "total matches found" value is known.
Add user account system. Allow webmaster to create/edit/delete user accounts and assign them privileges. Each lower-level user account can do things like Add/Delete/Approve pages for a certain realm, or Add/Edit a certain set of banner ads (waiting for completion of
Improved discussion forum for people seeking help on script
Addition of a suite of "check-in" tests to verify basic script functionality before an update is made available on site.
Improved upgrade system, where all static files are in one folder, and all user-edited data files are in another folder. Add option of "live update" feature where version checks and upgrades are done from the Admin Page itself.
Add a search option to return all documents in the index, rather than only those documents matching the search terms (a work-around is documented in this help file)
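The bounded-selection idea for the @HITS array mentioned above could look roughly like the following Python sketch. FDSE itself is written in Perl; the names `top_hits` and `range_upper` and the `(score, url)` tuples are illustrative assumptions, not FDSE code. The sketch keeps at most `range_upper` candidates in a heap while streaming over the matches, so memory stays bounded regardless of how many documents match.

```python
import heapq

def top_hits(scored_matches, range_upper):
    """Return the range_upper best (score, url) pairs, best first.

    The heap never holds more than range_upper items, which is the
    property the planned @HITS change is after.
    """
    if range_upper <= 0:
        return []
    heap = []  # min-heap: the weakest kept hit sits at heap[0]
    for hit in scored_matches:
        if len(heap) < range_upper:
            heapq.heappush(heap, hit)
        elif hit > heap[0]:
            # Replace the weakest kept hit with the better newcomer.
            heapq.heapreplace(heap, hit)
    return sorted(heap, reverse=True)

# Hypothetical scored matches, for illustration only.
matches = [(3, "a.html"), (9, "b.html"), (1, "c.html"), (7, "d.html")]
print(top_hits(matches, 2))
```

The trade-off matches the one described in the plan: each match costs a little extra CPU for the heap comparison, in exchange for never materializing the full result set.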
Requested Changes - under consideration, but no time budgeted yet
Currently the script always searches within the title, description, keywords, URL, and text. Some users want to make searching within some of those fields be configurable by the admin (i.e., disable the finding of hits in the URL field so that "xav" doesn't return all the hits on this site). Idea suggested that the "Multiplier: Keyword" etc. fields have a default value of 1 which indicates that they will be searched; a value of 0 will mean they are not searched; a value of 10 or something would mean they are searched and given extra weight.
Add "Search within these results" feature
Support proper Boolean syntax, like "NOT foo AND (bar OR xxx) AND xav". This level of support will be necessary for implementing the "search within these results" feature, which will work like "new query AND old query".
Allow users to back up data (for those on slow network connections without shell access, where downloading a 100mb index file isn't practical)
Add some way to analyze word frequency in index files. This feature would help admins tweak their Ignore Words lists.
Add "search strings vs documents found" graph to Usage Statistics screen.
Add "edit multiple records" interface
Re-tool the Ads page to support users with many (100+) advertisements. Allow for adding more ads at once. Allow for displaying ads on multiple pages, 1-10, 11-20, 21-30, etc.
Offer search or sort features for organizing and querying the advertisements
Give more layout control to users - i.e., showing search box above the results rather than below them (complete, need to document)
Focus on organizing pages according to the URL name space, not the "realms" paradigm.
Add better support for Affiliate programs, like linking to Altavista for deeper searches or translation. Also, linking to Amazon for books sales, etc.
Provide a stand-alone script that logs click-throughs on search results listings. This could provide a usage statistic view that shows which pages are visited the most often, and could even present this view in the line_listing.txt template, so visitors would know how popular a page was before they clicked on it.
A vocal minority wishes to remove features, rather than add them, to make the script smaller, with fewer files, and fewer options in the UI.
Allow different sets of templates for the same engine. Sort of like different skins. (Currently possible via separate languages.)
In the search results listing, show which terms matched (and how many times). Useful for multi-term searches with optional terms.
Ability to eliminate duplication of web pages across realms.
A "More pages from this site" link (ala AltaVista) in the search results. In other words, a search only returns one result from each domain. The user can then click the "More pages from this site" link to return the rest of that domain's offerings.
Add a "Rebuild All Realms" and "Review All Realms" feature
Have template email messages for Approve/Deny of URL's added by visitors.
Fully parse Microsoft Office documents (.doc/.xls/.ppt/etc)
Allow changes to Promote Sites and Forbid Sites to take effect immediately on the index files, rather than requiring a re-build of the entire index file.
Allow custom meta searches. For example, extract and make searchable fields like <meta name="translator" value="bill clinton"> <meta name="year" content="1984">
Link to #-links within a document.
Offer auto-deletion and pruning of search log. Investigate whether some type of roll-up is possible, whereby large text log disappears after X days but concise statistical information like search keyword frequency remains.
Optionally do not automatically remove pages when the crawler encounters an error when trying to re-index them; perhaps follow a 3-strikes policy.
Save error information to an error log, so that administrators can review problems that occurred while the crawler was running unattended
Option to review realm entries by domain, i.e., "xav.com", "nickname.net", "microsoft.com", with option to recrawl or delete all entries with a single click.
Single-click conversion of an isolated web page in an Open Realm into a Website Realm. Use a toolbar-style action link, similar to Edit | Crawl | Delete.
Allow for easier bulk-loading of sites to be indexed.
Allow for sorting by more criteria, such as title.
Have a "bypass Filter Rules" checkbox available for the Add New Page form in admin mode
These are feature requests that will not be implemented in the foreseeable future.
Add hidden bit to realm definition so that it will not appear in the drop-down menu.
Rejected because there is an easy HTML work-around that achieves the same thing (help topic)
Add vector-space model searches. This will speed up SQL-based instances.
Rejected because a separate product will be built to incorporate these separate algorithms. FDSE will remain a flat file search engine.
Internationalization; support for 2-byte languages like Japanese and Korean.
Rejected: xav.com has no expertise with these languages. Our dev environment (Perl 5.3 through 6.0) is not consistent in its handling of strings as Unicode.
Allow for synonym searching; ex: "explore" should match "explores", "exploring", and "explored" (configurable enable-or-disable feature)
Rejected: concept of making FDSE language-aware seems overwhelming. This feature could be implemented more easily in a SQL-based product; would be more difficult in a flat file product like FDSE.
v2.0.0.0074 - under development
v2.0.0.0073 - 2005-08-22 - stable
The admin control panel now uses a single browser window, organized with stylesheets, instead of the earlier frames-based system. The new control panel should load more quickly, and will make it easier to reload or bookmark pages within the admin control panel. The new control panel unfortunately will not render on Netscape 4 and earlier. Users must use newer browsers. (This change affects only the admin control panel; the public-facing search pages will continue to work in any browser.)
Fixed bug in which some URLs being added to a realm would be lost if one URL in the set was flagged by a "Require Approval" filter rule. See http://forum.xav.com/viewtopic.php?t=2121
Fixed bug in which script would fail with error "Undefined subroutine &main::header_print called" when search.pl could not locate the "searchdata" or "searchmods" folders. Now in those cases the script will fail with appropriate error message "unable to chdir to folder './searchdata' - No such file or directory".
The default values for General Settings "Max Characters: File" and "Max Characters: Text" have been increased from 64000 to 64000000.
Changed handling of settings "max characters: text", "max characters: title", "max characters: description", "max characters: auto description", "max characters: keywords". Each setting will now truncate the given string after the last space in the string, instead of exactly at the given number of characters. This prevents the max characters feature from cutting words in half at the string limit. Also, all strings that are truncated will now have "..." appended to the end.
Fixed bug in which a URL whose path portion contained a plus sign, like "B+C.html", would fail to be displayed via proxy.pl.
Fixed bug in which result rankings would often change based on the "Show Examples: Enable" setting. When that setting was enabled, a keyword found in the text would generate relevancy points based only on the number of times it was found in the text, and would not generate additional relevancy points based on additional occurrences in the title, keywords, description, and URL.
Now stripping content matching the patterns <\?.*?\?>. This improves search performance when doing file-based indexing of ASP and PHP files.
The number of failed pages is now displayed as a link; clicking the link will list all failed pages.
The default filter rules installed with FDSE now use word boundary matching for adult content words. Previously the word "Essex" could be flagged as adult content for matching "sex"; now an exact word match would be required.
Updated French and Portuguese translations.
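The revised "max characters" truncation described in this release can be sketched as follows. This is a minimal Python illustration, not the FDSE implementation (which is Perl); the function name `truncate_at_word` is hypothetical, and dropping the trailing space before appending "..." is an assumption about the exact output format.

```python
def truncate_at_word(text, limit):
    """Truncate text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text  # short enough: returned unchanged, no "..." added
    cut = text[:limit]
    last_space = cut.rfind(" ")
    if last_space > 0:
        # Back up to the last space so the final word is not cut in half.
        cut = cut[:last_space]
    # Assumption: every truncated string gets "..." appended.
    return cut + "..."

print(truncate_at_word("hello wonderful world", 10))
```

A string with no space inside the limit (one very long word) is still cut at the character limit in this sketch, since there is no earlier word boundary to fall back to.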
v2.0.0.0072 - 2004-04-04 - stable
Feature change: when doing an iterative rebuild for a "website realm with file system discovery", the script will now re-index all documents whose last-modified times are different than the last-modified time stored in the index. Previously the script would re-index documents whose last-modified times were more recent than the time stored in the index.
Feature change: under Manage Realms, the Delete and Edit interfaces had search interfaces which would list all records whose URL's matched a certain substring or Perl pattern. The output of each interface was limited to 10 records in most cases, no matter how many actual records matched. Each interface will now print all matching records (up to 1000000).
Fixed bug which caused FDSE to fail with "Assertion ((sv)->sv_flags" error under some versions of mod_perl.
Fixed bug in which FDSE's admin headers "Content-Type" and "Set-Cookie" would appear in the document body under some versions of mod_perl.
Updated Danish translation.
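The change to the iterative-rebuild rule in this release comes down to swapping a "newer than" comparison for a "different from" comparison. A minimal Python sketch, with hypothetical function names and plain integer timestamps as assumptions:

```python
def needs_reindex_old(stored_mtime, current_mtime):
    # Earlier behavior: only re-index when the file is newer than the index entry.
    return current_mtime > stored_mtime

def needs_reindex_new(stored_mtime, current_mtime):
    # Behavior as of 0072: re-index whenever the two times differ at all.
    return current_mtime != stored_mtime

# Example: a file restored from backup carries an OLDER timestamp
# than the one recorded in the index.
stored, restored = 1_080_000_000, 1_070_000_000
print(needs_reindex_old(stored, restored), needs_reindex_new(stored, restored))
```

The practical difference shows up when a file is replaced by a copy with an earlier last-modified time; only the newer rule picks up that change.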
v2.0.0.0071 - 2003-12-09 - stable
Updated Arabic, Dutch and Norwegian translations. A complete Danish translation was submitted. Thanks to all the contributors!
Added Cold Fusion integration library, to allow limited customization of the FDSE layout using that language (help topic)
Fixed bug in which DaysPast rebuild parameter would not accept decimal values (bug introduced in 0064).
Added documentation for the feature that skips HTML documents which include a META refresh. The behavior can now also be fine-tuned or disabled (help topic)
v2.0.0.0070 - 2003-11-09 - stable
Fixed bug in which search engine would enter an infinite loop if "Ignore Words" like "and" were included in the query, and results were found, and "substring" searching was used instead of "whole word" searching. Bug introduced in build 0064 with the new highlighting code.
Fixed related bug in which keyword-only advertisements would be shown for all queries if "Ignore Words" like "and" were included in the query, and results were found, and "substring" searching was used instead of "whole word" searching.
Fixed bug in SMTP code which prevented FDSE from routing email through the ArGoSoft email server.
Updated Lithuanian translation to 58% complete.
v2.0.0.0069 - 2003-10-10 - stable
Fixed typo in cmd_admin.pl (bug introduced in 0068 release).
v2.0.0.0068 - 2003-10-09 - stable
Removed dependencies on the "strict.pm" and "vars.pm" standard Perl libraries. This should result in fewer problems for customers and a slightly improved execution time.
Updated Italian translation to 100%.
Updated Lithuanian translation to 37%.
Updated tips for the "text:keyword" attribute search. The previous text was inaccurate/misleading:
"Finds pages that contain the specified text in any part of the page other than an image tag, link, or URL. The search text:cow9 would find all pages with the term cow9 in them"
The new text reads:
"Finds pages that contain the specified text in the body of the document. By way of comparison, searches without the "text:" attribute will scan the URL, title, links, and META tags as well as the document body."
Added features to limit automated URL submissions (help topic)
All page-specific help links now open in the _blank window; previously behavior was inconsistent.
The default "Admin Pages" filter rule now blocks access to the "/." string, to better prevent access to dot-files.
Greatly improved keyword-matching in advertisements. Keyword matching now uses the same algorithm in advertisements as in documents.
v2.0.0.0067 - 2003-09-26 - stable
Updated Turkish translation to 99%.
Added partial Lithuanian translation, 27%.
Fixed bug in which visitor-added URL submissions would automatically create a new open realm if the submission form was customized to use a Realm = "" parameter.
Fixed bug in which visitor-added URL submissions could be targeted to realms other than open realms if the submission form was customized to use the appropriate Realm = RealmName parameter.
FDSE now checks that any "Realm" parameter passed to it matches a valid realm name.
Fixed bug in which index files for realms built with "file system discovery" would become corrupted if the index build time took longer than 15 minutes, and if new files were injected into the web site during the 15 minute interval. FDSE would rebuild its cache file list every 15 minutes, and the indexing order would become offset if the number of files to be indexed grew between builds of the file list. The corruption was minor, in that the index would contain an overlap. The consequence was that search results might contain a few duplicates, and that the incremental rebuilds using "revisit old" would fail with an error complaining about the index not being in alphabetic order.
The new, improved behavior is to use the same file list for the duration of an indexing process.
Fixed HTML parsing bug in which links of the form <a href='foo.html' onMouseOver='MyImage.src="bar.gif";'> would incorrectly extract "bar.gif" as the link target. This happened because FDSE scanned for both "href"- and "src"-style attribute strings, and placed a higher precedence on double-quoted strings. The new behavior scans only for "href" attributes of a|area|base tags and only "src" attributes of frame|iframe tags.
Fixed bug in which the setting "Max Characters: Keywords" was not being used. Feature was broken in builds 0064, 0065, and 0066.
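The link-extraction rule from the HTML parsing fix in this release (only "href" on a|area|base tags, only "src" on frame|iframe tags) can be illustrated with Python's standard `html.parser` module. This is a sketch of the rule, not FDSE's Perl parser; the class name `LinkExtractor` and the helper `extract_links` are hypothetical.

```python
from html.parser import HTMLParser

# Which attribute carries a followable link, per tag (as listed in the fix).
LINK_ATTRS = {"a": "href", "area": "href", "base": "href",
              "frame": "src", "iframe": "src"}

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)  # tag names arrive lowercased
        if wanted is None:
            return
        for name, value in attrs:
            # Script attributes such as onmouseover are ignored entirely.
            if name == wanted and value:
                self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

sample = "<a href='foo.html' onMouseOver='MyImage.src=\"bar.gif\";'>x</a>"
print(extract_links(sample))
```

Because attributes are matched per tag rather than by scanning the raw text for "src"-like strings, the quoted "bar.gif" inside the event handler is never considered a link target.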
v2.0.0.0066 - 2003-08-17 - stable
Updated Greek and Italian translations.
Feature change: the keywords "and" and "or" are no longer aliased to "+" and "|". They are now treated as just normal keywords (reasons)
Fixed bug that caused FDSE to fail with "../searchmods/common_parse_page.pl has too many errors" in admin mode when run under Perl 5.004. Bug was introduced in FDSE 0064.
Fixed bug in which FDSE would not index filenames whose paths contained the plus sign, like "foo+bar.html". Bug introduced in 0061.
v2.0.0.0065 - 2003-07-30 - stable
Fixed bugs in which binary file conversion would fail if there were spaces or other metacharacters in the binary converter path or the binary file path.
Fixed performance problem in which search results took about four times longer to render.
v2.0.0.0064 - 2003-07-25 - stable
Removed support for mysql data storage. This move has been planned for two years. It reduces the FDSE code footprint by about 25%, which should improve speed, especially for admin functions.
Improved parsing of HTML entities. Now correctly handling entities that do not have a closing semicolon, like "&ltroot&gt" (which should render as "<root>").
Improved highlighting code in the descriptions and excerpts on the search results page. Previously, highlighting would not correctly handle wildcards, phrases, or words containing entities, and would not properly reflect the whole-word versus substring nature of the search. These problems are now all fixed. The improved highlighting code still needs to be ported to the proxy.pl viewer, however.
All strings from "settings_desc.txt" have been moved into "strings.txt", reducing the number of files in the distribution, and making things easier for translators. The template files admin_template.html, admin_style.inc, and admin_navbar.txt have also been eliminated.
Previously, whenever a network error occurred while crawling a file, FDSE would include a link labeled "help with network errors" when returning the error message. Now, that link will only appear for administrators. When visitors are adding their own URL, they will not see the link included with any error messages that come back.
Rewrote all code, interfaces, and help files for handling binary files. Added support for searching MS Word files with help of Antiword utility (help topic)
The User Interface => Advanced: Edit Templates pages now reference the default text of each template. This should help users who accidentally mess up their template text and who want to revert to the original.
In FDSE versions 2.0.0.0042 through 0063, the META keywords were "reduced" by splitting them all into separate words, removing duplicates, and sorting them in alphabetic order. Thus keywords "Perl script CGI script" would be mapped to "CGI Perl script". This algorithm had bugs in that removing duplicates happened before the case insensitivity rules were applied, and so keywords which originally had different case would end up existing as duplicates in the final version. Also, by re-ordering the keywords, some phrase matches would be lost (the phrase "cgi script" would no longer work in the earlier example). As of FDSE 0064, the META keywords will be accepted as is. No attempt will be made to remove duplicates.
Updated Chinese, Norwegian and Russian translations. Added new Greek translation.
Fixed bug in parse_meta_header involving extraction failures when multiple META tags were present and content|name attributes were in non-standard order.
This change breaks reverse compatibility for customers who are using mysql data storage. Such customers should review the help topic for options.
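The two problems with the old META keyword "reduction" described in this release can be reproduced with a short Python sketch. It approximates the described 0042 through 0063 behavior (split, de-duplicate, sort) and is not the original Perl code; the function names are hypothetical, and Python's default code-point sort stands in for the "alphabetic order" mentioned above.

```python
def reduce_keywords_old(keywords):
    """Approximate the old behavior: split, de-duplicate, sort."""
    unique = set(keywords.split())    # de-duplication is case-sensitive here
    return " ".join(sorted(unique))   # re-ordering can break phrases

def phrase_matches(phrase, keywords):
    """Case-insensitive phrase test against a keyword string."""
    return phrase.lower() in keywords.lower()

original = "Perl script CGI script"
reduced = reduce_keywords_old(original)
print(reduced)

# The phrase survives in the original keywords but not in the reduced form.
print(phrase_matches("cgi script", original), phrase_matches("cgi script", reduced))

# Differently-cased duplicates both survive the de-duplication step.
print(reduce_keywords_old("Script script"))
```

Accepting the keywords as-is, as FDSE 0064 does, avoids both effects at the cost of storing duplicates.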
v2.0.0.0063 - 2003-04-29 - stable
Updated Dutch, German and Russian translations.
Now calling URL Rewrite Input Filters on each URL in a redirect chain, instead of only on the first URL.
v2.0.0.0062 - 2003-04-09 - stable
Fixed bug in which the "Manage Realms" table on the main admin page would not render properly if the only realms present were of type "File-Fed".
Fixed bug in which the search engine would completely fail when the "Show Examples: Enable" setting was enabled, and when there was a runtime realm being searched. This bug had been introduced in the 0056 release from 2002-11-08. The error message associated with the failure is "Global symbol "$u" requires explicit package name".
Fixed bug in which the robots.txt file would always be parsed for website realms using file system discovery, even if the "crawler: rogue" setting was enabled.
Fixed bug in which requests for the starter file in a File-Fed realm were done with "max characters: file = 2048000" regardless of actual "max characters: file" setting. This was causing truncation and data loss for starter files larger than 2MB. The new behavior will increase "max characters: file" up to 16MB for the starter file if and only if the user's custom setting has not already been increased above that level. This new behavior will more likely read in the entire starter file, and in cases where it doesn't, the user will now be able to work around the problem by assigning a very large "max characters" setting. The default "max characters: file" setting remains 64kb for normal requests.
Fixed admin error handling detail that caused "uncontrolled exit" warnings to arise sometimes when they shouldn't.
Fixed bug in which URL hostnames were not being forced to lowercase. Bug introduced in 0061 with new uri_parse sub.
Fixed encoding bugs involving file system crawler realms and files/folders whose names contain "#", "%" or "&".
Fixed bug in which host header was not sent in some cases by proxy.pl. Would cause errors or incorrect pages to be returned when proxy.pl was used to view sites that depend on IP-less virtual hosting.
Fixed bug in which the "cmd_admin rebuild All" command would only rebuild the first realm in the set, in most cases. Bug was introduced in version 2.0.0.0056.
Changed batch size behavior for command-line rebuilds. For FDSE 2.0.0.0032 through 0053, command-line batch size was fixed at 100. For 0054 through 0061, batch size was the lesser of 100 and the "Crawler: Max Pages Per Batch" setting. In 0062, the batch size is the same for web-based rebuilds and command-line rebuilds; both use the General Setting "Crawler: Max Pages Per Batch".
Updated Arabic (99%), Dutch (99%), Finnish (13%), German (99%), Norwegian (99%), Russian (59%) translations. Added Bosnian (41%).
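The history of the command-line batch size described in this release can be summarized as a small lookup. The Python sketch below simply restates the rules from the note above; the function name and the use of the four-digit build number as an integer are assumptions for illustration.

```python
def command_line_batch_size(build, max_pages_per_batch):
    """Batch size used by command-line rebuilds for a given FDSE build number.

    max_pages_per_batch stands for the "Crawler: Max Pages Per Batch" setting.
    """
    if 32 <= build <= 53:
        return 100                              # fixed size
    if 54 <= build <= 61:
        return min(100, max_pages_per_batch)    # capped at 100
    if build >= 62:
        return max_pages_per_batch              # same as web-based rebuilds
    # Builds before 0032 are not covered by the release note.
    raise ValueError("batch size for builds before 0032 is not described")

print(command_line_batch_size(58, 500), command_line_batch_size(62, 500))
```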
v2.0.0.0061 - 2003-01-16 - stable
Updated Polish translation (from 8% to 16% complete).
Updated French translation (from 85% to 100% complete).
Fixed bug in which conditional statements within templates would sometimes not execute under Perl 5.8.
Fixed bug in which website realms which use the File System Discovery method would fail under the "Revisit Old" command if a given folder contained a subfolder named "foo" and also a file named "foo.html". The problem was due to the file-listing sort order being "foo", "foo.html" and the URL sort order being "foo.html", "foo/". This difference in sort order would cause a sort order error during incremental builds.
Fixed bug in which a link like <a href="#foo"> on page http://xav/index.html would be resolved as http://xav/#foo instead of the correct http://xav/index.html#foo.
Because there have been a variety of bugs like this in recent builds, the URL-parsing routines have all been rewritten. uri_parse and uri_merge now replace older functions clean_path, parse_url_ex, parse_url, and GetAbsoluteAddress. This change will involve some different behavior with certain URL's, and will involve some new error messages, but the new behavior should be more standards-compliant.
The upper bound on General Setting "Max Characters: URL" has been increased from 255 to 2048 characters. The 2048 is the limit for Internet Explorer. The default "Max Characters: URL" value remains 128.
For files with multiple META tags with the same name in the first 4096-byte block, FDSE will now match the first tag, rather than the last tag in the block.
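The sort-order mismatch behind the "Revisit Old" fix in this release is easy to reproduce. The Python sketch below assumes plain byte-wise string ordering (as a default string sort would give) and uses hypothetical helper names; it shows that a folder "foo" sorts before "foo.html" as a file name, but after it once the folder becomes the URL path "foo/".

```python
def listing_order(names):
    # Order of raw directory entries.
    return sorted(names)

def url_order(names, folders):
    # Same entries as URL paths: folders gain a trailing slash before sorting.
    return sorted(n + "/" if n in folders else n for n in names)

entries = ["foo", "foo.html"]
print(listing_order(entries))
print(url_order(entries, folders={"foo"}))
```

The flip happens because "." (0x2E) sorts before "/" (0x2F), so "foo.html" precedes "foo/" even though "foo" precedes "foo.html".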
v2.0.0.0060 - 2002-12-11 - stable
Fixed bug whereby the "Next" matches would continue to show results 1-10. Bug was introduced in 0058 release.
Updated Dutch translation. That translation is now at 100% (thanks Roel Mulder!)
v2.0.0.0059 - 2002-12-10 - stable
Fixed "ltr" and translation credits for Dutch translation.
v2.0.0.0058 - 2002-12-09 - stable
Added new General Settings "Use DBM Routines" and "Use Socket Routines". On platforms where DBM or sockets are not supported, these settings can be disabled. This will cause FDSE to hide any features with those dependencies. This will make it less likely that end users or administrators will experience errors.
Fixed two minor bugs with "plural match" feature.
A complete Dutch translation has been contributed.
Added "substring search" feature. Allows keyword "eat" to match "beating" and "neat". See Admin Page => User Interface => Default Substring Match to enable or disable (help topic)
For "include-by-name" searches, the names of the included realms are now logged, instead of literal "include-by-name". Multiple realm names are pipe-separated, like "My Realm 1|My Realm 2|www.xav.com".
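The difference between the new "substring search" and the existing whole-word matching can be sketched with regular expressions. This Python example is an illustration only; the function name `matches` and its `substring` flag are hypothetical and do not correspond to FDSE settings or code.

```python
import re

def matches(keyword, text, substring=False):
    """Return True if keyword is found in text.

    substring=False requires word boundaries on both sides;
    substring=True accepts the keyword anywhere inside a word.
    """
    pattern = re.escape(keyword)
    if not substring:
        pattern = r"\b" + pattern + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

for word in ("beating", "neat", "eat"):
    print(word, matches("eat", word), matches("eat", word, substring=True))
```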
v2.0.0.0057 - 2002-11-10 - stable
Fixed bug in which a huge string of debug H1 data was printed on the "Edit Realm" page. This bug was introduced in the 0056 release.
Updated Norwegian translation. This translation is now 100% (797 of 797 strings).
Improved user interface for deletion of DBM search log files.
Changed rules on inclusion of languages. Now basic download contains all 18 language packages (total product download is 1600 KB). A smaller English-only download is also offered (200 KB).
The auto-installer, on the other hand, contains checkboxes allowing the user to select whichever languages are desired. Only English is selected by default. During upgrades, the auto-installer detects any previously-installed languages and will select them as well. The end user can customize this default selection as needed.
v2.0.0.0056 - 2002-11-08 - stable
Improved security by allowing users to restrict admin access to certain IP addresses (help topic)
Changed templates to meet US government Section 508 standards on accessibility (help topic)
Added new General Setting "Multiplier: URL" to fine-tune your ranking algorithm. Works the same as existing "Multiplier: Title", "Description", and "Keywords".
Visitors can now select the interface language (help topic) Thanks to Ian Dobson for the code.
Administrator can now display a "most popular searches" list on the output pages (help topic) Thanks to Ian Dobson for the code.
Added advanced option whereby administrators can customize the pattern used for filtering URL's within a website realm. Thanks to Ian Dobson for the code.
Removed settings "Allow Filtered Realms" and "Allow Index Entire Site" and replaced with single setting "Show Advanced Commands". The new setting has the same effect as the previous two, and also controls whether the administrator will be able to customize the filter pattern on website realms. All settings are disabled by default; migration code will enable "Show Advanced Commands" if either of the older settings were enabled.
URL Rewrite Rules now support $1-style interpolation and the uc/lc functions. Thanks to Brian Renken for the code.
Updated proxy.pl tool. Proxy will now do a full redirect (without keyword highlighting) if the final document is a PDF, DOC, or XLS file, or if the document returns a non-text Content-Type, or if the document is an HTML frameset page. Corrects previous behavior where proxy.pl would display a corrupt file or a blank page.
A complete Arabic translation has been contributed. The Italian translation has been updated.
Added host-specific corrections to the DOCUMENT_ROOT variable. Correcting for netfirms.com, virtualave.net, portland.co.uk. Should reduce the number of errors when users of those hosting companies attempt to create a file-system realm.
Added "help with network errors" link to help provide direction to users who are on web hosting providers that have disabled sockets privileges.
Added "help with server-killed processes" link to help users whose CGI processes are being killed by the web hosting providers due to excessive resource usage.
Created the FDSE User's Guide. Now include a link to it from the main Admin Page during the first visit (or whenever there are no realms defined).
<label> tag so that clicking the text near a radio button or checkbox will activate it.
Improved file size presentation; now using "bytes" label or 1024-byte "KB" abbreviation. Previously was inconsistent with 1000-vs-1024, and used "kb", "k", or no units.
Added "Xpdf test" interface to help identify and fix problems that people experience when integrating Xpdf and FDSE.
All calls to dbmopen now wrapped within eval handlers to protect against DBM's tendency to die.
Corrected tips.htm help file text relating to wildcard behavior (help topic)
Fixed several XHTML compatibility bugs.
Fixed bug in which web pages that had both a "noindex,follow" META tag and which were covered by a "noindex,follow" Filter Rule would be treated as "noindex,nofollow".
General Setting "Network Timeout" was previously restricted to the 1 to 100000 range for the number of seconds. Now the range is 0 to 100000, and a setting of zero will cause FDSE to completely bypass the IO::Select calls for controlling timeouts.
Attribute searches now support whitespace between the attribute and keyword. For example, the search "title:foo" may now be entered as "title: foo". This was changed due to confusion faced by some end users and also to match the Altavista search engine upon which this feature is modeled.
Fixed bug whereby FDSE would set cookies on each request with the user's "maxhits", "Match", "p:pm" and "p:lang" values. On subsequent requests, FDSE would read in search settings from these cookies if the setting was not otherwise declared in the form. This led to confusion for users who had multiple browsers, some with cookies and some without, who would receive different search results based on the hidden cookie values stored by each browser. Problem was usually noticed by webmasters who were experimenting with different custom search forms and different default settings for language and "Match", while also testing with different browsers.
Fixed bug in which changes to the "Promote" value were being ignored on the "Edit Record" page. Bug existed in builds 0053-55.
Fixed bug involving the first entry added to the "Strings" or "Patterns" field of system filter rules "Always Allow Pages", "Forbid Sites" or "Promote Sites". The first entry would be duplicated to both the "Strings" and "Patterns" fields of all rules. This bug was particularly troublesome because "Forbid Sites" was the most common rule used, and yet having the string in both it and the "Always Allow" rule would cause the latter to override, and the string would not be forbidden (it would actually be promoted). The bug would only manifest when the "filter_rules.txt" file was missing and the system rules were stored in memory. Once any rule was saved, the rules would no longer use shared memory.
Fixed bug in which relative links were not being properly extracted from web pages whose URL's contained a slash in the query string. For example, the link "index.html" on page "http://xav.com/?foo/bar" would be incorrectly resolved as "http://xav.com/?foo/index.html".
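The relative-link bug with a slash in the query string, fixed in this release, can be illustrated by comparing a naive "cut at the last slash" approach with standards-based resolution. The Python sketch below uses the standard library's `urllib.parse.urljoin` as a stand-in for correct behavior; `naive_resolve` is a hypothetical reconstruction of the kind of logic that produces the wrong answer, not FDSE's actual code.

```python
from urllib.parse import urljoin

def naive_resolve(base, link):
    # Buggy approach: cut the base at its last slash, even if that
    # slash is inside the query string.
    return base[: base.rfind("/") + 1] + link

def resolve(base, link):
    # Standards-based resolution: the query string is not part of the path.
    return urljoin(base, link)

base = "http://xav.com/?foo/bar"
print(naive_resolve(base, "index.html"))
print(resolve(base, "index.html"))
```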
v2.0.0.0055 - 2002-07-10 - stable
Updated German translation.
No longer writing to the file "searchdata/total_pages_indexed.txt". That file had not been updated with the correct information for the last several versions, and rather than fix the feature, it has simply been removed.
Two security fixes: the "Rank" and "Match" inputs are now being error-checked to make sure they are numeric. Previously, they were accepted without validation, and it was possible to enter HTML strings in the parameters. Thanks to val2 for reporting this.
v2.0.0.0054 - 06/16/2002 - stable release
Added Latvian, Norwegian Bokmål, Slovenian and Tagalog translations. Updated Italian, Spanish, and Swedish translations.
The "request_method" variable was removed since it could only have an effective value of "POST". Now, all admin functions use the POST method, while the public search form uses the GET method. This allows users to bookmark search results and allows better keyword-tracking-by-referrer options for webmasters.
When the administrator edits a record to customize the title, description or keywords, there will now be an option to persist the changes forever, so that the customized values will remain even after re-indexing the document.
Improved regular expression for extracting last-modified dates.
Added "Sorting: Time Sensitive" feature to rank more recently-modified documents first (help topic)
Added "Logging: Enable" setting to allow enabling/disabling of the search logging feature.
Added "Delete All Log Entries" button to the Usage Statistics admin page, to allow web-based clearing of the log.
Added new realm type "Filtered Realm" (help topic)
Added feature whereby visitor-added URL's (and visitor email addresses) can be saved to "submissions.csv" file.
Corrected handling of multi-select lists in custom search forms.
Feature change: when calling the Revisit Old command on a File-Fed realm, the starter URL will always be fully re-indexed. All new URL's will be indexed and any that no longer exist on the starter page will be dropped. Previous behavior was to only re-index old URL's, ignoring the starter URL.
Feature change: the "Add New URL" form on the main page will now accept URL additions to website realms, open realms, filtered realms, and file-fed realms. Previously it only accepted URL additions to open realms. This will make it easier to add a single new URL to a website realm without needing to re-index the entire site.
Now properly calls html-decode on links which use "foo&amp;bar" syntax.
All output is XHTML Transitional. Previously used HTML 4.
Fixed major bug in which the error "seeking to position X but file is size Y" would appear about once per 20 iterations of the BuildIndex process.
Fixed the "remove-from-work-queue" feature. Now if the script suffers multiple timeouts when requesting a URL, it will drop it from the list rather than continuously trying to index it.
Fixed bug whereby email notification of visitor-added URL's would only work via an SMTP Server, and not via the alternate Sendmail Program.
Fixed bug in URL formatting; previously would treat "http://xav.com?foo" or "http://xav.com#foo" as errors. Now properly inserts the "/" delimiter before the "?foo" or "#foo" portion of the URL, yielding "http://xav.com/?foo" or "http://xav.com/#foo".
Fixed bugs with running under the Perl -T taint switch.
Fixed bug in which "base URL" was determined incorrectly when extracting links from an HTML document when the document is affected by a "no update on redirect" rule.
Fixed bug in which "index,nofollow" Filter Rules were not working.
Fixed bug in which realm rebuild operation would loop infinitely if the realm contained only one URL and that URL was affected by a "require approval" Filter Rule.
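The URL-formatting fix listed in this release (inserting the "/" delimiter before a "?foo" or "#foo" portion when the URL has no path) can be illustrated with a short sketch. This is Python written for illustration under that one assumption, not FDSE's own code; add_root_slash is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def add_root_slash(url: str) -> str:
    """Give a URL the path "/" when it has a host but an empty path."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


for example in ("http://xav.com?foo", "http://xav.com#foo"):
    print(example, "->", add_root_slash(example))
```

URLs that already have a path, such as "http://xav.com/a?b", pass through unchanged.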
v2.0.0.0053 - 11/26/2001 - stable release
Added Turkish translation. Updated German and Italian translations.
Added "index,nofollow" and "follow,noindex" filter rules. Improved documentation for all filter rules and added context-based help links to the "create/update filter rule" page.
Fixed long-standing bug in which document sizes were recorded only up to 999,999 bytes. Changed size display in the output so that all files are shown as "size k" rather than having some show "1200 bytes" and others show "5 kb". Now using setting "User Interface -> Number Format" to render sizes greater than 1000.
Fixed bug in which the Timeout variable was not being used when rebuilding realms with the crawler.
Fixed bug in which the error cache was being cleared with each iteration during a multi-page rebuild, instead of only once at the beginning of the rebuild.
Fixed bug in which advertisements whose HTML contained "<form>" tags would prevent the advertisement system interface from accepting any additional data.
Fixed bug in which the "impressions per day" count for advertisements was being incorrectly calculated as "impressions per second" and thus was showing "0.000" in most cases.
Fixed bug in new plural matching feature; "family" and "families" now equivalent.
Fixed bug where General Setting "Crawler: Minimum Whitespace Ratio" could be assigned to "0.0" but not "0". Now accepts any string of the format: \d+\.?\d*
The optional "maxhits" form element, when present in searchform.htm, now properly defaults to the "Hits Per Page" User Interface Setting.
User settings for "p:pm", "Match", and "maxhits" are now persisted with cookies (experimental feature)
Started on XHTML compatibility project. Currently about 30% of HTML is XHTML-compatible.
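The number format quoted for the "Crawler: Minimum Whitespace Ratio" fix in this release, \d+\.?\d*, can be checked with a short sketch. This is illustrative Python, not FDSE's own code; it assumes the pattern must match the whole input and only ASCII digits, and is_valid_ratio is a hypothetical name.

```python
import re

# Pattern from the changelog entry; re.ASCII limits \d to the digits 0-9.
RATIO_PATTERN = re.compile(r"\d+\.?\d*", re.ASCII)


def is_valid_ratio(text: str) -> bool:
    """Return True when the whole string has the form \\d+\\.?\\d*."""
    return RATIO_PATTERN.fullmatch(text) is not None


for candidate in ("0", "0.0", "0.25", ".5", "abc"):
    print(repr(candidate), is_valid_ratio(candidate))
```

Under this pattern "0" and "0.0" are both accepted, while a value with no leading digit, such as ".5", is rejected.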
v2.0.0.0052 - 11/12/2001 - stable release
Added new security settings to the proxy.pl/proxy.cgi utility. After each install or upgrade, users will have to customize the proxy.pl tool and set $SECURITY_ENABLE = 1 (help topic)
Added support for MP3 file format
Added support for inputting extra search terms from secondary form fields (help topic)
Added support for alarm() on Unix systems. Will time out network operations which hang for more than 15 seconds.
Removed "quot nbsp lt gt amp" from the default "ignore words" list, since these strings are now automatically stripped by a separate entity-stripping subroutine.
The error cache is now cleared before rebuilding a realm. Should prevent scenario where URL's are blacklisted forever due to any single failure.
Added "approximate plural forms" feature for Advanced Searches (help topic)
Notification emails for visitor-added web pages will now include all name-value pairs submitted by the visitor, making it easy for FDSE administrators to request additional info with their submissions.
FDSE "language packs" can now be installed from the "Admin Page" => "User Interface" page. As new languages are made available, they will most likely be available as optional add-ons from that page, rather than being installed by default.
Updated Portuguese and Italian translations; added new Serbian and Finnish translations that can be installed from the "User Interface" page.
Streamlined display of realms in admin interface for better display on smaller screens
Replaced all instances of NOBR with NOWRAP
Fixed bug with how UpdateIndex deals with files that are excluded based on their contents (i.e., due to containing a redirect, being too small, having too little whitespace, or being excluded by a Document-Text filter rule). This was causing the "Revisit Old" command to index files it shouldn't, or to complain about a failure of alphasorting.
Fixed bug whereby the underscore character was not being converted to a blank space in the index files, but it was converted to space in search terms. Would cause a search for "mx_lookup" to not return the URL "xav.com/mx_lookup.pl". This bug was introduced in the 0045 release when the case- and accent-sensitivity handling was redesigned.
Fixed bug involving the use of proxy.pl to display redirected web pages; previously BASE HREF was being set to the initial URL instead of the final URL
v2.0.0.0051 - 09/26/2001 - stable release
Initializing all global variables and clearing %INC for better compatibility with Apache mod_perl
Fixed bug in which Netscape 4.x users were not able to edit the "General Settings" which appeared on the Manage Realms or User Interface pages. This bug was due to an empty space in the URL action. Those users would see the Netscape error "The parameter is incorrect" when trying to edit. This bug had been introduced in 0049.
Removed workaround-support for versions of Perl which lack the crypt function
Setting General Setting "Timeout" to zero now means "don't use any timeout"
Improved error handling in proxy.pl and added the %mapsdata structure (see source of that program)
Improved parse_meta_header; it can now extract tags written in the form META content= NAME= in addition to the earlier META NAME= content= form.
Updated help files and install docs to consistently use the .pl file extension
Fixed bugs in robots exclusion rules. FDSE now supports multiple User-Agent headers preceding a single set of Disallow: directives, as per the extension to the robots.txt standard. It now properly responds to the first matching ruleset for which there is a User-Agent substring match; if there is no substring match, it responds to the User-Agent: * ruleset. Previously it would incorrectly join all Disallow paths for any User-Agent which matched and also for *. This made it impossible to forbid all robots while allowing FDSE with a set like "User-Agent: * Disallow: / User-Agent: fdse Disallow: /cgi-bin/", since FDSE would stay away from both the / and the /cgi-bin/ paths instead of only the /cgi-bin/ path.
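The ruleset selection described in the robots.txt entry above can be sketched as follows. This is illustrative Python, not FDSE's own code. It assumes that a "substring match" means the User-Agent token from robots.txt appears inside the crawler's name, and it handles only User-Agent and Disallow lines; parse_groups and disallowed_paths are hypothetical names.

```python
def parse_groups(robots_txt):
    """Split robots.txt into (agents, disallow_paths) groups, in file order.

    Consecutive User-Agent lines share the Disallow lines that follow them.
    """
    groups = []
    agents, paths, seen_rule = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (s.strip() for s in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-Agent after rules starts a new group
                groups.append((agents, paths))
                agents, paths, seen_rule = [], [], False
            agents.append(value.lower())
        elif field == "disallow" and agents:
            seen_rule = True
            if value:  # an empty Disallow forbids nothing
                paths.append(value)
    if agents:
        groups.append((agents, paths))
    return groups


def disallowed_paths(robots_txt, crawler_name):
    """Return the Disallow paths of the single ruleset that applies."""
    groups = parse_groups(robots_txt)
    name = crawler_name.lower()
    # First ruleset with a substring match on the crawler name wins.
    for agents, paths in groups:
        if any(a and a != "*" and a in name for a in agents):
            return paths
    # Otherwise fall back to the "*" ruleset, if there is one.
    for agents, paths in groups:
        if "*" in agents:
            return paths
    return []


ROBOTS = "User-Agent: *\nDisallow: /\nUser-Agent: fdse\nDisallow: /cgi-bin/\n"
print(disallowed_paths(ROBOTS, "FDSE"))      # only the fdse ruleset applies
print(disallowed_paths(ROBOTS, "otherbot"))  # falls back to the * ruleset
```

Because exactly one ruleset is returned, the paths for "*" are never merged into those of a ruleset that names the crawler, which is the behavior the fix describes.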
v2.0.0.0050 - 09/04/2001 - stable release
Updated Swedish, Spanish, and German translations
Added "record_realm" replacement value for use in the line_listing.txt template (help topic)
Added "proxy.pl" utility script
Fixed bug whereby each search of a runtime realm would create a temp file that was not being deleted.
Fixed bug whereby renaming a realm was not reflected in realm-specific Filter Rules, causing those rules to no longer apply
Click here for changes to version 2.0.0.0049 and older.