v2.0.0.0049 - 08/24/2001 - stable release
Updated Dutch translation
Added Swedish translation
Added Romanian translation
Corrected handling of META tags in XHTML documents (i.e. those with <meta name="x" value="y"/>). FDSE now correctly handles XHTML documents in all respects.
Improved the admin user interface for reviewing and editing system settings
Users can now set "Sendmail Program" to "none", instead of being forced to choose from the list
Fixed bug in SetDefaults that was hanging on default values which contained metacharacters
Fixed bug in Usage Statistics => List, where sorting by date-time would do an alpha sort on the date-time string, rather than sorting the dates themselves
Changed default setting: "Ignore Words" no longer includes integers 0, 1, 2, and 5.
Changed default setting: "Crawler: Follow Query Strings" from false to true (0 to 1)
Changed default setting: "EXT" now includes the "shtm" extension
Note: with this release, I evaluated FDSE code against the upcoming 9-to-10 digit extension in the time() return value that will occur on Sept 08, 2001. FDSE uses primarily variable-length records and so this should not be a problem. All code that does use fixed-length records has already been using 10-digit time values from the beginning, and just pads a leading zero for pre-2001-09-08 dates. This problem should not affect FDSE.
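To illustrate the Usage Statistics sorting fix listed for this release, here is a minimal sketch of the difference between an alpha sort and a true date sort. It is written in Python rather than FDSE's Perl, and the sample timestamps and the "MM/DD/YYYY HH:MM" layout are assumptions chosen for illustration, not values taken from FDSE.

```python
from datetime import datetime

# Hypothetical log timestamps in "MM/DD/YYYY HH:MM" form.
stamps = ["12/20/2000 09:15", "01/26/2001 14:20", "03/11/2000 08:00"]

# An alpha sort compares characters left to right, so the month
# decides the order and the year is effectively ignored.
alpha_sorted = sorted(stamps)


def parse(stamp):
    """Turn a timestamp string into a datetime so it sorts chronologically."""
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M")


# A date sort parses each string first and compares the resulting dates.
date_sorted = sorted(stamps, key=parse)

print(alpha_sorted)
print(date_sorted)
```

In the alpha-sorted list the January 2001 entry comes first even though it is the most recent; the parsed sort places it last.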
v2.0.0.0048 - 07/26/2001 - stable release
Optimized handling of "Approve Pages" for large data sets
Added custom error message for http:// pages that redirect to https:// or ftp://
Improved error handling in cmd_admin.pl "rebuild" function
Fixed bug in SetDefaults where INPUT tags were required to be in upper case
v2.0.0.0047 - 07/24/2001 - stable release
Fixed serious data corruption bug (introduced in 0046) -- crawler was catching an extra line break in the Content-Length header and inserting it into the database
Fixed %terms% syntax in default "header.htm" template
Exposed the bolding of search terms in the document description as CSS class "B.hl1"
v2.0.0.0046 - 07/19/2001 - stable release
Added support for PDF files (help topic)
Added support for conditional statements in FDSE templates (help topic)
Standardized support for %realm% and %terms% variables in all FDSE templates
v2.0.0.0045 - 07/10/2001 - stable release
Updated German translation
Updated handling of case sensitivity and accent sensitivity (help topic)
Stripping certain HTML tags down to null; thus, <BIG>T</BIG>he will now be properly mapped to "the", not "t he"
No longer extracting a href links that have been commented out
Added %relevance% replacement value for use in line_listing.txt template
Fixed bug involving deletion of index files when running under Perl's taint mode
All versions of FDSE now ship with the -w switch enabled. With this change, scripts transferred to Unix servers in binary mode will continue to function rather than return internal server error.
The default path-to-perl has reverted back to /usr/bin/perl, rather than /usr/local/bin/perl as in releases 0036 through 0044. This reflects user feedback on the most common location of the Perl interpreter. The auto-installer will continue to test both locations. Either location has always worked for the majority of web servers
There have been several updates to the help files and translation system.
v2.0.0.0044 - 06/05/2001
Updated German translation
Now shipping the translate.pl tool with the product, making it easier for users to customize their translations (help topic)
Feature change: no longer doing dynamic lookup of SMTP server. Also, users must select from a set of pre-approved sendmail programs; setting a custom sendmail program is now more difficult (help topic)
Feature change: added support for robots exclusion comment tag in addition to FDSE:ROBOT HTML tag (help topic)
Fixed bugs involving searching URLs with embedded spaces, like 'foo bar.html'
Fixed bug whereby last-modified dates beyond 2038 would cause the indexing process to go into an infinite loop
Fixed bug where permissions were not being set on files "auth_tokens.txt", "realm.pagecount", and "total_pages_indexed.txt".
Fixed bug where "//" was being mapped to "/" in the entire URL path; now the mapping doesn't extend into the query string.
Fixed bug where rebuild of local file indexes would repeatedly refresh without doing any actual work.
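The "//" to "/" fix in this release can be pictured with a short sketch. It is written in Python, not FDSE's Perl, and the function name and sample URL are hypothetical. It also ignores "#fragment" parts for brevity. The idea is to split off the query string first and collapse slashes only in what remains.

```python
import re


def collapse_path_slashes(url):
    """Collapse repeated slashes in the path, leaving scheme and query alone."""
    # Everything after the first "?" is the query string; do not touch it.
    head, sep, query = url.partition("?")
    # Keep a leading "scheme://" prefix intact.
    match = re.match(r"^([a-zA-Z][a-zA-Z0-9+.-]*://)(.*)$", head)
    if match:
        prefix, rest = match.group(1), match.group(2)
    else:
        prefix, rest = "", head
    rest = re.sub(r"/{2,}", "/", rest)
    return prefix + rest + sep + query


print(collapse_path_slashes("http://example.com/a//b/?next=http://example.org//x"))
```

The embedded URL in the query string keeps its double slashes, while the path "/a//b/" becomes "/a/b/".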
v2.0.0.0043 - 04/19/2001
Fixed two bugs in BuildIndex which were causing it to "forget" two files each time it did a META-refresh while indexing realms that use "file-system-discovery"
Fixed caching system -- the list of files waiting to be indexed is now properly persisted between META-refreshes of BuildIndex, rather than being re-generated with each process. This should speed up the index process quite a bit.
Fixed bug in &realm_interact() which caused duplication of records when updating realms that use "file-system-discovery"
Updated French and German versions
v2.0.0.0042 - 04/13/2001
Added support for the "last-modified" META header (help topic)
Added support for searching by last modified time and last indexed time
Added support for "include-by-name" value for Realm (help topic)
Added support for %realm% replacement value (help topic)
Added support for randomizing search result order for results with equal relevance
Fixed bug with "keywords" field on the EditRecord screen
Fixed bug involving javascript "check all" and "clear all" links on "Add Page" form
Fixed bug where runtime realms could not be searched when SQL was enabled
Fixed bug where deleting or renaming a Filter Rule resulted in the error "Not a SCALAR reference at searchmods/common_admin.pl line 10282" (bug introduced in 0041)
Fixed typo in cmd_admin.pl
Suppressed spurious warning messages in hacksubs.pl and cleaned up documentation of setperms scripts
v2.0.0.0041 - 03/26/2001
Added validation to user-defined regular expressions in DeleteRecord.
Added support for Perl's taint mode (the -T switch)
Added quotemeta() validation to "Ignore Words". Previously adding "(foo}" to the "Ignore Words" array would cause a script error.
Updated French translation
Fixed bug in "cmd_admin.pl" shell admin script; added documentation for that script.
Fixed bug where error message 'unable to read from file' would appear instead of 'functionality disabled in demo'.
Fixed bug where adding a multiple-URL pattern to the 'Forbid Sites' rule would treat it as an exact string rather than a Perl regex (bug introduced in 0038)
Custom Filter Rules are now disabled when toggling to Freeware mode
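The quotemeta() entry above describes escaping user-supplied words before they are placed in a regular expression. The sketch below shows the same idea in Python, where re.escape plays the role of Perl's quotemeta. The function name and sample words are hypothetical, and this is not FDSE's actual "Ignore Words" logic.

```python
import re


def strip_ignored(text, ignore_words):
    """Remove whole-word occurrences of ignore_words from text."""
    if not ignore_words:
        return " ".join(text.split())
    # Escape each word so characters like "(" or "}" are matched literally.
    escaped = "|".join(re.escape(word) for word in ignore_words)
    # Match only when the word is delimited by whitespace or string ends.
    pattern = re.compile(r"(?<!\S)(?:" + escaped + r")(?!\S)", re.IGNORECASE)
    return " ".join(pattern.sub(" ", text).split())


print(strip_ignored("the (foo} and bar", ["the", "and", "(foo}"]))
```

Without the escaping step, a word such as "(foo}" is itself an invalid pattern and compiling it raises an error, which is the kind of script failure the entry describes.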
v2.0.0.0040 - 03/22/2001 - stable release
Fixed bug where registration key wouldn't work when customer email address contained a hyphen '-'
v2.0.0.0039 - 03/21/2001 - stable release
Fixed bug involving the view/edit of default Filter Rules like "Forbid Pages"
Fixed bug where Edit URL was not validating new 'size' and 'promote' values -- led to data corruption when non-integer values entered.
Added additional security checks against ?{} code-executing regular expressions in Filter Rules and system settings.
Added version-checking to library files
Updated French translation
v2.0.0.0038 - 03/21/2001 - stable release
Extended the User License Agreement. Users may now choose from "Freeware", "Trial Shareware" (default), or "Registered Shareware" license modes. The Freeware mode offers a subset of the FDSE feature set.
Now using named variables like $s1 and $s2 in strings.txt, rather than %s. This allows the order of variables to be switched when translating to other languages.
Added feature: no frames admin view. Works around a problem at netfirms.com where inserted advertisement header prevents framesets from working.
Added new type of Filter Rule action, "No Update on Redirect", which will help in indexing affiliate sites.
Filter Rules may now be assigned to specific Realms or groups of Realms.
Filter Rules section has been completely translated to non-English languages.
Added support for FDSE proprietary META tags: FDSE-keywords, FDSE-description, FDSE-robots.
Added support for proprietary "fdse-index-as" META tag.
Optimized crawler code to reduce memory usage.
Added several new sections to the help file.
Fixed bug where adding a recently-deleted URL to the Forbid Sites rule would fail if the URL contained meta-characters like ? or *.
Fixed bug where entire Filter Rules would be corrupted when adding a recently-deleted URL to Forbid Sites when that URL contained an invalid regex string like ?*.
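The switch from %s to named variables such as $s1 and $s2 matters because translated sentences often need their arguments in a different order. Python's string.Template happens to use the same "$name" style, so it makes a convenient illustration. The two message strings below are invented for this sketch and are not taken from strings.txt.

```python
from string import Template


def fill(template_text, **values):
    """Substitute named placeholders such as $s1 and $s2."""
    return Template(template_text).substitute(values)


# Hypothetical strings: the second uses the placeholders in reverse order,
# as a translation into another language might require.
original = "Found $s1 documents in realm $s2"
swapped = "Realm $s2 holds $s1 matching documents"

print(fill(original, s1=12, s2="docs"))
print(fill(swapped, s1=12, s2="docs"))
```

With positional %s placeholders the calling code fixes the argument order, so the second sentence could not be written without changing the code.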
v2.0.0.0037 - 03/03/2001
Added feature: the realm name can be used as a keyword for purposes of selecting banner ads. The banner ad must have keyword "realm:foo" to appear when the user is searching realm "foo".
Added feature: banner ads can be configured to only appear when there is a keyword match.
Upgraded format of 'ads.xml' file to avoid data corruption when keywords strings include HTML -- previously a keyword like "<foo!>" would nuke the file.
Updated help file to cover changes to advertising system.
Added error checking to 'print' statements. This protects from cases when a disk quota is exceeded while writing to a file. Previously would risk losing user data.
Added feature allowing users to delete the physical files used for the index when deleting a realm. Previously would always save files (contributing to cases where disk quotas would be exceeded...)
Fixed several bugs in SQL version where database handle was not being disconnected.
Fixed bug in SQL version where editing a record would cause the record to be "lost".
Updated help file to cover using SQL on Windows 2000
Fixed bug where Edit Record feature would not work when renaming a base URL which contained meta-characters like "?"
Crawler will now parse links from HTML files which have the META robots "noindex" value. Previously treated "noindex" as "none".
Added support for "bypass_file_locking.txt" override file. Useful for old Sun machines that don't support shared read locks on files (problem symptom: all requests result in error "could not get shared read lock on file 'x'").
Fixed French version - earlier contributions to 0036 didn't all get added to the 0036 release. They're in this one.
v2.0.0.0036 - 02/22/2001
The French translation was human-updated! Many thanks to the man who did it.
Added new Personal Setting "Admin Notify: Sendmail Server". This accepts the path to sendmail, i.e. /usr/sbin/sendmail. By defining a command-line sendmail program, the script will by-pass the socket sendmail logic. This is useful when the web host does not offer sockets privileges for mail.
Added new General Setting "Use Standard IO" which is enabled by default. Under some circumstances, NT 4.0 will not function properly with standard I/O and it must be bypassed, and this setting will allow affected users to do so.
The initial request to the starter document in a File-Fed Realm now uses a temporary "Max Characters: File" setting of 2048000, rather than the default 64000. Fixes bug where not all URL's on the initial page were being recognized.
Stopped using Perl 'package' keyword. Fixes namespace bug on some systems which resulted in "Undefined subroutine &main::Trim called at..." error.
Now shipping a default "filter_rules.txt" that has some useful rules (disabled by default) to get people started, including Adult Content Filtering and limiting Spider Levels.
The default path to Perl in the download version has been changed from "/usr/bin/perl" to "/usr/local/bin/perl" to reflect the prevailing standard location.
Fixed bug where 'Minimum Whitespace' rule was being applied to binary files in local realms (and causing them to be rejected).
Fixed bug where renaming a Filter Rule would duplicate it.
Fixed bug where editing the 'url' field in a record would cause the record to be deleted.
Fixed about 10 bugs where $err_msg was being overwritten before the error text could be displayed to the user.
v2.0.0.0035 - 02/13/2001
New feature: allow "terms" and "q" as aliases for "Terms" in HTML FORM syntax
Improved command-line syntax: "searchmods/powerusr/cmd_admin.pl"
Fixed bug from 0034 where it was not possible to create a new Filter Rule
Fixed bug from 0034 where the text "Assertion Error" appeared when PICS filtering was enabled
v2.0.0.0034 - 02/08/2001
Added "searchmods/cmdline.pl" tool for rebuilding realms from the command line (experimental)
Fixed bug where it was not possible to edit URL's with metacharacters like '?' or '&'.
Fixed bug where it was not possible to link to URL's from the AdminVersion when the URL contained the '&' character.
v2.0.0.0033 - 02/07/2001
Added professional Dutch translation (thanks Richard van Rucphen!!)
Function parse_meta_header now looks in the first 4096 bytes of a file, rather than the first 1024 bytes, when searching for a meta header (includes meta robots, refresh, description, keywords)
Introduced 2 new templates, linkline1.txt and linkline2.txt. They hold the links "Search Tips - Add New URL - Main Page" or "Search Tips - Main Page" that appear at the bottom of each public search page. With the introduction of these templates, the settings "Main Page Link" and "FontTag" and "FontClose" were no longer needed, and have been removed from the %Rules hash.
The setting "strTarget" has been removed from the %Rules hash. Previously it had held the TARGET=frame attribute for the search results list. Now that the line_listing.txt template can be edited directly in the UI, this separate rule isn't needed. Also it had a default value that caused problems for some users.
Made the "session expired" time a configurable setting (Admin Page => Personal Settings => Security Settings)
Now stripping STYLE and SCRIPT blocks earlier in the parse HTML routine. Fixed bug where a href links appearing in Javascript would be followed (usually resulting in a 404 since the links contained Javascript code rather than literal paths).
Fixed bug where a variable was declared in a conditional clause. Causes a fatal error on Perl 5.003 and other early versions.
Fixed bug where indexing a website like "http://www.xav.com/***********#@$#@!$#@#$!#@!$#!$#@%$%&^%&^%/" would cause regex errors. Now more careful about using quotemeta on inputted values.
Fixed bug where indexing a website whose starting document contains a query string would fail for all documents, i.e., "http://www.dm.net/~deb/cgi-bin/contents.cgi?toc=fdse"
Fixed bug where "filter_rules.pl" was not being included in the public search script. Would cause a failure when searching included a runtime realm.
Cleaned up Edit/Create realm interface. Fixed some wording errors. Fixed bug where "rename" operation would duplicate the realm instead. Fixed bug where runtime realms were being saved as local website realms.
v2.0.0.0032 - 01/27/2001
No longer use LCASE in SQL statement when viewing most common search terms. This was causing an error on early versions of mysql where LCASE wasn't available.
Optimized module-inclusion logic. Now "filter_rules.pl", "common_admin.pl", and "crawler.pl" are only loaded for admin requests. In the process, added new module "common_admin.pl" and removed "search_ads.pl"
Optimized strings.txt parsing - now only first 50 lines are loaded for non-admin requests
Added "English Language Searching" setting. When disabled, all filtering of non-word characters is suppressed, and searches of two-byte languages may work better.
v2.0.0.0031 - 01/26/2001
Updated copyright to 2001 - Happy New Year everyone!
New HTML format for the admin user interface
Translated (programmatically) the script into: German, Spanish, French, Italian, Portuguese. This feature should be considered under development until the translator files have been updated by a human.
Now able to edit HTML template files from the admin interface
Search results are no longer returned within "BLOCKQUOTE" tags.
Now allow advertisements to be targeted to specific positions among the 1-4 defined positions
Added "Admin Notify" feature. At the admin's option, an email will be sent to the admin whenever somebody adds a new URL.
Added "AllowAnonAdd: Require User Email" feature. At the admin's option, visitors will be required to include a valid email address when they add their URL to the index.
Added new type of realm: a "website realm" is one tied to a particular site. The rebuild feature for this realm will re-index the entire site. This is similar to local realms, without the requirement that the site be hosted on the same server as the search engine
Added another new type of realm: a "file-fed realm" is one that consists of all web pages linked from a certain starter HTML document (all links are followed just 1 level deep)
Added new feature: FDSE:ROBOT HTML tag. Allows a web author to apply robots exclusion rules to arbitrary sections of the page, rather than applying them all-or-nothing to the whole document
Realms are now created on-the-fly as needed, in response to the "Add Web Page" commands of the administrator.
Improved logic for handling visitor-added web pages that can't be crawled
Fixed bug whereby binary files were being parsed for content in local realms, rather than having just their titles and URL's indexed.
Fixed bug whereby pages added by anonymous visitors whose HTML title or description contains "||" would be stuck in the "Require Approval" process indefinitely.
Fixed bug whereby "Show Examples" feature would include excerpts that didn't include the search term, when the search terms included a wildcard, and when more than 3 examples were displayed
Improved logic for ignoring FrontPage and FDSE admin folders (those whose URL includes "_vti_", "_private", "searchmods", or "searchdata"). Previously had special-case code inside &parse_html_ex and &GetFilesByDirEx. This code didn't catch all the cases, and it was pretty inaccessible for the average user. Migrated the logic into a system FilterRule named "Admin Pages". This rule will apply to all cases and can be easily managed by the admin.
Fixed bug whereby robots.txt wasn't being parsed correctly for local realms. Also now properly handle Disallow: directives with null values.
Fixed bug whereby last modified time was set to the current time, rather than the Last Modified HTTP header response
Removed "UnixTime" as a displayed field in the search log
Used a tighter display of the date in logging display - "03/11/2000 14:20" instead of "March 11, 2000 2:20:23 PM"
Updated manual install instructions to take into account new folder structure
Changed auto-installer so that it does not strip code comments; now Auto-Installed scripts are the same as Manually Installed versions.
Auto-installer no longer resets the password.
SQL support should be considered "under development". At present it works for most users, but is slower than using text files.
v2.0.0.0030 - 12/20/2000 - stable release
Fixed logging bug whereby attempting to sort by column heading would display the list of log options, rather than sort by column heading.
Added $host variable as another option to use when customizing the "line_listing.txt" template.
SQL support should be considered "under development". At present it works for most users, but is slower than using text files.
v2.0.0.0029 - 12/11/2000
Improved logging feature. Now admins can view most popular search terms in addition to the linear list of searches. Logs are stored in an industry-standard format for easy analysis (either a CSV text file or a SQL table). Text CSV log format now allows interoperability with "logresolve" for converting IP addresses into hostnames.
Fixed bug involving $Redirector variable; previously, when following a link to a page that an Anon User had just added, or when following a failed link on the list of queued pages, the $Redirector variable would be uninitialized and a 404 error would result. This has been fixed for all general cases.
Removed limit of 500 on "Crawler: Max Pages Per Batch" setting; now can be any value.
Improved the SSI parsing functionality of sub PrintTemplate. Extensively documented the new functionality in the help file under heading "Customizing HTML".
Added hierarchical navigation links across the top of the admin page. For clarity, removed all decorative underlining on items that were not links.
Added two new templates: "admin_header.txt" and "admin_footer.txt". These allow you to customize your front-search differently than your admin pages. By default, however, the admin pages will be identical to the current "header.htm" and "footer.htm" templates.
Corrected instance where default "30" was hardcoded instead of using dynamic $Rules{'crawler: days til refresh'} value.
v2.0.0.0028 - 11/21/2000
SQL support should be considered "under development". At present it works for most users, but is slower than using text files.
Fixed bug in attribute searches for "link:term" or "text:term". Was failing in SQL version.
Fixed handling of invalid regular expressions in Filter Rules. Previously, users would define patterns using expressions like "*.xav.com" which are forbidden in Perl - Perl requires ".*.xav.com". These invalid regular expressions were causing the entire script to fail when they were executed. Now, all patterns are tested for validity when the rule is created, and users receive warnings if their pattern is invalid. Also, when existing invalid rules are executed, they will now fail gracefully with a clear error message rather than crashing the entire script.
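Testing a pattern for validity when a rule is created can be sketched as a trial compile that converts any failure into an error message. The example is in Python, whose regex engine is similar to but not identical with Perl's. The function name is hypothetical, and returning an empty string on success is only meant to echo the "$err_msg" convention mentioned elsewhere in this changelog.

```python
import re


def validate_pattern(pattern):
    """Return an error message for an invalid pattern, or "" if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"Invalid pattern '{pattern}': {exc}"
    return ""


for candidate in ["*.xav.com", ".*.xav.com"]:
    problem = validate_pattern(candidate)
    print(candidate, "->", problem or "ok")
```

The leading "*" in the first pattern has nothing to repeat, so it is rejected at creation time instead of failing later when the rule runs.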
v2.0.0.0027 - 11/20/2000
Fixed typo that was causing string "Add URL" to show up as "$lang_strings[4]" in the search engine output.
Fixed bug in search syntax - keyword "or" was being treated as "and" when default match was set to "Match All Terms".
Fixed bug that was preventing administrators from approving visitor-added URL's in the SQL version.
Improved structure of the process_queued_pages function; now, if no visitor-added URL's are approved, or if there is an error during the approval process, the data file holding pages waiting for approval will not be touched. Previously data would be deleted before the system was guaranteed that the approval process succeeded.
Changed array declaration in sub Capitalize; previous non-standard Perl syntax may have been responsible for at least one reported failure.
v2.0.0.0026 - 11/17/2000
SQL support should be considered "under development". At present it works for most users, but is slower than using text files.
Fixed SQL create table syntax to work on more versions of mysql
Added existence checking for DBD::mysql module
Added $realms->use_database() function to the @Export array (was causing crashes under some versions of Perl)
SQL table names are now configurable settings, allowing users to run multiple parallel instances from a single database.
Fixed performance problem that arose in build 0018 with the "Show Examples" feature. Was causing searches to take up to 50% longer than on earlier versions. Now performance is comparable to pre-build 0018 when the "Show Examples" feature is disabled.
Took first steps towards extracting all readable strings from the code and placing them in separate files for easy translation. This feature is still under development.
Standardized nearly all error-handling code; most non-UI functions now return just "$err_msg" if there is a failure, rather than the redundant "$is_error, $error_message".
Took steps to reduce memory overhead - moved a few global variables into local variables; using while loops instead of foreach loops where possible. This is an ongoing effort.
v2.0.0.0025 - 11/08/2000
Fixed regex in sub StandardVersion; searching for "c++" no longer causes a fatal error. Now any search terms which contain non-alphanumerics like ++ will not be bolded in the description or context; other search terms will still be bolded.
Changed the way Perl libraries are require'd. Now all libraries are loaded at start-up, using the same methodology. Should reduce the number of errors like "unable to find searchmods/crawler.pl in @INC"
Function check_db_config now checks mysql version and prints it.
Fixed minor display bug in function GeneralRules, where string values longer than 15 characters were repeated
v2.0.0.0024 - 11/08/2000
Fixed SQL syntax in search queries
Optimized record creation to use less CPU while indexing; the engine appears to use quite a bit more CPU while indexing when SQL is enabled
Fixed "of X records searched" report on search page
Removed some debug print statements that shouldn't have been in version 0023
v2.0.0.0023 - 11/07/2000
Added support for mysql databases
Package LockFile now handles permissions setting internally.
Now have multi-delete option - after deleting a single URL, system will prompt to delete all URL's from the host, same folder, etc.
Added new HTML template, "line_listing.txt", which controls how each search result is displayed.
v2.0.0.0022 - 11/07/2000 - stable release
Fixed bug where the crawler would not save information about failures. In some scenarios, like rebuilding the index, this would cause the re-index process to loop continuously on a single failed page.
Fixed CreateRealmForm function - now showing the proper default folder when the search engine is installed to a subdirectory
v2.0.0.0021 - 11/04/2000
Fixed bug with saving passwords; when settings.pl is not writable, now returns proper error message, rather than suffering fatal error. This was causing many installs to fail.
Added bug report form which shows up when search.pl suffers fatal Perl error - should speed up detection of errors like one above
Cleaned up wording and format of DeleteRealm command/function
Improved error handling in search_ads.pl, when ads.xml is not writable - now returns proper error message.
Fixed off-by-one error in filter_rules.pl, which was causing the list of pages awaiting approval to start at 2 rather than 1.
The Javascript function links "check all" and "clear all" now only appear if the user has Javascript enabled.
v2.0.0.0020 - 11/03/2000
Fixed bug in search.pl itself, which was preventing administrators from "Following Checked Links"
v2.0.0.0019 - 11/02/2000
Fixed bug in FlockEx, which was causing file locking to be bypassed.
v2.0.0.0018 - 10/25/2000
Added feature for "Show Examples". Allows you to display where the search term appeared within the document.
Fixed bug whereby new records added after Admin Approval were missing trailing newlines, preventing them from showing up in search.
Now the "Add URL" link will only be shown to visitors if Remote Realms have already been created for them to add to. Previously visitors would click on "Add URL" and be shown an error message that no Remote Realms exist.
Now using the "footer.htm" template as the footer for the admin page - previously only header.htm was used for building the admin page. Fixes bugs involving mismatched tags between header and footer templates.
No longer displaying the last modified time for the index files - the table of data was becoming too busy.
v2.0.0.0017 - 10/21/2000
New functionality: now all settings from settings.txt can be viewed and set from the web interface.
Added FlockEx and CryptEx stub functions as an abstraction on flock() and crypt(). Systems that don't support flock or crypt are now auto-detected in these stub functions, rather than causing the whole search engine to crash. Fixes major bug where Windows 9x web servers were crashing on post-0009 builds because of their heavy use of flock().
Improved the Create/Update realm interface.
fdse_realms.list function changed to always return an alpha-sorted list of realms. Should add some consistency to the "select realm" drop-down boxes and the main admin page.
Fixed bug that was subtly turning 0-values to null strings when calling html_encode, url_encode, etc.
Replaced functions LoadRules, WriteRule, ValidateRule with an FD_Rules object.
Function GetFilesByDirEx changed to no longer look inside "searchdata" and "searchmods" folders when gathering files for Local Realms.
Improved the search engine installer by making it more generalized. Now allows user to select from multiple versions of the engine. In theory it could install any script, not just FDSE.
No longer using -W switch to test for writability of "searchdata" folder. Permissions test was giving false negatives on Windows NT, preventing the use of admin functions even though all permissions were in order.
v2.0.0.0016 - 10/13/2000
Made all file operations use buffered stdio, by replacing all calls to syswrite() with calls to print(). Fixed problem where index files would become corrupted on some systems.
Tightened error-checking in the LockFile package. Should help prevent data corruption.
Now local file indexes are processed by Filter Rules, in addition to web-crawled pages.
Fixed accounting bug for local realms, whereby the file count was not taking into account pages that were excluded due to meta tags or size.
v2.0.0.0015 - 10/10/2000
Fixed offset-by-2 error that caused the Altavista-style "next hits" toolbar to show a maximum of 22 jumps, instead of the intended 20.
Corrected behavior for crawler when it has an error indexing a page that already exists in the index (for example, it finds a 404 error when trying to update a listing). It now correctly deletes the original record from the index file. Previously, it had left the original record intact, which contributed to garbage piling up in the index.
Added a "Filter Rules" feature which should help manage which URL's are added to the index, particularly by visitors. Allows filters based on strings in the hostname, URL, and document HTML of a page. Also allows filtering based on RSACi and SafeSurf PICS labels.
Added an "Administrator Approval" option for visitor-supplied URL's. Now admins can review pages before they exist in the index and come up in visitor searches. This should help cut down on spamming of the index. Approval can be configured as the default for all visitor additions, or can be required in response to certain strings, using the new Filter Rules.
Fixed subtle bug involving importing old data files into new engines. Would cause looping while re-indexing a realm that contained URL's with uppercase or mixed case hostnames (for example, a realm containing the page "http://www.MicroSoft.com/"). Now all URL's are required to be written with lowercase hostnames.
v2.0.0.0014 - 10/02/2000
Reversed the change from build 0013. All web pages will be read with the read() call, not recv().
v2.0.0.0013 - 10/01/2000
Replaced call to read() with call to recv() in crawler.pl, line 387. The incorrect call to read() works on most versions of Perl, but some newer versions are more strict.
v2.0.0.0012 - 09/24/2000
Added Altavista-style "Next" and "Previous" links that allow end users to better scroll through multiple pages of search results.
Fixed a number of wording errors that occurred with certain inputs. For example, when a query returned no matches, previously the engine would report "Listing documents 1-0 of 0". Now it just says "No matches found". Also, when all the search terms were part of the ignore list, the engine would report "Your search for '' found 0 documents". Now it just says "Your search found 0 documents".
Fixed possible security bug, whereby people entering Javascript snippets as search terms - such as "<script>alert('Hello');</script>" - would have their code executed on the results page. Now all end-user strings are passed through the html_encode() function which causes the results page to render a literal "<script>alert('Hello');</script>", rather than pass through an executable script snippet.
Fixed bug in LockFile modules, whereby the search engine was not able to rebuild local realms.
Setting "max index file size" to 0 effectively removes all index file size checking. Previously a non-zero value was required. The default remains at 10mb.
The HTML parser now follows links in imagemaps (AREA href) in addition to traditional A, FRAME, and IFRAME links.
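The security fix in this build relies on encoding user-supplied text before echoing it into a page. A minimal sketch of that technique in Python uses the standard html.escape function as a stand-in for FDSE's html_encode(). The surrounding function and message text are hypothetical.

```python
import html


def render_search_summary(terms):
    """Build a results line with the user's terms safely encoded."""
    # html.escape converts <, >, & and quotes into character entities,
    # so the browser displays the text instead of executing it.
    return "Your search for '" + html.escape(terms) + "' found 0 documents"


print(render_search_summary("<script>alert('Hello');</script>"))
```

The output contains "&lt;script&gt;" rather than a live script tag, so the browser shows the snippet as literal text.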
v2.0.0.0011 - 09/02/2000
Fixed bug whereby a readonly filehandle was being locked for writing.
v2.0.0.0010 - 08/29/2000
Fixed bug involving editing a specific URL (sub CompressStrip)
Added better file locking code. Previously, the search engine did not properly protect its data files from being accessed by multiple users at the same time, leading to loss or corruption of data when several users would try to edit the files at the same time (dozens of changes, not marked with #changed 0010 - look for "$obj = new LockFile").
Local file indexes are now built using a temp file, so that visitors can continue to search while the build process takes place (sub BuildIndex).
Fixed support for RUNTIME realms (file fdse_realms.pl, sub add).
Fixed order of HREF/TARGET in sub StandardVersion - was causing 404 errors on some Internet Explorer 5.0 clients.
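The temp-file build described in this release can be sketched as follows. This is an illustrative Python version, not the BuildIndex code: the function name rebuild_index, the one-record-per-line format, and the ".tmp" suffix are assumptions made for the example.

```python
import os
import tempfile
from typing import Iterable


def rebuild_index(index_path: str, records: Iterable[str]) -> None:
    """Write a fresh index to a temp file, then swap it into place.

    Readers keep seeing the old index until the final rename.
    """
    directory = os.path.dirname(os.path.abspath(index_path))
    # Create the temp file in the same directory so the rename below
    # stays on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(record + "\n")
        # Replace the live index in a single step.
        os.replace(temp_path, index_path)
    except BaseException:
        # Clean up the partial temp file and leave the old index untouched.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

A failed or interrupted build leaves the previous index in place, which is the property that lets visitors keep searching during the build.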
v2.0.0.0009 - 08/16/2000 - stable
Fixed bug whereby target searches for "text:term" and "link:term" would always fail.
Added proxy server support to the crawler.
The search engine now stores all URL hostnames in lowercase. This will solve problems where the index had duplicates of the form "http://xav.com", "http://XAV.com", "http://Xav.Com", etc. The path portion of a URL is still case sensitive.
Greatly improved efficiency of the UpdateDB procedure, which is called whenever web pages are added to the search index.
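The hostname rule in this release (lowercase host, case-sensitive path) can be illustrated with a short Python sketch. The function name normalize_host_case is hypothetical and this is not the FDSE Perl code.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host_case(url: str) -> str:
    """Lowercase the host part of a URL and leave the path untouched."""
    parts = urlsplit(url)
    # netloc holds the host (plus any port or user info); everything
    # after it keeps its original case.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, parts.query, parts.fragment)
    )


for candidate in ("http://xav.com", "http://XAV.com", "http://Xav.Com"):
    print(normalize_host_case(candidate))
```

All three spellings collapse to one key, while "http://Xav.Com/Docs/Page.htm" keeps "/Docs/Page.htm" as written.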
v2.0.0.0008 - 08/05/2000
Fixed "Forbid Sites" logic for local realms.
Fixed "Ignore Words" functionality.
The searchdata folder will now contain a file "total_pages_indexed.txt". It contains a text string of the number of pages indexed in all realms. This file can be #included into your main page if you want to tell users how many files are indexed.
v2.0.0.0007 - 08/03/2000
The search.pl file no longer has Perl warnings enabled via the -w flag.
Fixed discrepancy between $CodeFilesDir variable and hardcoded "searchmods" path. Was causing some modules to be loaded while others were not. Also fixed occasional error "Subroutine AdsLoaded redefined". (Re-ordered module loading in sub load_files in search.pl) With this change, the file "search_ads.pl" will now always need to be present, as it is by default. You cannot safely delete the file to remove the functionality.
Added an option to track the number of web pages in each realm. The $realms object has two new methods defined at the bottom of the file fdse_realms.pl. The call $realms->setpagecount was added to common.pl in the functions BuildIndex, AddURL, and DeleteRecord.
Removed the bolding of search terms in the Description field when displaying hits. Although a useful feature, this was causing errors when the script was used to index 2-byte languages like Japanese, Cantonese, and Korean. (Commented out lines in sub StandardVersion in common.pl).
Added an option for accent sensitivity. Currently the engine is "accent insensitive" meaning that "fur" is the same as "für". This was proving to be a problem for Swedish-language engines, where accent differences lead to a lot of unexpected results. Toggling accent sensitivity on or off will require a rebuild of all the search indexes - normally it should be initially configured and then left alone. Also enabling accent sensitivity may cause problems for those with US-English keyboards that don't easily type high Latin characters. Added option "Accent Sensitive" to the settings files (default 0), and to the %defaults array in common.pl. Added an if statement to sub RawTranslate to not strip accents if user has so configured.
The AutoInstaller has been improved, and the documentation for manual installs has as well.
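The accent-sensitivity option added in this build can be pictured with the Python sketch below. It is an assumption-laden illustration: fold_accents and index_key are hypothetical names, and Unicode decomposition is used here in place of whatever character table sub RawTranslate uses.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(text: str, accent_sensitive: bool = False) -> str:
    """Return the form of `text` that would be stored and searched.

    With accent_sensitive off (the default described above), accented and
    unaccented spellings share one key.
    """
    return text if accent_sensitive else fold_accents(text)
```

Because the stored keys depend on the setting, changing it means every index has to be rebuilt, as noted above.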
v2.0.0.0006 - 05/23/2000
Substantial changes were made in this version. Rather than respond to individual bug reports, I did a line-by-line code review of all 5,100 lines of code. There were many cases where several functions did the same thing, or several blocks of code did the same thing. In these cases I consolidated code into single standalone functions that did the job right.
All HTTP headers are now written with the "\015\012" end-of-line sequence, to strictly comply with the HTTP standards.
Functions "webEncode" and "escape" did the same thing; they were both replaced by the streamlined "url_encode" function.
Function "parse_meta_header" added to handle all parsing of HTML META tags.
Added additional error checking around inclusion of the Advertising Module
Added function ReadFile, and standardized its use for all cases of reading small text files.
Made file access errors explicit for template files - previously, a failure to read a template file would be ignored. Now the template files are required to be present and readable.
Removed legacy calls to "ForbidFilesFromSites" and "GetFilesByDir" from functions BuildIndex, ReviewIndex, and SearchRuntime. They have been replaced with the new GetFilesByDirEx function.
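As a small illustration of the header change in this build, the Python sketch below writes every header line with the CR LF pair. The function name format_http_headers is hypothetical; the octal escapes "\015\012" are the same two bytes as "\r\n".

```python
CRLF = "\r\n"  # identical to the "\015\012" sequence named above


def format_http_headers(headers: dict) -> str:
    """Join header lines with CR LF and end the block with a blank line."""
    lines = [f"{name}: {value}" for name, value in headers.items()]
    return CRLF.join(lines) + CRLF + CRLF


print(repr(format_http_headers({"Content-Type": "text/html"})))
```

Spelling out the byte values avoids relying on "\n", which does not map to the same bytes on every platform.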
v2.0.0.0005 - 05/21/2000 - stable
The robots.txt file is now parsed for local realms in addition to remote realms.
URLs containing the string "_vti" are not followed during crawl sessions, since they usually point to Front Page admin files and folders.
A new constraint has been added: $Rules{'Max Characters: Text'}. This is the max number of bytes of text that will be saved in the database. Those who want to only index titles and descriptions can set this value to 0. (Note that $Rules{'Max Characters: File'} refers to the maximum number of bytes read in from the entire file; it is from these raw bytes that the title, description, and text are extracted.)
A new variable has been added for frames-based pages. To get all search pages to open in a specific frame, the $strTarget variable may be customized. By default all links open in the same window as the search engine itself.
Fixed bug whereby the Advertising Module was not working for users without cookie authentication.
Changed copyright from "Fluid Dynamics" to "Zoltan Milosevic"
v2.0.0.0004 - 05/14/2000 - stable
Corrected problem whereby the crawler would die when encountering very large robots.txt files.
v2.0.0.0002 - 01/06/2000 - stable
Added customizable HTML for the header, footer, and search tips.
Added banner advertising module, and supporting documentation in the help file.
Fixed bug where URL's listed in the @ForbidSites array would not be forbidden for local realms. Fix was in ForbidFilesFromSites function.
Improved documentation on the Delete Page interface.
Now correctly recognizing description Meta tags without double quotes. Previously, a tag like {Meta name=description content="foo"} was ignored because the word "description" was not double-quoted. This affected the keywords Meta tag as well.
Fixed a bug where user-submitted URL's could be duplicated by submitting them with and without the trailing slash. For example, a user could enter "http://xav.com" and "http://xav.com/" and they would be treated as separate files in the database.
Removed the variable "$SearchTipsPage", since it is only used once. It has been replaced with variable "$SCRIPT_NAME".
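The Meta tag fix in this build amounts to treating the quotes around the name value as optional. A minimal Python sketch of that idea follows; the pattern and the function name meta_content are assumptions for illustration and are not taken from the FDSE source.

```python
import re
from typing import Optional

# The quote around the name value is optional; the content value is
# still expected in double quotes, as in the example tag above.
META_RE = re.compile(
    r'<meta\s+name\s*=\s*["\']?([\w.-]+)["\']?\s+content\s*=\s*"([^"]*)"',
    re.IGNORECASE,
)


def meta_content(html_text: str, name: str) -> Optional[str]:
    """Return the content of the first Meta tag with the given name."""
    for match in META_RE.finditer(html_text):
        if match.group(1).lower() == name.lower():
            return match.group(2)
    return None


print(meta_content('<Meta name=description content="foo">', "description"))
```

The same pattern covers the keywords tag, since only the name being compared changes.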
v2.0.0.0001 - 11/15/1999 - stable
Browser compatibility - added a STYLE attribute to the search terms box so that it's approximately the same size in both Internet Explorer and Netscape.
Security - began using session cookies, rather than short-lived persistent cookies, for authentication. Fixed problem where Netscape was not saving a cookie for 2-part names, like "http://xav.com".
Added strict versioning to the script file, starting with version 2.0.0.0001.
v2.0.0.xxxx - 11/15/1999
Added success statistics to final page.
v2.0.0.xxxx - 11/13/1999
Updated GetRobotFile function to treat Disallow " " as allowing everyone access, rather than denying access to everyone. Also added debug output to this function to assist with trouble-shooting.
Moved the "my %FORM" declaration to the beginning of the script. Its previous location had caused the Realm select list in the SearchForm function to always default to the "All Sites" realm, rather than the one previously searched. This may have fixed other problems as well.
Edited source code to add a "local $_;" statement to the beginning of each function which didn't already have one. This will prevent data corruption, since I mix a lot of $_ use with function calls.
Edited code to declare all variables with initial values, and to scope each variable with "my". This increases script speed and stability, and should make the script 100% compliant with mod_perl. If anyone has a mod_perl server and would like to verify, I'd be grateful.
Removed option to print results in CompactForm. The only option is now StandardForm. Compact was rarely used, and the loss of this option cleans up the UI.
Lowered default "minimum bytes per page" from 256 to 128 bytes.
Corrected support for runtime realms; they were being treated as local indexed realms in many cases.
Improved the feature that makes search terms appear bold in the document description.
Added restartable local file indexing. Now, if the local index process takes more than 50 seconds, the script will save its work and refresh itself. This prevents data loss due to server timeouts with large local file sets.
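The restartable indexing described in this release can be sketched as a loop with a time budget. This Python version is illustrative only: index_with_budget, handle_file, and the injectable clock are hypothetical, and the 50-second figure is the one quoted above.

```python
import time
from typing import Callable, Iterable, List


def index_with_budget(
    pending: Iterable[str],
    handle_file: Callable[[str], None],
    budget_seconds: float = 50.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Index files until the list is done or the time budget is spent.

    Returns the files that still need indexing, so the caller can save
    that list and resume in a fresh request.
    """
    started = clock()
    remaining = list(pending)
    while remaining:
        # Check the budget before starting each file, never mid-file.
        if clock() - started >= budget_seconds:
            break
        handle_file(remaining.pop(0))
    return remaining
```

An empty return value means the run finished; anything else is the saved work list for the next pass.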
v2.0.0.xxxx - 11/06/1999
Updated search tips to correctly describe case sensitivity issues. Searches were made all lowercase about a month ago, but the tips had not been updated.
v2.0.0.xxxx - 10/30/1999
Removed CollapseURL function because it is no longer called.
Updated function GetAbsoluteAddress to fix the following errors:
Previous: http://intranet/../foo.gif => http://foo.gif
Now correctly maps to http://intranet/foo.gif
Previous: http://www.foo.com/../page.htm => http://www.foo.com/../page.htm
Now correctly maps to http://www.foo.com/page.htm
Fixed error in multi-word search when "All Terms" in use. Was causing search terms to be found only when they appeared as a phrase, instead of in any order.
Fixed typo in Extract_Meta function near line 3040; missing colon after "http". Was causing "Crawler: Follow Offsite Links" rule to fail.
Updated Extract_Meta function near line 3040; added code to ignore embedded links that matched the ForbidSites array. Checking for forbidden status is much more efficient at this point in the process.
Removed references to undeclared function "LastError" from lines 1595 and 1611. Replaced with "$!" Perl variable.
Fixed $Keywords versus $KeyWords ambiguity. As a result, Meta keywords are now showing up in the index again, as expected.
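The GetAbsoluteAddress fix in this release concerns ".." segments that climb above the site root. The Python sketch below shows one way to collapse them; collapse_parent_segments is a hypothetical name, and the handling is simplified (it also drops empty and "." segments and gives a bare host a "/" path).

```python
from urllib.parse import urlsplit, urlunsplit


def collapse_parent_segments(url: str) -> str:
    """Resolve "." and ".." path segments without climbing past the root."""
    parts = urlsplit(url)
    kept = []
    for segment in parts.path.split("/"):
        if segment == "..":
            # Going above the root is ignored rather than eating the host.
            if kept:
                kept.pop()
        elif segment not in ("", "."):
            kept.append(segment)
    path = "/" + "/".join(kept)
    # Keep a trailing slash when the original path named a directory.
    if kept and parts.path.endswith(("/", "/.", "/..")):
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(collapse_parent_segments("http://intranet/../foo.gif"))
print(collapse_parent_segments("http://www.foo.com/../page.htm"))
```

Splitting the host from the path first is what keeps a leading ".." from consuming the hostname.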
v2.0.0.xxxx - 10/30/1999
Perl path information was not being updated in the final version of the script. As a result, servers that accepted "#!/usr/local/bin/perl" worked, but all others, including the common "#!/usr/bin/perl", were failing. Failures appeared as the error "unable to set administrative password".
v2.0.0.xxxx - 10/10/1999
Fixed handling of non-English characters. Moved the call to RawTranslate to the first call in the CompressStrip function - necessary since RawTranslate only works if non-alphanumerics are present, and the other calls in CompressStrip were stripping non-alphanumerics. Also changed the MakeRecord function to stop calling RawTranslate on the URL, Title, and Description fields - by doing this, the user-visible text appears in its original language format to the user.
Fixed the display of hits where the search terms show up in bold in the document description. Changed line 3207 in function StandardVersion from $Description =~ s!$Term!<B>$Term</B>!ig; to $Description =~ s!($Term)!<B>$1</B>!ig;.
Removed prototyping of functions - for example, MakeRecord ($$$$) changed back to just MakeRecord. Prototyping had been used on functions MakeRecord and Trim.
Added alpha-sorting to the list of realms shown on the admin page.
Added crawler rules for "Ignore Links To" and "Follow Offsite Links"; these variables are declared and documented with the %Rules array, and are referenced in the Extract_Meta function.
Fixed the DeleteRecord function so that it now removes the web page from both the index file and the pending pages file.
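The bolding fix in this release replaces the matched text with itself (the captured group) instead of with the search term as typed, so the original capitalization survives a case-insensitive match. A Python equivalent is sketched below; bold_term is a hypothetical name, and re.escape is an addition not present in the Perl line quoted above.

```python
import re


def bold_term(description: str, term: str) -> str:
    """Wrap each case-insensitive match of `term` in <B> tags.

    The replacement uses the captured text, so "PERL" stays "PERL"
    even when the user searched for "perl".
    """
    pattern = re.compile("(" + re.escape(term) + ")", re.IGNORECASE)
    return pattern.sub(r"<B>\1</B>", description)


print(bold_term("Perl and PERL scripts", "perl"))
```

With the earlier form of the substitution, both matches would have been rewritten to the lowercase term the user typed.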
v2.0.0.xxxx - 10/03/1999
Made all searches case insensitive. Searching for "Bob" and "bob" are the same. This has major implications, including:
All index files must be rebuilt. For local indexes, choose "Index All Files". For remote realms, choose the new "Index All Pages" command.
The search engine is now about three times faster.
The following functions have had major updates: SearchRunTime, SearchIndexFile, RawTranslate, StripIgnoreWords, Format_Term
Changed RawTranslate function to dramatically improve efficiency in indexing sites
Changed SearchIndexFile and SearchRunTime functions to cache Perl regular expressions, improving search speed
Replaced StripIgnoreWords function with a StripIgnoreWords string which can be evaluated, allowing Perl to cache regular expressions. Also greatly extended the default IgnoreWords array in order to decrease index file size and speed up searches.
Added feature to restrict crawler to a single site.
Now use read() function for files and web pages to only read in a limited amount of data, given by Rules{'Max Characters: File'}. Previously the engine read in all the content before truncating data, and the engine would sometimes get bogged down with huge files. Using read is also more efficient. Changed the Get_String and GetStringByURL functions.
When web addresses are added with trailing whitespace, the whitespace was previously preserved with the record, instead of trimmed. This caused the appearance of duplicate entries in the database. Fix was to trim whitespace from the entry.
Addition of bold-font for search terms which appear in document description (changed StandardVersion function)
Removed a set of little used functions (Read, Open, Lines, SetErr, etc.). Replaced these functions with actual code in places they were used. Changes were in functions DeleteRealm and LoadRealms.
Added Rules{'Index ALT Text'} and Rules{'Index Links'} search options. The indexing of embedded links is now optional and disabled by default, since it contributes about 10-15% to the index file size and is a little-used feature. The search tips page has been updated to mention that link:text searches aren't supported on all systems. The search of ALT text is also optional, but is enabled by default.
Added ReCrawlRealm and CrawlEntireSite for batch crawl sessions.
Edited HTML_UI and AddURL functions to expose the new administrative functions.
Removed unused LogOut function.
Added "Register" link to bottom of admin pages.
Changed GetStringByURL function to return file size as a third return value. This fixes a bug whereby large files, with content-length beyond Rules{'Max Characters: File'}, would have their size reported as the truncated max characters instead of their original length. MakeRecord now accepts a fourth parameter for file size.
The RawTranslate function, which changes non-ANSI characters like ã to a, is now called from inside MakeRecord() instead of when the data is read in. This speeds the process up, since the expensive RawTranslate function gets called after all the extra text has been stripped out, and it makes the data more accurate, since title, description, and links are extracted while the document is in its pure form.
Updated function SaveLinksToFile to fix an offset bug, whereby the realm of the final few entries would be set to "0".
Updated BuildIndex function to better communicate progress to the UI.
Updated Extract_Meta function to better strip text between SCRIPT and STYLE tags, and to extract only the correct links when running in IndexEntireSite mode.
Added explicit chmod to every file-creating operation, so that data files will be editable by all users.
Removed expressions of the format "foreach my $Var (@array)" which were causing a failure on Perl 5.003. Changes were to SearchIndexFile and StandardVersion functions.
Added feature to remove all pages from the search.pending.txt file when a realm is deleted, and to add all pages to the file when a realm is added. For users who build index files remotely, the server's master search.pending.txt file can now be updated by deleting and then re-creating the realm via the HTML admin user interface. This will "sync" the web pages in the index file with the server's pending file. Having the pending file up-to-date is required when re-crawling files and rebuilding the index.
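Two entries in this release work together: reads are capped at Rules{'Max Characters: File'}, and GetStringByURL reports the original file size separately from the truncated data. The Python sketch below illustrates that pairing; read_limited and its parameters are hypothetical names, and using the Content-Length header as the size source is an assumption based on the description above.

```python
import io
from typing import BinaryIO, Optional, Tuple


def read_limited(
    stream: BinaryIO, max_bytes: int, content_length: Optional[int] = None
) -> Tuple[bytes, int]:
    """Read at most max_bytes and report the document's full size.

    When the server announced a Content-Length, that value is returned
    as the size; otherwise the number of bytes actually read is used.
    """
    data = stream.read(max_bytes)  # never pulls more than the cap
    size = content_length if content_length is not None else len(data)
    return data, size


data, size = read_limited(io.BytesIO(b"x" * 1000), 100, content_length=1000)
print(len(data), size)
```

Keeping the size as a separate value is what lets a large page be truncated for indexing while still showing its real length in results.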
v2.0.0.xxxx - 10/03/1999
Added "htdocs" as a sample folder, along with "public_html"
Added code to print out FTP directory listing when auto-install is not able to find the correct folder
Added audit code to catch FTP sites of the format "ftp://ftp.site.com", "intranet_site", and "invalid!!!char".
v2.0.0.xxxx - 09/25/1999
Fixed scenario whereby a closing STYLE tag inside of a SCRIPT tag could cause script contents to appear in description and search engine.
Documents which contain non-alphanumeric titles (typically blank ones) will have their filenames or URL used instead.
Case sensitivity removed; all searches are now done without regard to case.
Changed index file format to conserve space, by only entering the title, description, and keywords once.
Added "Index ALT Text" rule to control whether text in the ALT attribute of images is included in the index file. By default it is true; it can be set to 0 to disable this behavior.
v2.0.0.xxxx - 09/23/1999
Corrected handling of French character é by changing line 1476 in sub RawTranslate from:
s!(é|é|é)!e!g;
to
s!(é|é|é)!e!g;
v2.0.0.xxxx - 09/12/1999
Added HTML header output for case when the search terms entered are all ignored (i.e., only common words like "web")
v2.0.0.xxxx - 09/11/1999
The HTML output was run through the RxHTML validator. 16 HTML structure problems, which may have caused problems for some browsers, were detected and fixed. One spelling error was fixed.
Added code to prevent > signs from appearing in description or text (had caused mild db corruption)
Fixed bug where local realm names which contained spaces would be treated as remote realms on the "Maintain" screen of the Admin page.
Fixed bug where building local realms could fall into an infinite loop when encountering symbolic directories which point to higher level directories, or when two sibling symbolic links pointed to each other's parent directories. Hopefully the workaround will not cause too many problems. See "Following Symbolic Links while Indexing Files" for details.
v2.0.0.xxxx - 09/01/1999
Fixed GetCookieDomain function to handle 2-part domain names, such as "xav.com" or "gte.net". Previously the script set Auth cookies with a domain of ".com" or ".net", which is illegally short. Browsers ignored these cookies, so the admin was unable to log in.
Added URL-encoding to realm names, so realms named "web pages" (for example) can be manipulated from the Admin Page.
Fixed bug with @ARGV which affected certain web server software, such as WebSitePro and possibly Roxen Challenger. Symptoms were that, when entering Admin screen, the first page would come up to prompt for password, but subsequent requests resulted in the normal search page coming up, instead of the advanced admin pages. This affected any web server which passed the querystring in both $ENV{'QUERY_STRING'} and as a command-line argument. Now the search engine looks for input using REQUEST_METHOD=POST first, then in QUERY_STRING, and finally, if neither REQUEST_METHOD nor QUERY_STRING are populated, it will read from the command line.
Upgraded logging module to record visitor IP address if hostname is unknown, instead of "undefined"
Fixed bug where FTP NLST command sometimes caused install failure when installing to an empty directory. Installing to empty directories now works.
Changed folder discovery process to better support FTP hosts which do not implement the PWD command
Added additional tracing output, including FTP SYST command to determine which types of FTP servers are failing the install
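The input-source order from the @ARGV fix in this release (POST body first, then QUERY_STRING, then the command line only when neither variable is set) can be sketched in Python as follows. The function name read_cgi_input and its arguments are hypothetical; the environment variable names are the standard CGI ones quoted above.

```python
from typing import List, Mapping


def read_cgi_input(environ: Mapping[str, str], stdin_text: str, argv: List[str]) -> str:
    """Pick the request input using the precedence described above."""
    # 1. A POST request always reads its data from standard input.
    if environ.get("REQUEST_METHOD", "").upper() == "POST":
        return stdin_text
    # 2. Otherwise a populated QUERY_STRING wins.
    if environ.get("QUERY_STRING"):
        return environ["QUERY_STRING"]
    # 3. The command line is consulted only when the script is clearly
    #    not running under a web server that set these variables.
    if not environ.get("REQUEST_METHOD") and argv:
        return argv[0]
    return ""
```

Under this order, a server that passes the query string both in the environment and as an argument can no longer have the argument override the real request.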
v2.0.0.xxxx - 08/31/1999
Stopped using defined() test when reading HTTP traffic. This had previously caused some sites to return 0 bytes when being crawled, and the error "Doesn't meet minimum 24 bytes of text" was coming up.
In mid-August 1999, the "Xavatoria II" beta script was officially renamed to "the Fluid Dynamics Search Engine" and was placed on this site.