Advanced Search: Highlighting search terms in the actual document
With the FDSE proxy tool, your visitors can view documents with the search terms highlighted.

Visitors can always view search results normally by simply clicking on the hyperlinked title. However, now a new link is offered, "View with highlighted search terms". When visitors follow that link, the FDSE proxy.pl or proxy.cgi script does the following:
- intercepts the request and makes a behind-the-scenes HTTP request for the document
- inserts a customizable header. The default header includes a disclaimer stating that you are not necessarily responsible for the content of the page.
- then displays the page with each search term highlighted. The default highlight is bold black text on a yellow background.
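
The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual proxy.pl source: the function names, the header wording, and the choice to highlight only text outside of HTML tags are all assumptions made for the example.

```python
import html
import re
import urllib.request

# Default style described in the text: bold black text on a yellow background.
HIGHLIGHT = '<b style="color:black;background-color:yellow">{}</b>'

def build_header(url):
    """Return a disclaimer header (hypothetical wording) with a direct link."""
    safe = html.escape(url, quote=True)
    return ('<div>This page was retrieved from <a href="{0}">{0}</a>. '
            'The operator of this site is not necessarily responsible '
            'for its content.</div>').format(safe)

def highlight_terms(page, terms):
    """Wrap each case-insensitive literal occurrence of a term in the highlight."""
    terms = [t for t in terms if t]
    if not terms:
        return page
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    # Splitting on a captured tag pattern yields text at even indexes and
    # tags at odd indexes, so attributes such as href values stay untouched.
    parts = re.split(r"(<[^>]*>)", page)
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(lambda m: HIGHLIGHT.format(m.group(0)), parts[i])
    return "".join(parts)

def proxy_view(url, terms, fetch=None):
    """Fetch the document, prepend the header, and highlight the terms."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
    return build_header(url) + highlight_terms(fetch(url), terms)
```

The `fetch` parameter exists only so the sketch can be tried without network access, for example `proxy_view("http://example.com/", ["news"], fetch=lambda u: "<p>news</p>")`. A real implementation would also need to place the header inside the document body and skip script and style content, which this sketch does not attempt.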

How to enable
Follow these steps to allow visitors to view documents with highlighted search terms.
Install FDSE version 2.0.0.0052 or newer.
Visit the "proxy.pl" script that is installed at the same level as your "search.pl" script.
Some systems require the ".cgi" extension (or some other extension) instead of ".pl". If that's the case with you, then treat .pl as .cgi everywhere in this document unless otherwise noted.
Edit the source code of "proxy.pl". Find the text $SECURITY_ENABLE = 0. Set this to $SECURITY_ENABLE = 1.
When you visit proxy.pl, you should see the default output which includes a string of help text and a test form with URL and keyword inputs. You must first get the proxy script working before you proceed. If you have trouble, post to the discussion forum with a detailed description of your problem.
Make a test request by entering a sample URL and search term, such as "http://www.yahoo.com/" and "news". You should see a proxy-translated version of the Yahoo! main page. If you don't, then it may be that your web server is behind a firewall or proxy server or perhaps your system doesn't allow CGI scripts to make socket requests. If any of these things are true, then the FDSE proxy utility cannot be used on your site.
Once you're satisfied that proxy.pl works, you may want to edit the source and customize the disclaimer text and/or the highlighting colors.
Once proxy.pl works and has been customized, edit the "line_listing.txt" template. Add something like this:
<a href="proxy.pl?terms=%url_terms%&url=%url_url%">
View with Highlighted Search Terms</A> -
Your visitors will now have the option of viewing results through the proxy viewer.
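
The template link works only if the search terms and the target URL are URL-encoded before they are placed in the query string, which is presumably what the %url_terms% and %url_url% placeholders provide. The following Python sketch shows the idea; the function name `proxy_link` is hypothetical and is not part of FDSE.

```python
from urllib.parse import quote

def proxy_link(terms, url, script="proxy.pl"):
    """Build the proxy viewer link with both values percent-encoded."""
    # safe="" also encodes "/", so the target URL's own "?", "&" and "="
    # cannot be mistaken for parts of the proxy script's query string.
    return "{}?terms={}&url={}".format(
        script, quote(terms, safe=""), quote(url, safe=""))

print(proxy_link("news today", "http://www.yahoo.com/?a=1&b=2"))
```

Without the encoding, a target URL that carries its own query string would be split at its first "&" and the proxy script would receive a truncated address.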
Security considerations
The following are some of the security risks associated with using the FDSE proxy tool. Please read through them carefully and do not enable the proxy tool if these risks apply to your site.
-
Open proxy danger: the FDSE system can act as an "open proxy" when it allows any anonymous visitor to request any URL without revealing his own original IP address. Sites that operate open proxies are considered bad Internet citizens.
Example of threat: an aspiring hacker could try to compromise the whitehouse.gov server by repeatedly requesting http://whitehouse.gov/admin/u=$user&p=$pass with different values of $user and $pass. One way that whitehouse.gov protects itself is through the threat of their admins detecting these probes, then looking up the perpetrators, and prosecuting them. An FDSE open proxy allows the aspiring hacker to make all requests as http://xav.com/proxy.pl?url=http://whitehouse.gov/admin/u=$user&p=$pass. In this case, any backtracking by admins will show requests coming from "xav.com" rather than the original hacker IP *. The investigating admins might then vent their rage at the person who was operating an open proxy.
To minimize this risk: use the setting $SECURITY_MATCH_PENDING_FILE = 1. In our example, the proxy.pl tool would not allow the hacker to visit http://whitehouse.gov/admin/u=$user&p=$pass unless that URL was already in the index, which it wouldn't be.
* FDSE does relay the originating IP in the HTTP_X_FORWARDED_FOR variable, but this variable is usually not stored in server logs and so it is not as useful to investigators.
-
Exploiting IP-address trust relationships: some web-based systems are secured using the concept of "trusted IP ranges".
Example of threat: a guestbook program may allow normal access to all web visitors by default, but it will grant admin access to visitors from the webmaster's own computers, as determined by the visitor IP address and a pre-configured set of trusted IP addresses. If the IP address of the web server itself is considered a "trusted IP", and if proxy.pl is installed on the web server, then any traffic routed through proxy.pl will show up as coming from a trusted source, even though it should not be trusted.
To minimize this risk: do not include the web server IP address in any "trusted IP" sets. Also, use the $SECURITY_MATCH_PENDING_FILE = 1 setting and do not include any sensitive URLs in your search index files.
-
Exploiting DNS-namespace trust relationships: some browsers like Internet Explorer allow users to configure their security settings on a per-domain basis.
Example of threat: a web user might allow unsigned, unsafe ActiveX controls on xav.com, but not on other sites. The user thus considers anything coming from the ".xav.com" site to be trustworthy. An aspiring hacker would exploit that trust by creating a web page at http://foo.com/hack.html which contained a malicious unsigned, unsafe ActiveX control. He would then link the unsuspecting user to http://xav.com/proxy.pl?url=http://foo.com/hack.html. In this way he could cause his web pages to be executed under the more trusting rules which are applied to .xav.com content.
To minimize this risk: use the $SECURITY_MATCH_PENDING_FILE = 1 setting and do not include any untrusted URLs in your search index files. Generally the only reason that visitors would have applied special liberal trust rules to your site is if your site requires execution of otherwise unsafe content. If your site uses very unsafe content in the first place, it is best not to install the proxy.pl tool at all. For a text-only site like xav.com, it is unlikely that any visitor would have special trust rules assigned to the xav.com name and so this is unlikely to put visitors at risk.
-
Exploiting cookie-namespace trust relationships: JavaScript code can be injected to read and write a visitor's cookies, using the same kind of malicious URLs as in the above example.
Example of threat: a web-based email site might store a user's login and password information in a cookie. Normally this is okay since the cookie is only shared with the originating server. However, once the proxy.pl tool is added, an aspiring hacker can force a JavaScript-containing page to be rendered in the visitor's browser in the namespace of the originating server, thereby exposing these credentials.
To minimize this risk: use the $SECURITY_MATCH_PENDING_FILE = 1 setting and do not include any untrusted URLs in your search index files. If your site uses cookies which contain sensitive data, it is best not to install the proxy.pl tool at all. Using path-specific cookies will also prevent this exploit.
As of FDSE 2.0.0.0052, the proxy.pl/proxy.cgi script ships with $SECURITY_ENABLE = 0. These risks are only present after enabling the proxy.
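
The index-matching safeguard recommended for each of the risks above amounts to an allowlist check: the proxy refuses any URL that is not already known to the search index. The Python sketch below illustrates that logic only. The one-URL-per-line input format and the names `load_indexed_urls` and `may_proxy` are assumptions for the example and do not describe how FDSE actually stores or reads its index.

```python
def load_indexed_urls(lines):
    """Collect URLs from text lines, one URL per line (hypothetical format)."""
    return {line.strip() for line in lines if line.strip()}

def may_proxy(url, indexed_urls, security_enable=False, match_index=True):
    """Decide whether a proxy request for `url` should be served."""
    # Mirrors the shipped default: the proxy refuses everything until enabled.
    if not security_enable:
        return False
    # With index matching on, only URLs already in the index are served,
    # so arbitrary third-party addresses cannot be relayed.
    if match_index and url not in indexed_urls:
        return False
    return True

indexed = load_indexed_urls(["http://xav.com/\n", "http://xav.com/scripts/\n"])
print(may_proxy("http://xav.com/", indexed, security_enable=True))
print(may_proxy("http://whitehouse.gov/admin/", indexed, security_enable=True))
```

An exact string comparison is the most conservative choice here, since any looser matching (prefixes, host names only) would widen the set of URLs an anonymous visitor can request.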
Known limitations
-
The proxy.pl script shares many of the limitations of the FDSE web crawler. It cannot view SSL web pages, nor web pages protected by usernames and passwords. It cannot make requests if the firewall/network configuration, or the system policy, does not allow Perl CGI scripts to make requests.
-
Some web pages are highly dynamic, displaying custom information based on cookie, IP address or other visitor-specific information. The proxy.pl script does not transparently relay all visitor information. Thus some web pages may not appear properly. MSNBC.com pages have this limitation.
-
Some Javascript code will cause pages to not appear correctly through the proxy viewer. At the time of this writing, Microsoft.com pages have this limitation.
-
Frameset documents will not appear properly. Visitors will have to click through to the actual document. That is one of the reasons that it is important for a direct link to remain in the header/disclaimer area.
-
Keyword highlighting requires a case-insensitive* direct literal match. Phrases and keywords containing wildcards will not be highlighted. Keywords containing extended characters, like "üchü", will match literally but will not match non-literal HTML renditions, like &uuml;ch&uuml; or &#252;ch&#252;. The various Character Conversion settings will not be applied.
* Case insensitivity is done using the Perl /i regex modifier, which maps A-Z to a-z, but does not handle Latin conversions like Ü to ü.
-
The proxy viewer is a stand-alone script. FDSE settings like language or "Use Standard IO" or "Crawler: User Agent" do not apply to this script.
-
The proxy viewer may place a high CPU and bandwidth load on your web server.
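
The keyword-matching limitation in the list above can be reproduced with a short Python sketch. It is only an approximation of the described behavior: the `re.ASCII` flag is used here to imitate case folding that covers A-Z only, and the function name `literal_match` is hypothetical.

```python
import html
import re

def literal_match(keyword, text):
    """Case-insensitive literal match where only A-Z/a-z are case-folded."""
    flags = re.IGNORECASE | re.ASCII
    return re.search(re.escape(keyword), text, flags) is not None

print(literal_match("news", "Latest NEWS"))        # ASCII case differences match
print(literal_match("üchü", "üchü"))               # identical literal text matches
print(literal_match("üchü", "ÜCHÜ"))               # no folding of Ü to ü
print(literal_match("üchü", "&uuml;ch&uuml;"))     # entity form is different text
print(literal_match("üchü", html.unescape("&uuml;ch&uuml;")))  # decoded first
```

The last line shows what a character conversion step would have to do before matching: decode HTML entities to literal characters so that both sides of the comparison use the same representation.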
History: the proxy viewer was added in FDSE version 2.0.0.0050.
http://www.xav.com/scripts/search/help/1106.html