Crawling password-protected web pages that return 401 Auth Required
Background
Some web pages are password-protected at the HTTP level (for example, www.xav.com/scripts/help/auth_test). The web server will return "401 Authorization Required" if these web pages are requested without a valid username and password.
By default, the FDSE web crawler does not supply usernames or passwords, and so all web requests for these pages will fail with "error: server responded with '401 Authorization Required' instead of '200 OK'."
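A 401 response normally includes a WWW-Authenticate header naming the authentication scheme the server expects (Basic, Digest, NTLM and so on). The sketch below is only an illustration of how to read that header from raw response text. The subroutine name offered_auth_schemes and the sample response are made up for this example and are not part of FDSE.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the authentication schemes named in the
# WWW-Authenticate lines of a raw HTTP response header block.
sub offered_auth_schemes {
    my ($raw_headers) = @_;
    my @schemes;
    for my $line (split /\015?\012/, $raw_headers) {
        # The scheme is the first token after the header name.
        if ($line =~ /^WWW-Authenticate:\s*([A-Za-z0-9_-]+)/i) {
            push @schemes, $1;
        }
    }
    return @schemes;
}

# Sample response text, invented for illustration.
my $sample = "HTTP/1.1 401 Authorization Required\015\012"
           . "WWW-Authenticate: Basic realm=\"Members Only\"\015\012"
           . "Content-Type: text/html\015\012\015\012";

my @schemes = offered_auth_schemes($sample);
print "Schemes offered: @schemes\n";
```

This sketch reads only the first scheme on each WWW-Authenticate line. A server may list several challenges on one line, which it does not attempt to split.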
How can I index web pages that are protected?
Temporarily disable password protection. Run the web crawler to build the index. Then enable password protection again.
Or, if the web pages are on the same server as FDSE, you can use the file system crawler. Go to "Admin Page" => "Manage Realms" => "Create New Realm" and create a realm that uses the file system crawler. Enter the folder that maps to the web site, then save and build the index. The file system crawler accesses documents directly, without going over the web, and so it does not need to know the username and password.
Both of these work-arounds carry the following risks, which you should understand before attempting them:
Data which has been secured on the protected web pages is now available in your index file and via your search results. If your web pages are password-protected, but your search engine is not, then your visitors will be able to see titles, descriptions, and fragments of the secure pages without logging in. In some cases, the FDSE index files themselves are available over the web, in which case all data would be visible without logging in.
The web pages listed in the search results require valid credentials to access. If your web pages are password-protected, but your search engine is not, then your visitors will be able to see the results listings but may not be able to follow the links, leading to frustration.
The proxy.pl tool will not necessarily work with those search results (see Advanced Search: Highlighting search terms in the actual document).
Custom Coding Work-Around
I have chosen not to add HTTP authentication support directly into the FDSE code because 1) FDSE is primarily a public search engine and 2) of the security issues listed in this file.
If you absolutely must get this to work, here is the custom coding workaround. Edit library file "common_admin.pl" and find subroutine raw_get, where the default headers are declared. Find the existing line of code that looks like this:
$Request .= "User-Agent: $::Rules{'crawler: user agent'}\015\012";
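Insert the following custom code below that line, where "username" and "password" have been customized for your needs. The existing User-Agent line is repeated here for context. The empty string passed as the second argument stops encode_base64 from appending a newline to the encoded text, which would otherwise put a stray line break inside the request headers:
$Request .= "User-Agent: $::Rules{'crawler: user agent'}\015\012";
$Request .= "Authorization: Basic " .
    &encode_base64( "username:password", "" ) . "\015\012";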
(Note that in older versions, $::Rules is written as $Rules. Keep using whatever name is used by the code you are editing.)
You will also need to paste this subroutine into the common_admin.pl library (taken from MIME::Base64):
sub encode_base64 {
    my $res = "";
    my $eol = $_[1];
    $eol = "\n" unless defined $eol;
    pos($_[0]) = 0; # ensure start at the beginning
    while ($_[0] =~ /(.{1,45})/gs) {
        $res .= substr(pack('u', $1), 1);
        chop($res);
    }
    $res =~ tr|` -_|AA-Za-z0-9+/|; # `# help emacs
    # fix padding at the end
    my $padding = (3 - length($_[0]) % 3) % 3;
    $res =~ s/.{$padding}$/'=' x $padding/e if $padding;
    # break encoded string into lines of no more than 76 characters each
    if (length $eol) {
        $res =~ s/(.{1,76})/$1$eol/g;
    }
    $res;
}
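Before relying on the header, it can help to confirm what the encoder returns. The sketch below uses the MIME::Base64 module bundled with Perl rather than the pasted copy, and the placeholder credentials "username:password". It shows that a call with no second argument appends a newline to the result, while passing an empty string as the second argument returns the bare token, which is the form a single header line needs.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Placeholder credentials; substitute your own.
my $credentials = "username:password";

# Default call: a trailing "\n" is appended to the encoded text.
my $with_newline = encode_base64($credentials);

# Empty second argument: no line breaks are added.
my $token = encode_base64($credentials, "");

# The header line as it would be appended to the request.
my $header = "Authorization: Basic $token\015\012";

print $header;
```

For these placeholder credentials the token is dXNlcm5hbWU6cGFzc3dvcmQ=, which any Base64 decoder turns straight back into username:password.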
This solution has the same risks as the work-arounds above, and these additional risks and limitations:
- Sends the same Authorization header to all sites. The Authorization header carries your username and password in Base64 encoding, which anyone can decode, so it is effectively clear text. Making the above customization for one site and then spidering a second site will expose the username and password entered above to that second site.
  Ideally you would make this code change only temporarily, while indexing the given site, and then comment it out.
- This code change is not supported. I may provide paid support at my usual rate, but I won't provide the free support that comes with FDSE itself.
- Works only for Basic authentication, not fun stuff like NTLM or Digest.
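One way to reduce the first risk is to send the header only when the request goes to a host you have explicitly listed. The following is a minimal sketch of that idea, not part of FDSE. The subroutine name auth_header_for_host, the %credentials_for table, and the host www.example.com are all hypothetical, and it assumes the host name of the current request is available inside raw_get under some variable name. It uses the MIME::Base64 module bundled with Perl.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical table of credentials, keyed by lower-case host name.
my %credentials_for = (
    'www.example.com' => 'username:password',
);

# Returns an Authorization header line for $host, or an empty string
# when no credentials are configured for that host.
sub auth_header_for_host {
    my ($host) = @_;
    return '' unless defined $host;
    $host = lc $host;
    $host =~ s/:\d+\z//;    # drop an explicit port such as ":8080"
    my $pair = $credentials_for{$host};
    return '' unless defined $pair;
    return "Authorization: Basic " . encode_base64($pair, "") . "\015\012";
}

# In raw_get the call might look like this, assuming the host name is
# held in a variable (called $host here; the real name may differ):
#   $Request .= auth_header_for_host($host);
print auth_header_for_host('www.example.com');
print auth_header_for_host('other.example.net');    # prints nothing
```

With this approach the credentials still travel in decodable form to the listed host, so the other risks above continue to apply.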
"Crawling password-protected web pages that return 401 Auth Required"
http://www.xav.com/scripts/search/help/1102.html