
Crawling password-protected web pages that return 401 Auth Required

Background

Some web pages are password-protected at the HTTP level (for example, www.xav.com/scripts/help/auth_test). The web server will return "401 Authorization Required" if these web pages are requested without a valid username and password.

By default, the FDSE web crawler does not supply usernames or passwords, and so all web requests for these pages will fail with "error: server responded with '401 Authorization Required' instead of '200 OK'."

How can I index web pages that are protected?

Both of these work-arounds involve the following risks which you should understand before attempting:

Custom Coding Work-Around

I have chosen not to add HTTP authentication support directly into the FDSE code because 1) FDSE is primarily a public search engine and 2) of the security issues listed in this file.

If you absolutely must get this to work, here is the custom coding workaround. Edit library file "common_admin.pl" and find subroutine raw_get, where the default headers are declared. Find the existing line of code that looks like this:

$Request .= "User-Agent: $::Rules{'crawler: user agent'}\015\012";

Insert the following custom code below that line, where "username" and "password" have been customized for your needs:

$Request .= "User-Agent: $::Rules{'crawler: user agent'}\015\012";
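# The empty second argument stops encode_base64 from appending a newline,
# which would otherwise end the header block early.
$Request .= "Authorization: Basic " .
	&encode_base64( "username:password", "" ) . "\015\012";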

(Note that in older versions, $::Rules is written as $Rules. Keep using whatever name is used by the code you are editing.)

You will also need to paste this subroutine into the common_admin.pl library (taken from MIME::Base64):

sub encode_base64 {
	my $res = "";
	my $eol = $_[1];
	$eol = "\n" unless defined $eol;
	pos($_[0]) = 0;                          # ensure start at the beginning
	while ($_[0] =~ /(.{1,45})/gs) {
		$res .= substr(pack('u', $1), 1);
		chop($res);
		}
	$res =~ tr|` -_|AA-Za-z0-9+/|;               # `# help emacs
	# fix padding at the end
	my $padding = (3 - length($_[0]) % 3) % 3;
	$res =~ s/.{$padding}$/'=' x $padding/e if $padding;
	# break encoded string into lines of no more than 76 characters each
	if (length $eol) {
		$res =~ s/(.{1,76})/$1$eol/g;
		}
	$res;
	}
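
If your Perl installation includes the MIME::Base64 module (it ships with Perl 5.8 and later), you can check the header value from the command line before editing the library. The following is a minimal sketch and is not part of FDSE; "username:password" is the same placeholder used above. It also shows that encode_base64 appends a newline unless an empty string is passed as its second argument.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Placeholder credentials; substitute your own.
my $credentials = "username:password";

# Default call: the encoded text ends with "\n".
my $with_newline = encode_base64($credentials);

# Empty second argument: no line breaks are added.
my $token = encode_base64($credentials, "");

# The header line as it would be appended to the request.
my $header = "Authorization: Basic $token\015\012";

print "Token: $token\n";    # prints "Token: dXNlcm5hbWU6cGFzc3dvcmQ="
print "Default call ends with newline: ",
    ($with_newline =~ /\n\z/ ? "yes" : "no"), "\n";
```

The pasted subroutine should produce the same token for the same input, so this is a quick way to confirm the paste was not damaged.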

This solution has the same risks as the work-arounds above, and these additional risks and limitations:

  1. Sends the same Authorization header to all sites. The Authorization header carries your username and password in base64, which is an encoding and not encryption, so any server that receives the header can read them. Making the above customization for one site and then spidering a second site will expose the username and password entered above to that second site.

    Ideally you would only make this code change temporarily while you were indexing the given site, and would then comment out the changes.

  2. This code change is not supported. I may provide paid support at my usual rate but I won't provide the free support that comes with FDSE itself.

  3. Works only for Basic authentication, not fun stuff like NTLM or Digest.
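
One way to reduce the first risk is to send the header only to hosts you have listed. The following is a rough sketch under stated assumptions, not FDSE code: the %auth_tokens table and the auth_header_for subroutine are hypothetical names, and $host stands for whatever variable holds the target host name in your copy of raw_get. The token is the base64 form of the "username:password" placeholder.

```perl
use strict;
use warnings;

# Hypothetical table of hosts that should receive credentials.
# Each value is an already-encoded "username:password" string.
my %auth_tokens = (
    'www.xav.com' => 'dXNlcm5hbWU6cGFzc3dvcmQ=',
);

# Returns the Authorization header line for $host, or an empty
# string when the host is not listed (so nothing is sent).
sub auth_header_for {
    my ($host) = @_;
    return '' unless defined $host;
    my $token = $auth_tokens{ lc $host };    # host names are case-insensitive
    return '' unless defined $token;
    return "Authorization: Basic $token\015\012";
}

# Inside raw_get this might replace the unconditional header line:
#   $Request .= auth_header_for($host);   # $host is an assumed variable name
```

This does not make the change supported, and the password still sits in the library file, so commenting the code out after indexing remains the safer habit.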


    "Crawling password-protected web pages that return 401 Auth Required"
    http://www.xav.com/scripts/search/help/1102.html