How to force or override DNS lookups
Forcing or overriding DNS responses is intended to work around the following problems:
-
Some web servers have been set up without valid DNS servers. On these servers, the FDSE crawler cannot crawl any web site by name, such as "http://www.yahoo.com/", though it can crawl by IP address, such as "http://64.58.76.223/". When you try to crawl by name, it returns the error "unable to resolve www.yahoo.com to an IP address - resource unavailable".
These servers may be misconfigured due to incompetence or because the system administrators want to make it more difficult for Perl CGI scripts to initiate network traffic. The best approach is to ask your system administrator to configure the server properly and to confirm that CGI scripts are allowed to initiate network traffic. Use this workaround only when you are sure that you are allowed to use the network but your system administrator is not otherwise helpful.
-
A small minority of web sites use complex distributed architectures for caching and load balancing (typically only the busiest sites do this). Many of these systems configure DNS so that the site's public name points to a reverse proxy on the public network, while the true web site resides on a private network behind that proxy. In such systems, a CGI script running on the private network sometimes cannot resolve or request the public address.
In these cases you may need to override DNS to point to the private network address or to 127.0.0.1.
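Before applying an override, it can help to confirm that name resolution is really what fails on the server. The following is a minimal diagnostic sketch, not part of FDSE: the subroutine name check_dns and its optional resolver argument are hypothetical, and by default it relies only on Perl's built-in gethostbyname.

```perl
use strict;
use warnings;

# Hypothetical diagnostic, not FDSE code: report whether this server can
# resolve a host name. The resolver is a parameter so the check can be
# exercised without network access; by default it calls gethostbyname.
sub check_dns {
    my ($host, $resolver) = @_;
    $resolver ||= sub { scalar gethostbyname($_[0]) };
    my $addr = $resolver->($host);
    # A successful IPv4 lookup yields a 4-byte packed address.
    return defined($addr) && length($addr) == 4
        ? "$host resolves to " . join('.', unpack('C4', $addr))
        : "unable to resolve $host";
}

# Run as: perl check_dns.pl www.yahoo.com
print check_dns($ARGV[0]), "\n" if @ARGV;
```

If the script reports that the name cannot be resolved while the same request by IP address works, the server matches the first problem described above.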
FDSE caches all DNS lookups for the lifetime of the FDSE process. Each result is stored in the network client hash under the key "H:" followed by the host name, with the address as four packed bytes:
$$p_nc_cache{"H:www.yahoo.com"} = pack('C4',split(m!\.!,'64.58.76.223'));
# means www.yahoo.com == 64.58.76.223
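To make the cache format concrete, here is a small standalone sketch that writes one entry and reads it back. The hash reference below is only a stand-in for FDSE's network client hash, and the comparison with inet_aton from the core Socket module is added for illustration; it is not something FDSE itself is known to do.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Stand-in for the network client cache (an assumption for this sketch).
my $p_nc_cache = {};

# Store an entry in the same form FDSE uses: four packed octets.
$$p_nc_cache{"H:www.yahoo.com"} = pack('C4', split(m!\.!, '64.58.76.223'));

# For a dotted-quad string, inet_aton produces the same four bytes.
my $same = $$p_nc_cache{"H:www.yahoo.com"} eq inet_aton('64.58.76.223');

# Decode the entry back to dotted-quad form for inspection.
my $dotted = join('.', unpack('C4', $$p_nc_cache{"H:www.yahoo.com"}));

print "matches inet_aton: ", ($same ? "yes" : "no"), "\n";
print "cached address: $dotted\n";
```

Decoding with unpack in this way is a convenient method for checking that an override holds the address you intended.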
All DNS queries are isolated in the subroutine leansock, which is defined in the library "common_admin.pl". A query is made with gethostbyname if and only if the cache has no "H:host" entry for that host. So, to override or force a DNS response, change the first few lines of leansock from:
sub leansock {
    my ($host,$port,$p_socket,$p_nc_cache) = @_;
    my $err = '';
to:
sub leansock {
    my ($host,$port,$p_socket,$p_nc_cache) = @_;
    $$p_nc_cache{"H:www.xav.com"} = pack('C4',split(m!\.!,'209.68.17.186'));
    $$p_nc_cache{"H:www.whitehouse.gov"} = pack('C4',split(m!\.!,'166.90.133.200'));
    my $err = '';
Enter as many overrides as needed, one per line, before the initialization of the $err string. Lookups for each listed host name will now resolve to your custom IP address; other names are still resolved through gethostbyname.
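The reason a pre-seeded entry wins can be seen in a cache-first lookup. The sketch below is not the actual leansock code: resolve_cached is a hypothetical helper that only mirrors the rule described above, namely that gethostbyname is called only when no "H:host" entry exists.

```perl
use strict;
use warnings;

# Hypothetical helper, not FDSE code: return the packed address for $host,
# consulting the cache first and calling gethostbyname only on a miss.
sub resolve_cached {
    my ($host, $p_nc_cache) = @_;
    my $key = "H:$host";
    unless (exists $$p_nc_cache{$key}) {
        # In scalar context gethostbyname returns the packed address,
        # or undef when the name cannot be resolved.
        my $addr = gethostbyname($host);
        return unless defined $addr;
        $$p_nc_cache{$key} = $addr;
    }
    return $$p_nc_cache{$key};
}

my %cache;
# Override: this entry pre-empts any real DNS lookup for www.xav.com.
$cache{"H:www.xav.com"} = pack('C4', split(m!\.!, '209.68.17.186'));

my $addr     = resolve_cached('www.xav.com', \%cache);
my $resolved = join('.', unpack('C4', $addr));
print "www.xav.com -> $resolved\n";
```

Because the entry already exists when the lookup runs, no DNS query is made for that host, which is why the override also works on servers with no usable DNS.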
This method is fragile: if a DNS name later starts to resolve to a different IP address, you will have to edit your source code again, and in the meantime you may see hard-to-debug errors. It is also impractical for FDSE instances that spider many different sites. Getting your system administrators to configure valid DNS entries and to run a normal architecture is a much better solution.
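One way to reduce the risk of hard-to-debug errors is to validate each address before it goes into the cache. The following is only a sketch under that assumption: add_dns_overrides is a hypothetical helper, not an FDSE subroutine, and the malformed entry is included on purpose to show the rejection path.

```perl
use strict;
use warnings;

# Hypothetical helper, not FDSE code: validate a table of host => IP
# overrides and install the valid ones. Returns the rejected host names.
sub add_dns_overrides {
    my ($p_nc_cache, %overrides) = @_;
    my @rejected;
    foreach my $host (sort keys %overrides) {
        my @octets = split(m!\.!, $overrides{$host});
        # Accept only four decimal octets in the range 0..255.
        my $valid = (@octets == 4)
            && !grep { !/^\d{1,3}\z/ || $_ > 255 } @octets;
        if ($valid) {
            $$p_nc_cache{"H:$host"} = pack('C4', @octets);
        } else {
            push @rejected, $host;
        }
    }
    return @rejected;
}

my %cache;
my @bad = add_dns_overrides(\%cache,
    'www.xav.com'        => '209.68.17.186',
    'www.whitehouse.gov' => '166.90.133.200',
    'typo.example'       => '209.68.17',      # deliberately malformed
);
print "rejected: @bad\n" if @bad;
```

Inside leansock, a call like this could take the place of the individual assignment lines, passing $p_nc_cache as the first argument. Keeping the table in one place also makes it easier to update when an address changes.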
* In these cases you may also want to build your index files on a remote system where DNS works normally, and then FTP the completed index files to your main search server. See "Building index files remotely".
** In these cases, you might also be able to use "file system discovery" realms to bypass the network issues altogether. From the Admin Page, go to "Manage Realms" and then "Create New Realm". For web sites hosted on the same physical web server as FDSE, you can create a realm using File System Discovery by entering the folder that holds the web site files on the Create New Realm page.
History: the leansock code sample above works with FDSE version 2.0.0.0036 and later. All versions of FDSE have used a global hash to cache DNS lookups, but in earlier versions the caching code was located in different places. In those earlier versions, searching the source for "gethostbyname" should take you directly to the lookup code.
"How to force or override DNS lookups"
http://www.xav.com/scripts/search/help/1107.html