CGI and Perl

Listing 9.6. Converting a relative URL to an absolute URL.

sub getAbsoluteURL {
   # Resolve $current (which may be relative) against the parent
   # document's URL; non-http links yield "".
   my($current, $parent) = @_;
   my($absURL) = "";
   $pURL = new URI::URL $parent;
   $cURL = new URI::URL $current, $pURL;   # second argument is the base URL
   my($abs) = $cURL->abs($pURL);           # no-op if $current is already absolute
   if ($abs->scheme() eq 'http') {
      $absURL = $abs->as_string();
   }
   return $absURL;
}
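
The heart of the conversion is URI::URL's abs() method. As a quick standalone check of that resolution step (the URLs here are made-up examples, not from the chapter):

```perl
#!/usr/bin/perl
use URI::URL;

# Resolve a relative link against the URL of the document it appeared in.
my $parent = new URI::URL 'http://www.example.com/docs/index.html';
my $rel    = new URI::URL 'images/logo.gif';
print $rel->abs($parent)->as_string(), "\n";
# -> http://www.example.com/docs/images/logo.gif
```

Calling abs() on a URL that is already absolute simply returns that URL, which is why the function above does not need separate branches for the two cases.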

The only remaining function besides the main program is writeToLog(). This function is straightforward: open the log file and write a line each for the title and the URL. For simplicity, writing each on its own line avoids having to parse anything during lookup. All titles will be on odd-numbered lines, and each URL will be on the even-numbered line immediately following its title. If a document has no title, a blank line will appear where the title would have been. Listing 9.7 shows the writeToLog() function.

Listing 9.7. Writing the title and URL to the log file.

sub writeToLog {
    if (open(OUT, ">> $logFile")) {
       print OUT "$title\n";
       print OUT "$url\n";
       close(OUT);
    } else {
       warn("Could not open $logFile for append! $!\n");
    }
}
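
Because titles and URLs simply alternate, looking entries back up is just a matter of reading the log two lines at a time. A minimal sketch of that lookup (the file name and sample entry are illustrative, not from the chapter):

```perl
#!/usr/bin/perl
# Titles sit on odd-numbered lines, URLs on the even-numbered lines
# that follow; the file name here is illustrative.
$logFile = "/tmp/search.log";

# Seed the log with one title/URL pair, in the same format writeToLog() uses.
open(OUT, "> $logFile") or die "Cannot open $logFile: $!\n";
print OUT "Example Title\n";
print OUT "http://www.example.com/\n";
close(OUT);

# Lookup: consume the log two lines at a time.
open(LOG, "< $logFile") or die "Cannot open $logFile: $!\n";
while (defined(my $title = <LOG>)) {
    my $url = <LOG>;
    last unless defined $url;
    chomp($title, $url);
    print "$title -> $url\n";
}
close(LOG);
```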

Now you can put this all together in the main program. The program will accept multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive calls. Listing 9.8 shows the code for specifying these criteria.

Listing 9.8. Specifying the starting points and stopping points.

 use URI::URL;
 use LWP::UserAgent;
 use HTTP::Request;
 use HTML::Parse;
 use HTML::Element;
 my($ua) = new LWP::UserAgent;
 if (defined($ENV{'HTTP_PROXY'})) {
    $ua->proxy('http', $ENV{'HTTP_PROXY'});   # honor a proxy if one is set
 }
 $logFile = "search.log";   # log file written by writeToLog(); name is illustrative
 $maxDepth = 20;            # stop after a maximum of 20 recursive calls
 foreach $url (@ARGV) {
    # start the recursive search from each URL given on the command line
 }
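
The depth limit itself is enforced inside the recursive routine. As a hypothetical sketch (the function name traverse() and its body are illustrative only, not the chapter's actual retrieval code):

```perl
#!/usr/bin/perl
# Illustrative sketch of how a depth counter bounds the recursion.
$maxDepth = 20;

sub traverse {
    my($url, $depth) = @_;
    return if $depth > $maxDepth;   # stopping point: give up past 20 levels
    print "visiting $url at depth $depth\n";
    # ... fetch $url, extract its links, and for each link found:
    # traverse($link, $depth + 1);
}

traverse("http://www.example.com/", 0);
```

Each starting URL begins at depth 0, and every recursive call passes the incremented depth, so no branch of the search can run away indefinitely.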

There is another module available called RobotRules that makes it easier to abide by the Standard for Robot Exclusion. This module parses a file called robots.txt on the remote server to find out whether robots are allowed at the site. For more information on the Standard for Robot Exclusion, refer to
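
In current libwww-perl distributions the module is named WWW::RobotRules. A hedged sketch of its interface, using a made-up robots.txt (in practice, the LWP::RobotUA user agent can fetch and honor robots.txt for you automatically):

```perl
#!/usr/bin/perl
use WWW::RobotRules;

# The agent name and robots.txt content below are illustrative.
my $rules = WWW::RobotRules->new('MyRobot/1.0');

my $robots_txt = <<'EOT';
User-agent: *
Disallow: /private/
EOT

# parse() takes the URL the rules were retrieved from, plus their text.
$rules->parse('http://www.example.com/robots.txt', $robots_txt);

print "allowed\n" if  $rules->allowed('http://www.example.com/index.html');
print "denied\n" unless $rules->allowed('http://www.example.com/private/x.html');
```

Checking allowed() before each recursive fetch keeps the robot out of areas the site administrator has placed off limits.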