CGI and Perl

Listing 9.6. Converting a relative URL to an absolute URL.

sub getAbsoluteURL {
   my($parent, $current) = @_;
   my($absURL) = "";
   $pURL = new URI::URL $parent;
   $cURL = new URI::URL $current;
   if (!defined($cURL->scheme()) || $cURL->scheme() eq 'http') {
      if ($cURL->host() eq "") {
         $absURL = $cURL->abs($pURL);   # relative link: resolve it against the parent URL
      } else {
         $absURL = $current;            # already a complete http URL
      }
   }
   return $absURL;
}

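For example, a relative link found in a document resolves against that document's own URL. The host and file names below are made up purely for illustration:

$abs = getAbsoluteURL("http://www.host.com/docs/index.html", "images/pic.gif");
print "$abs\n";    # prints http://www.host.com/docs/images/pic.gif

A link that is already absolute, such as http://www.other.com/page.html, is returned unchanged, and links with other schemes (mailto:, ftp:) come back empty so the robot can skip them.
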
The only remaining function besides the main program is writeToLog(). This is a very straightforward function: open the log file and append the title and the URL. For simplicity, write each on its own line, which avoids having to parse anything during lookup. All titles will be on odd-numbered lines, and each URL will be on the even-numbered line immediately following its title. If a document has no title, a blank line appears where the title would have been. Listing 9.7 shows the writeToLog() function, and a sketch of reading the log back in this format follows it.

Listing 9.7. Writing the title and URL to the log file.

sub writeToLog {
    if (open(OUT, ">> $logFile")) {
       print OUT "$title\n";
       print OUT "$url\n";
       close(OUT);
    } else {
       warn("Could not open $logFile for append! $!\n");
    }
}

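Reading the log back is then just a matter of consuming lines in pairs. A minimal sketch of such a lookup, assuming the same $logFile, might look like this:

if (open(IN, "< $logFile")) {
   while (defined(my $title = <IN>)) {   # odd-numbered lines hold the title (possibly blank)
      my $url = <IN>;                    # the even-numbered line that follows holds the URL
      last unless defined($url);
      chomp($title, $url);
      print "$title -> $url\n";
   }
   close(IN);
}
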
Now you can put this all together in the main program. The program will accept multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive calls. Listing 9.8 shows the code for specifying these criteria.

Listing 9.8. Specifying the starting points and stopping points.

 use URI::URL;
 use LWP::UserAgent;
 use HTTP::Request;
 use HTML::Parse;
 use HTML::Element;
 my($ua) = new LWP::UserAgent;
 $maxDepth = 20;                             # stopping point: at most 20 levels of recursion
 if (defined($ENV{'HTTP_PROXY'})) {
    $ua->proxy('http', $ENV{'HTTP_PROXY'});  # route requests through the proxy if one is set
 }
 foreach $url (@ARGV) {
    # each command-line argument is a starting point for the traversal
 }
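
Inside the foreach loop, each starting URL would be handed to the recursive retrieval routine. That routine is not part of this excerpt, so the following is only a sketch of how the pieces shown so far might fit together; the name traverse() is an assumption, and bookkeeping for already-visited pages is omitted:

sub traverse {
   my($url, $depth) = @_;
   return if ($depth >= $maxDepth);              # stop after $maxDepth levels (Listing 9.8 sets this to 20)
   my $request  = new HTTP::Request('GET', $url);
   my $response = $ua->request($request);
   return unless ($response->is_success());
   # ... this is where writeToLog() from Listing 9.7 would record the title and URL ...
   my $html = parse_html($response->content());  # parse_html() is exported by HTML::Parse
   foreach my $pair (@{ $html->extract_links('a') }) {
      my($link) = @$pair;
      my $absURL = getAbsoluteURL($url, $link);  # Listing 9.6
      traverse($absURL, $depth + 1) if ($absURL);
   }
}

Each starting URL from @ARGV would then be passed in as traverse($url, 0). A production robot would also keep a hash of URLs it has already seen, so that pages linking to one another do not send it into an endless loop.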

There is another module available, called RobotRules, that makes it easier for you to abide by the Standard for Robot Exclusion. This module parses a file called robots.txt at the top level of the remote site to find out whether robots are allowed to visit it. For more information, refer to the Standard for Robot Exclusion document itself.
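
The module ships with libwww-perl as WWW::RobotRules. The following is only a sketch of how it might be used before traversing a site; the agent name and the host are made up for illustration:

use WWW::RobotRules;
use LWP::Simple qw(get);

my $rules = new WWW::RobotRules 'MySpider/1.0';    # the robot's own name (assumed)
my $robotsURL = "http://www.host.com/robots.txt";  # hypothetical site
my $robotsTxt = get($robotsURL);
$rules->parse($robotsURL, $robotsTxt) if (defined $robotsTxt);

if ($rules->allowed("http://www.host.com/docs/index.html")) {
   # this page may be retrieved and traversed
}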
