CGI and Perl

Parsing, Converting, Editing, and Verifying HTML with Perl

One of the more important but less well-documented duties of the Webmaster is updating and verifying the HTML in the archive's documents. Aside from the need for revision control, which we've already mentioned, how does one actually go about making changes, potentially en masse, to the archive's HTML documents? Once the changes have been made, how does the Webmaster verify that they haven't affected any other component of the archive? Fortunately, text management is one of the great strengths of Perl, and there are a number of modules and tools for accomplishing this task.

General Parsing Issues

Parsing an HTML document involves several distinct tasks. First, you must be able to recognize, and possibly act on, each element of the HTML specification as it appears in the input stream, on the fly. Usually you'll want to find the URLs or anchors in a document, but even this turns out to be non-trivial when you attempt to match a URL with a single regular expression. Even the newer Perl5 regular-expression extensions don't completely solve the problem, partly because determining a URL's validity depends on whether the URL is complete, partial, or relative. Fortunately, there is a Perl5 module devoted specifically to parsing HTML and URLs.
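To see why a single pattern falls short, consider this hypothetical first attempt at extracting an anchor (the pattern and variable names here are ours, purely for illustration):

# A naive pattern: match double-quoted, absolute http URLs in A tags.
# This misses relative links ("../docs/foo.html"), single-quoted or
# unquoted attribute values, other schemes (ftp:, mailto:), extra
# attributes appearing before HREF, and tags split across lines.
my ($href) = $html =~ m{<a\s+href="(http://[^"]+)"}i;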

As it turns out, the best way to parse a given URL and determine its validity (though not necessarily its retrievability or existence) is via a method chosen dynamically, from a table or set, based on the URL's protocol specifier. This sort of runtime decision making is exactly how the URI::URL module works. Using it saves you a lot of guesswork, testing, and debugging, and spares you from having to create potentially mind-boggling regular expressions to match the various types of URLs that exist.
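As a minimal sketch of this dispatch (the URL here is just an example), URI::URL inspects the scheme, selects the matching implementation class behind the scenes, and then offers you scheme-specific accessors:

use URI::URL;

# url() examines the protocol specifier (http, ftp, mailto, ...) and
# blesses the object into the corresponding implementation class.
my $url = url('http://www.perl.com/CPAN/index.html');
print $url->scheme, "\n";   # http
print $url->host, "\n";     # www.perl.com
print $url->path, "\n";     # /CPAN/index.html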

When parsing HTML to find the embedded URLs, you'll also need the HTML::TreeBuilder module. It takes care of the gory details of parsing the other elements within an HTML document and builds an internal tree object representing all the HTML elements in the file. These modules are part of the Perl Web toolkit, libwww-perl. The complete suite of libwww-perl modules includes the URI, HTML, HTTP, WWW, and Font classes. Libwww-perl is written and maintained by Gisle Aas of Norway. The latest version is always available from his CPAN directory:

~authors/id/GAAS/
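As a minimal sketch of the tree-building step (the markup string here is invented for illustration), you can feed HTML::TreeBuilder a document and inspect the resulting element tree:

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><head><title>Hi</title></head>' .
             '<body><a href="a.html">A link</a></body></html>');
$tree->eof;     # signal the end of the document
$tree->dump;    # print an indented outline of the element tree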

Listing 14.1 demonstrates how to use these modules to extract simple URLs from an HTML file.

Listing 14.1. simpleparse.

use URI::URL;
use HTML::TreeBuilder;

my ($h, $base, $url);
$base = "test.html";
$h = HTML::TreeBuilder->new;
$h->parse_file($base);

# extract_links() returns a reference to an array of [link, element]
# pairs for the requested tag types (here, A and IMG tags).
foreach my $pair (@{ $h->extract_links(qw(a img)) }) {
    my ($link, $elem) = @$pair;
    $url = url($link, $base);    # interpret the link relative to $base
    print $url->abs, "\n";       # print its absolute form
}

This short script prints all the links in the file test.html that occur within A or IMG tags. If you want to parse a file returned directly from a server, use the parse method instead of the parse_file method. You'll also need the capability to slurp an HTML file directly from the server, which you get with the declaration

use LWP::Simple qw(get);

Now the script looks like that in Listing 14.2.