CGI and Perl

Generating Web Indexes

The ability to generate a thorough Web index is a very hot commodity these days. Companies are now building an entire business case around the ability to provide Web users with the best search engine. These search engines are made possible using programs such as the one you'll see in the following example. Crawling through the Web to find all of the useful (as well as useless) pages is only one aspect of a good search engine. You then need to be able to categorize and index everything you find into an efficient, searchable body of data. Our example will focus only on the former task: crawling.

As you can imagine, this kind of program could go on forever, so consider limiting the search to some reasonable depth. It is also important to abide by the accepted rules for robots: honor each site's Robot Exclusion Protocol (its robots.txt file), identify yourself with the User-Agent field, and notify the sites that you plan to target. This will keep your Web spider friendly to the rest of the Web community and will prevent you from being blacklisted. An automated robot can generate an extremely large number of hits on a given site, so please be sensitive to the sites from which you wish to obtain indexes.
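The LWP distribution includes LWP::RobotUA, a robot-aware user agent that handles much of this etiquette for you: it fetches each site's robots.txt file, refuses requests that the file forbids, and spaces out repeat visits to the same server. The following is only a minimal sketch; the agent name, contact address, and target URL are placeholders rather than values from the book's listings.

    #!/usr/local/bin/perl
    # A minimal, polite robot user agent that honors the Robot Exclusion
    # Protocol.  The agent name, contact address, and target URL below are
    # placeholders, not values from the book's listings.
    use LWP::RobotUA;
    use HTTP::Request;

    # Identify the robot and give site administrators a way to reach you.
    my $ua = LWP::RobotUA->new('FriendlySpider/1.0', 'webmaster@your.site');
    $ua->delay(1);   # wait at least one minute between requests to the same host

    my $request  = HTTP::Request->new(GET => 'http://www.example.com/');
    my $response = $ua->request($request);  # robots.txt is consulted automatically

    if ($response->is_success) {
        print $response->content;
    } else {
        # A 403 here may simply mean the site's robots.txt excludes this robot.
        print "Request failed: ", $response->status_line, "\n";
    }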

Web Robots (Spiders)

A Web robot is a program that silently visits Web sites, explores the links in each site, writes the URLs of the linked sites to disk, and continues in a recursive fashion until enough sites have been visited. Using the general-purpose URL retriever in Listing 9.4 and a few regular expressions, you can easily construct such a program.

There are several classes available in the LWP modules that provide an easy way to parse HTML files and obtain the elements of interest. The HTML::Parse module allows you to parse an entire HTML file into a tree of HTML::Element objects. These classes can be used by our Web robot to easily obtain the title and all of the hyperlinks of an HTML document. You will first call the parse_htmlfile function in HTML::Parse to obtain a syntax tree of HTML::Element nodes. You can then use the extract_links method to enumerate all of the links, or the traverse method to walk through all of the tags. Our example uses the traverse method so that we can locate the <TITLE> tag if it exists. The only other tag we are interested in is the anchor element, the <A> tag.
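Here is a small sketch of that parsing step. A one-line sample document stands in for a page fetched with the retriever in Listing 9.4, and the variable names are purely illustrative.

    #!/usr/local/bin/perl
    # Sketch: parse an HTML document into a tree of HTML::Element nodes,
    # pull out the <TITLE> text, and list every hyperlink.  The sample
    # document stands in for a page retrieved as in Listing 9.4.
    use HTML::Parse;

    my $html = '<html><head><title>Example</title></head>' .
               '<body><a href="a.html">A</a></body></html>';
    my $tree = parse_html($html);      # returns the root HTML::Element

    # Walk the tree looking for a <TITLE> element.
    my $title = '';
    $tree->traverse(
        sub {
            my ($node, $start, $depth) = @_;
            if (ref($node) && $start && $node->tag eq 'title') {
                $title = $node->as_text;   # grab the title's text content
            }
            return 1;                      # keep traversing
        },
        1                                  # ignore plain text nodes
    );
    print "Title: $title\n";

    # extract_links() returns a reference to a list of [url, element] pairs.
    foreach my $link (@{ $tree->extract_links('a') }) {
        my ($url) = @$link;
        print "Found link: $url\n";
    }

    $tree->delete;                         # free the parse tree when finished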

You can also make use of the URI::URL module to determine which components of the URL are specified. This is useful for determining whether the URL is relative or absolute.
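As a quick illustration (the base URL and sample links here are made up), a relative URL has no scheme component, and the abs() method resolves it against its base:

    #!/usr/local/bin/perl
    # Sketch: use URI::URL to tell relative links from absolute ones and to
    # resolve relative links against the page they came from.  The base URL
    # and the sample links are made-up values.
    use URI::URL;

    my $base = 'http://www.example.com/docs/index.html';

    foreach my $href ('images/logo.gif', 'http://www.perl.com/') {
        my $url = url($href, $base);
        if (defined $url->scheme) {
            print "$href is already absolute\n";
        } else {
            # abs() resolves the link against the base URL supplied above
            print "$href is relative; absolute form is ", $url->abs, "\n";
        }
    }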

Let's take a look at the crawlIt() function, which retrieves the URL, parses it, and traverses the elements looking for links and the title. Listing 9.5 should look familiar--it's yet another way to reuse the code you've seen twice already.