CGI and Perl

Parsing HTTP Logfiles

As Webmaster, you may be called upon, from time to time, to provide a report of the usage of your Web pages. There may be several reasons for this, not the least of which may be to justify your existence :-). More likely, though, the need will be to get a feel for the general usage model of your Web site or what types of errors are occurring.

Most of the available httpd servers provide an access log by default, along with some sort of error log. Each of these logs has its own record format, but they share a number of common fields, which lends itself naturally to an object-oriented model for parsing them and producing reports.

We'll be looking at the Logfile module, written by Ulrich Pfeifer, in this section. It lets you subclass the base record object and has subclass modules available for a number of servers' log files, including NCSA httpd, Apache httpd, CERN httpd, WUFTP, and others. If there isn't a subclass for your particular server, it's pretty easy to write one.

General Issues

An HTTP server implements its logging according to configuration settings, usually within the httpd.conf file. The data you have to analyze depends on which log files you enable in the configuration file or, in the case of the Apache server, at compile time in the server's source. Several logs can be enabled in the configuration, including the access log, error log, referer log, and agent log. Each of these has information that you may need to summarize or analyze.

Logging Connections

There are some security and privacy issues related to logging too much information. Be sure to keep the appropriate permissions on your logfiles to prevent arbitrary snooping or parsing, and truncate them when you've completed the data gathering. See Chapter 3 for more details.

In general, the httpd log file is a text file whose records are lines, each ended by the line terminator appropriate to the architecture on which the server is running. The individual records have fields that are strings representing dates, file paths, hostnames or IP addresses, and other items, usually separated by whitespace. Ordinarily, there is one line or record per connection, but some types of transactions generate multiple lines in the log file(s). Take this into account when designing the algorithm and code that parses the log.
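To make the record layout concrete, here is a minimal sketch of splitting one access-log line into named fields with a regular expression. It assumes the Common Log Format; the sample line and the parse_clf_line() helper are hypothetical and are not part of the Logfile module, which does this work for you.

```perl
use strict;
use warnings;

# Hypothetical sample record in Common Log Format (not from a real log).
my $line = '192.0.2.10 - - [10/Oct/1996:13:55:36 -0700] '
         . '"GET /index.html HTTP/1.0" 200 2326';

# Hypothetical helper: split one access-log record into its fields.
# Returns a hash reference, or undef if the line does not match.
sub parse_clf_line {
    my ($record) = @_;
    my ($host, $ident, $user, $date, $request, $status, $bytes) =
        $record =~ m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)\s*$/
        or return;

    # The request field holds the method, the path, and the protocol.
    my ($method, $path) = split ' ', $request;

    return {
        Host   => $host,
        User   => $user,
        Date   => $date,
        Method => $method,
        File   => $path,
        Status => $status,
        Bytes  => ($bytes eq '-' ? 0 : $bytes),  # "-" means no body was sent
    };
}

my $rec = parse_clf_line($line);
printf "%s requested %s (%s bytes)\n", @{$rec}{qw(Host File Bytes)} if $rec;
```

A line that does not match the pattern yields undef, so a caller can skip or count malformed and continuation lines instead of aborting the whole run.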

The access log gives you general information regarding what site is connecting to your server and what files are being retrieved. The error log receives and records the output from the STDERR filehandle from all connections. Both of these, and especially the error log, may need to be parsed every now and then to see what's happening with your server's connections.

Parsing

With the Logfile module, each discrete transaction record is parsed and then abstracted to a Perl object, based on some parameter of the request. During the process of parsing the log file, the instance variables that are created with the new() method depend on which type of log is being parsed and which field (Hostname, Date, Path, and so on) from the log file you're interested in summarizing. When parsing is complete, the return value, a blessed reference to the Logfile class, has a hash whose keys are the values of the parameter on which you want to gather statistics and whose values are the number of times each one was counted. In the simplest case, you write these lines:

use Logfile::Apache;  # to parse the popular Apache server log
my $l = Logfile::Apache->new(File  => '/usr/local/etc/httpd/logs/access_log',
                             Group => [qw(Host Domain File)]);

This parses your access log and returns the blessed reference.

Reporting and Summaries

After you've invoked the new() method for the Logfile class and passed in your log file to be parsed, you can invoke the report() method on the returned object.

$l->report(Group => 'File', Sort => 'Records', Top => 10);

The preceding line produces a report detailing the access counts of each of the top ten files retrieved from your archive and their percentages of the total number of retrievals. For the sample Apache access log file included with the Logfile distribution, the results from the report() method look like this:

File                                       Records
=======================================
/mall/os                                       5               35.71%
/mall/web                                      3               21.43%
/~watkins                                      3               21.43%
/cgi-bin/mall                                  1                7.14%
/graphics/bos-area-map                         1                7.14%
/~rsalz                                        1                7.14%
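The arithmetic behind such a report is simple, and the following sketch shows one way it could be done in plain Perl. It is not the Logfile implementation: the top_report() helper and its tie-breaking rule are assumptions made for illustration, and the counts are copied from the sample report above.

```perl
use strict;
use warnings;

# Counts copied from the sample report above.
my %records = (
    '/mall/os'               => 5,
    '/mall/web'              => 3,
    '/~watkins'              => 3,
    '/cgi-bin/mall'          => 1,
    '/graphics/bos-area-map' => 1,
    '/~rsalz'                => 1,
);

# Hypothetical helper: return up to $top rows of [key, count, percent],
# sorted by descending count, with ties broken alphabetically by key.
# Percentages are relative to the total of all counts, not just the top rows.
sub top_report {
    my ($counts, $top) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    return () unless $total;

    my @keys = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
               keys %$counts;
    splice @keys, $top if @keys > $top;   # keep only the first $top keys

    return map { [ $_, $counts->{$_}, 100 * $counts->{$_} / $total ] } @keys;
}

printf "%-40s %7d %8.2f%%\n", @$_ for top_report(\%records, 10);
```

Because each percentage is taken against the grand total, a report limited to fewer rows than there are keys will not add up to 100%.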

You can generate many other reports with the Logfile module, including multiple-variable reports, to suit your needs and interests. See the Logfile documentation, embedded as POD in Logfile.pm, for additional information. You can get the Logfile module from CPAN, in Ulrich Pfeifer's author directory:

~/authors/id/ULPFR/

The latest release, as of the writing of this chapter, was 0.113. Have a look, and don't forget to give feedback to the author when you can.

Generating Graphical Data

After you've gotten your reports back from Logfile, you've pretty much exhausted the functionality of the module. In order to produce an image that illustrates the data, you'll need to resort to other means. Because the report gives essentially two-dimensional data, it'll be easy to produce a representative image using the GD module, which was previously introduced in Chapter 12, "Multimedia."

This example provides you with a module that uses the GD class and provides one method, to which you should pass a Logfile object along with some other parameters that specify which field from the log file you wish to graph, the resultant image size, and the font. This method would actually be better placed in the Logfile::Base class, because that's where each of the Logfile subclasses, including the one for Apache logfiles, derives its base methods. It will be submitted to the author of the Logfile module after some additional testing.

For now, just drop the GD_Logfile.pm file (from Listing 14.4) into the Logfile directory in your @INC. You'll also need to have the GD extension and the Logfile module installed, of course. The GD_Logfile module uses the GD package to produce a GIF image of the graph corresponding to the data that the Logfile class's report() method summarizes. The entire module, including the graph() subroutine, is shown in Listing 14.4.