Parsing HTTP Logfiles
As Webmaster, you may be called upon from time to time to report on the usage of your Web pages. There may be several reasons for this, not the least of which may be to justify your existence :-). More likely, though, you will want to get a feel for the general usage pattern of your Web site or for the types of errors that are occurring.
Most of the available HTTP servers provide an access log by default, along with some sort of error log. Each of these logs has its own record format, but they share a number of common fields, which lends itself naturally to an object-oriented model for parsing them and producing reports.
In this section, we'll look at the Logfile module, written by Ulrich Pfeifer. It lets you subclass the base record object and has subclass modules available for a number of servers' log files, including NCSA httpd, Apache httpd, CERN httpd, wu-ftpd, and others. If there isn't a subclass for your particular server, it's pretty easy to write one.
General Issues
An HTTP server implements its logging according to configuration settings, usually within the httpd.conf file. The data you have to analyze depends on which log files you enable in the configuration file or, in the case of the Apache server, at compile time in the server's source. Several logs can be enabled in the configuration, including the access log, error log, referer log, and agent log. Each of these holds information that you may need to summarize or analyze.
There are some security and privacy issues related to logging too much information. Be sure to keep the appropriate permissions on your logfiles to prevent arbitrary snooping or parsing, and truncate them when you've completed the data gathering. See Chapter 3 for more details.
In general, an httpd log file is a text file whose records are lines, each terminated with the line terminator appropriate to the architecture on which the server is running. Each record consists of fields, usually separated by whitespace, that hold strings such as dates, file paths, hostnames or IP addresses, and other items. Ordinarily there is one line, or record, per connection, but some types of transactions generate multiple lines in the log file(s). Take this into account when designing the algorithm and code that parse the log.
The access log gives you general information about which sites are connecting to your server and which files are being retrieved. The error log records the output sent to the STDERR filehandle from all connections. Both of these, and especially the error log, may need to be parsed every now and then to see what's happening with your server's connections.
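To make the record layout concrete, here is a minimal sketch in plain Perl that splits one access log line into named fields. It assumes the NCSA Common Log Format; the sample line, the field names, and the parse_clf_line() helper are hypothetical and are not part of the Logfile module.

```perl
use strict;
use warnings;

# Hypothetical access log line in NCSA Common Log Format (an assumption;
# your server may be configured to write a different format).
my $line = 'host.example.com - - [10/Oct/1996:13:55:36 -0700] '
         . '"GET /mall/os/index.html HTTP/1.0" 200 2326';

# host ident authuser [date] "request" status bytes
my $clf = qr/^(\S+) \s+ (\S+) \s+ (\S+) \s+ \[([^\]]+)\] \s+ "([^"]*)" \s+ (\d{3}) \s+ (\d+|-)$/x;

# Return a hash reference of fields, or nothing if the line does not match.
sub parse_clf_line {
    my ($text) = @_;
    my @f = $text =~ $clf or return;
    my %rec;
    @rec{qw(host ident user date request status bytes)} = @f;
    # The request field holds "METHOD path protocol".
    @rec{qw(method path protocol)} = split ' ', $rec{request};
    return \%rec;
}

my $rec = parse_clf_line($line);
print "$rec->{host} requested $rec->{path} (status $rec->{status})\n" if $rec;
```

Lines that do not match, such as continuation lines from multi-line transactions, simply return nothing, so the caller can decide whether to skip or collect them.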
Parsing
Using the Logfile module, each discrete transaction record is parsed and then abstracted into a Perl object, based on some parameter of the request. While the log file is being parsed, the instance variables created by the new() method depend on which type of log is being parsed and which field (Hostname, Date, Path, and so on) from the log file you're interested in summarizing. When parsing is complete, the return value, a reference blessed into the Logfile class, has a hash whose key/value pairs correspond to the parameters on which you want to gather statistics and the number of times each one was counted. In the simplest case, you write these lines:
use Logfile::Apache; # to parse the popular Apache server log
$l = new Logfile::Apache File  => '/usr/local/etc/httpd/logs/access_log',
                         Group => [qw(Host Domain File)];
This parses your access log and returns the blessed reference.
Reporting and Summaries
After you've invoked the new() method for the Logfile class and passed in your log file to be parsed, you can invoke the report() method on the returned object:
$l->report(Group => File, Sort => Records, Top => 10);
The preceding line produces a report of the access counts for the ten most frequently retrieved files in your archive, along with each file's percentage of the total number of retrievals. For the sample Apache access log included with the Logfile distribution, the results from the report() method look like this:
/mall/os 5 35.71%
/mall/web 3 21.43%
/~watkins 3 21.43%
/cgi-bin/mall 1 7.14%
/graphics/bos-area-map 1 7.14%
/~rsalz 1 7.14%
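The arithmetic behind such a report is simple enough to sketch in plain Perl. The following is an illustration only, not the Logfile module's implementation: the top_report() helper is hypothetical, and the counts are copied from the sample output above.

```perl
use strict;
use warnings;

# Counts per file, taken from the sample report above.
my %count = (
    '/mall/os'               => 5,
    '/mall/web'              => 3,
    '/~watkins'              => 3,
    '/cgi-bin/mall'          => 1,
    '/graphics/bos-area-map' => 1,
    '/~rsalz'                => 1,
);

# Return up to $top rows of [key, count, percent], highest count first.
sub top_report {
    my ($counts, $top) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    return () unless $total;    # nothing counted, so avoid dividing by zero
    # Sort by descending count; break ties alphabetically for stable output.
    my @keys = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
    splice @keys, $top if @keys > $top;
    return map { [ $_, $counts->{$_}, 100 * $counts->{$_} / $total ] } @keys;
}

printf "%-24s %3d %6.2f%%\n", @$_ for top_report(\%count, 10);
```

With 14 retrievals in total, the five hits on /mall/os work out to 5/14, or 35.71 percent, which agrees with the first row of the report.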
You can generate many other reports with the Logfile module, including multiple-variable reports, to suit your needs and interests. See the Logfile documentation, embedded as POD in Logfile.pm, for additional information. You can get the Logfile module from CPAN, in Ulrich Pfeifer's author directory.
The latest release, as of the writing of this chapter, was 0.113. Have a look, and don't forget to give feedback to the author when you can.
Generating Graphical Data
After you've gotten your reports back from Logfile, you've pretty much exhausted the functionality of the module. To produce an image that illustrates the data, you'll need to resort to other means. Because the report gives essentially two-dimensional data, it's easy to produce a representative image using the GD module, which was introduced in Chapter 12, "Multimedia."
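Turning report counts into a bar chart is mostly a matter of scaling each count into a rectangle. The sketch below shows only that geometry in plain Perl, so it runs without GD installed. The bar_rectangles() helper and the 300x100 image size are hypothetical and are not taken from Listing 14.4.

```perl
use strict;
use warnings;

# Compute one bar per value, scaled so the largest value spans the full
# image height. Each bar is returned as [x1, y1, x2, y2] with the origin
# at the top left, which is the coordinate convention GD uses.
sub bar_rectangles {
    my ($width, $height, @values) = @_;
    return () unless @values;
    my ($max) = sort { $b <=> $a } @values;
    return () unless $max > 0;
    my $bar_w = int($width / @values);
    my @rects;
    for my $i (0 .. $#values) {
        my $bar_h = int($height * $values[$i] / $max);
        push @rects, [ $i * $bar_w, $height - $bar_h,
                       ($i + 1) * $bar_w - 1, $height - 1 ];
    }
    return @rects;
}

# Counts from the sample report, laid out in a hypothetical 300x100 image.
print join(',', @$_), "\n" for bar_rectangles(300, 100, 5, 3, 3, 1, 1, 1);
```

Each set of four coordinates can then be handed to a drawing call such as GD's filledRectangle() method, together with a color allocated on the image.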
This example provides a module that uses the GD class and offers one method, to which you pass a Logfile object along with other parameters that specify which field from the log file to graph, the resulting image size, and the font. This method would actually be better placed in the Logfile::Base class, because that's where each of the Logfile subclasses, including the one for Apache log files, gets its base methods. It will be submitted to the author of the Logfile module after some additional testing.
For now, just drop the GD_Logfile.pm file (from Listing 14.4) into the Logfile directory in your @INC path. You'll also need to have the GD extension and the Logfile module installed, of course. The GD_Logfile module uses the GD package to produce a GIF image of the graph corresponding to the data from the report() method of the Logfile class. The entire module, including the graph() subroutine, is shown in Listing 14.4.