Using Perl, it's relatively easy to convert any given document to an HTML representation, as long as you know the general layout of the document and have a mapping from the document's internal layout to a corresponding HTML element.
General Issues
As we've stated, the general algorithm for converting a given document to HTML requires knowing something about the internals of the document. For ASCII text files, this is fairly easy to do. Binary formats require some additional investigation. Furthermore, some binary document formats are proprietary, and you may need to obtain the structure from the owner or distributor in order to parse them properly. For the sake of brevity, we'll restrict our discussion to parsing and converting ordinary ASCII documents to HTML.
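As a minimal sketch of that general algorithm, the following script converts a plain ASCII text file's contents to HTML under an assumed layout: blocks are separated by blank lines, the first block is a title, and every later block is a paragraph. The mapping (title to h1, block to p) and the subroutine names escape_html and text_to_html are illustrative choices, not part of any standard module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that have special meaning in HTML.
# The ampersand must be handled first so later entities are not re-escaped.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Map the assumed text layout onto HTML elements:
#   first non-blank block -> <h1>
#   every later block     -> <p>
sub text_to_html {
    my ($text) = @_;

    # Blocks are separated by one or more blank lines.
    my @blocks = grep { /\S/ } split /\n\s*\n/, $text;
    return '' unless @blocks;

    my @html;
    my $first = 1;
    for my $block (@blocks) {
        $block =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        $block =~ s/\s*\n\s*/ /g;    # join hard-wrapped lines
        my $tag = $first ? 'h1' : 'p';
        push @html, "<$tag>" . escape_html($block) . "</$tag>";
        $first = 0;
    }
    return join("\n", @html) . "\n";
}

print text_to_html("Release Notes\n\nFixed a bug where a < b failed.\nSee the log.\n");
```

The escaping step matters for any conversion of this kind: text copied into HTML unescaped can be misread by the browser as markup.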
Mail Folders
The typical UNIX mail file is one example of a text file that has a known format and that's relatively easy to parse and convert to an HTML equivalent. The resulting HTML file can be useful as a Web document when, for instance, it contains the messages sent to a public mailing list. There are numerous means available for making a mail file searchable. Some searching algorithms have already been discussed in this tutorial; however, because we know the format of a mail file, we not only can search through it more easily, but we also can provide a hypertext view of the messages within. Converting a mail file to an HTML document provides the ability to browse all of the messages in the file in the order that they arrived, much like the standard MUA interface, or in a threaded order, according to subject.
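The following sketch illustrates the idea. It assumes the common mbox layout, in which each message begins with a line starting with "From " followed by headers, a blank line, and the body. The subroutine names (parse_mbox, group_by_subject, mbox_to_html) and the sample messages are hypothetical, and the header parsing is deliberately simplified: folded header lines and MIME content are ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Split mbox text into messages. Each message starts at a line beginning
# with "From " (the separator line) and runs to the next such line.
sub parse_mbox {
    my ($mbox) = @_;
    my @messages;
    for my $raw (split /^(?=From )/m, $mbox) {
        next unless $raw =~ /^From /;
        my ($head, $body) = split /\n\n/, $raw, 2;
        $body = '' unless defined $body;
        $body =~ s/\s+\z//;    # drop trailing blank lines
        my %header;
        for my $line (split /\n/, $head) {
            # Simplification: folded (continued) header lines are ignored.
            $header{lc $1} = $2 if $line =~ /^([\w-]+):\s*(.*)$/;
        }
        push @messages, {
            from    => defined $header{from}    ? $header{from}    : '(unknown sender)',
            subject => defined $header{subject} ? $header{subject} : '(no subject)',
            body    => $body,
        };
    }
    return @messages;
}

# Group messages into threads by subject, ignoring leading "Re:" prefixes.
# Returns a list of array references, one per thread, in first-seen order.
sub group_by_subject {
    my (@messages) = @_;
    my (%thread, @order);
    for my $m (@messages) {
        (my $key = lc $m->{subject}) =~ s/^(?:re:\s*)+//;
        push @order, $key unless $thread{$key};
        push @{ $thread{$key} }, $m;
    }
    return map { $thread{$_} } @order;
}

# Render the messages in arrival order: an index of links, then the bodies.
sub mbox_to_html {
    my (@messages) = @_;
    my $html = "<html><head><title>Mail Archive</title></head><body>\n<ol>\n";
    my $n = 0;
    for my $m (@messages) {
        $n++;
        $html .= sprintf qq{<li><a href="#msg%d">%s</a> (%s)</li>\n},
            $n, escape_html($m->{subject}), escape_html($m->{from});
    }
    $html .= "</ol>\n";
    $n = 0;
    for my $m (@messages) {
        $n++;
        $html .= sprintf qq{<h2><a name="msg%d">%s</a></h2>\n<pre>%s</pre>\n},
            $n, escape_html($m->{subject}), escape_html($m->{body});
    }
    return $html . "</body></html>\n";
}

# Hypothetical two-message mail file used only for demonstration.
my $sample = <<'END';
From alice@example.com Mon Jan  1 10:00:00 1996
From: alice@example.com
Subject: Perl <-> HTML

How do I convert mail to HTML?

From bob@example.com Mon Jan  1 11:00:00 1996
From: bob@example.com
Subject: Re: Perl <-> HTML

Try parsing the separator lines.
END

print mbox_to_html(parse_mbox($sample));
```

A threaded view could be produced by passing each group returned by group_by_subject to a similar rendering loop. Real mail folders have many more edge cases than this sketch handles, which is one reason to consider the existing tools described next.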
The MailArchive tool is available at any CPAN site. It provides a script to process mail folders and store them in a way that can then be accessed through an index.html, which it creates as it parses the mail file. It also provides a search library to allow keyword searches on the archive. The current version, as of this chapter, is 1.9, and its location in the CPAN is ~/authors/id/LFINI/MailArch-1.9.tar.gz.
Another Perl tool for producing HTML from mail files and folders is called MHonArc, written by Earl Hood. It provides a powerful set of features for indexing, searching, and marking up HTML produced from mail folders. Get all the information about obtaining and using MHonArc from http://www.oac.uci.edu/indiv/ehood/mhonarc.html.
The author of MHonArc, Earl Hood, has done extensive work on the topic of SGML DTD conversion to HTML, as well. While that topic is beyond the scope of this text (and this author :-), Mr. Hood's perlSGML package is also available through the CPAN and at the preceding site.