Repairing Invalid Documents

If you have any programming experience, the term "debugging" is no doubt familiar to you. If not, get ready because debugging is often the most difficult part of any software development project. Debugging refers to the process of finding and fixing errors in a software application. Depending on the complexity of the code in an application, debugging can get quite messy. The process of repairing invalid XML documents is in many ways similar to debugging software. However, XML isn't a programming language and XML documents aren't programs, which makes things considerably easier for XML developers. That's the good news.

The other good news is that validation tools give you a huge boost when it comes to making your XML documents free of errors. Most validation tools not only alert you to the existence of errors in a document, but also give you a pretty good idea of where the errors are. This is no small benefit. Even an experienced XML developer can overlook the most obvious errors after staring at code for long periods of time. Not only that, but XML is an extremely picky language, which leaves the door wide open for you to make mistakes. Errors are, unfortunately, a natural part of the development process, whether you are developing software, XML documents, or typing skills.

So, knowing that your XML documents are bound to have a few mistakes, how do you go about finding and eliminating the errors? The first step is to run the document through a standard XML parser to check that the document is well formed. Remember that if you don't associate a document with a schema, a validation tool checks only whether the document is well formed. As an example, the <oXygen/> XML Editor includes a toolbar button for simply checking that a document is well formed, as opposed to carrying out a full document validation (see Figure 8.7).

Figure 8.7. It is often helpful to test an XML document for well-formedness before taking things to the next level and performing a full validation against a schema.

[Figure label: Check That Document is Well Formed]

The first time you create a document, consider taking it for a spin through a validation tool without associating it with a schema. At this stage the tool will report only errors in the document that have to do with it being well formed. In other words, no validity checks will be made, which is fine for now.
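If you don't have an XML editor handy, the same kind of check can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard library's xml.etree.ElementTree module, which parses a document without validating it against any schema. The function name check_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text is well formed, else (False, message).

    ElementTree is a non-validating parser, so this reports only
    well-formedness errors, never validity errors.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser's message includes the line and column of the problem.
        return False, str(err)
    return True, None


# A well-formed document, then one whose <duration> element is never closed.
print(check_well_formed("<session><duration>PT45M</duration></session>"))
print(check_well_formed("<session><duration>PT45M</session>"))
```

The second call should come back with a failure and a message from the parser that points at the offending line and column.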

Errors that turn up during the well-formedness check include typos that leave a start tag and its end tag with different names, unmatched tag pairs, and unquoted attribute values, to name a few. (A name that is misspelled consistently is still well formed; that kind of typo surfaces later, during validation.) These errors should be relatively easy to find, and at some point you should get pretty good at creating documents that are close to being well formed on the first try. In other words, it isn't too terribly difficult to avoid the errors that keep a document from being well formed.
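To see a few of these errors in action, here is a small sketch that feeds several tiny documents to Python's standard xml.etree.ElementTree parser and collects whatever it rejects. The sample snippets and the find_problems helper are invented for illustration; the exact wording of the messages comes from the underlying parser and may differ from what your own tool reports.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets, each showing one kind of mistake (or none).
SAMPLES = {
    "end tag name typo": "<session><duration>PT45M</duraton></session>",
    "missing end tag": "<session><duration>PT45M</session>",
    "unquoted attribute value": "<distance units=miles>5.5</distance>",
    "well formed": '<distance units="miles">5.5</distance>',
}


def find_problems(samples):
    """Map each sample label to the parser's message, for rejected samples only."""
    problems = {}
    for label, text in samples.items():
        try:
            ET.fromstring(text)
        except ET.ParseError as err:
            problems[label] = str(err)
    return problems


problems = find_problems(SAMPLES)
for label, message in problems.items():
    print(f"{label}: {message}")
```

Only the first three samples should show up in the output; the last one parses cleanly because its attribute value is quoted and its tags match.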

After you've determined that your document is well formed, you can wire it back to a schema and take a shot at checking it for validity. Don't be too disappointed if several errors are reported the first time around. Keep in mind that in XML you are working with a very demanding technology, one that insists on things being absolutely 100% accurate. You must use elements and attributes exactly as they are laid out in the schema; anything else will lead to validity errors.
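To get a feel for what a validity check adds on top of a well-formedness check, consider the following toy sketch. It is not a real schema language, and a real project would use a DTD or XML Schema with a validating tool. The RULES table, the validity_errors function, and the kind and pace names are all assumptions made up for this example; they simply list which attributes and child elements each element may have.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a schema: for each element, the
# attributes it may carry and the child elements it may contain.
RULES = {
    "session": {"attributes": {"date", "type", "heartrate"},
                "children": {"duration", "distance", "location", "comments"}},
    "duration": {"attributes": set(), "children": set()},
    "distance": {"attributes": {"units"}, "children": set()},
    "location": {"attributes": set(), "children": set()},
    "comments": {"attributes": set(), "children": set()},
}


def validity_errors(element, rules=RULES):
    """Return a list of messages for anything the rules do not allow."""
    rule = rules.get(element.tag)
    if rule is None:
        return [f"element <{element.tag}> is not declared"]
    errors = []
    for name in element.attrib:
        if name not in rule["attributes"]:
            errors.append(f"attribute '{name}' is not allowed on <{element.tag}>")
    for child in element:
        if child.tag not in rule["children"]:
            errors.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            errors.extend(validity_errors(child, rules))
    return errors


# Well formed, but it uses an attribute and an element the rules don't mention.
INVALID = ('<session date="2001-11-19" kind="running">'
           '<duration>PT45M</duration><pace>8:10</pace></session>')
errors = validity_errors(ET.fromstring(INVALID))
for message in errors:
    print(message)
```

The document parses without complaint, yet the check still reports two problems. That is the difference between being well formed and being valid.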

Perhaps the trickiest error to track down is invalid nesting, which strictly speaking breaks well-formedness rather than validity. If you accidentally close an element in the wrong place with a misplaced end tag, it can really confuse a validation tool and give you some strange results. Following is a simple example of what I'm talking about:

<session date="2001-11-19" type="running" heartrate="158">
  <duration>PT45M</duration>
  <distance units="miles">5.5</distance>
  <location>Warner Park</location>
  <comments>Mid-morning run, a little winded throughout.
</session>
</comments>

In this code the closing </comments> tag appears after the closing </session> tag, which is an overlap error because the entire comments element should be inside the session element. The problem with this kind of error is that it often confuses the validation tool. There is no doubt that you'll get an error report, but it may not isolate the error as accurately as you had hoped. It's even possible for the validation tool to get confused to the extent that a domino effect results, where the single misplaced tag causes many other errors. So, if you get a slew of errors that don't seem to make much sense, study your document carefully and make sure all of your start and end tags match up properly.
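You can watch a parser trip over this very document with a short sketch. It assumes Python's standard xml.etree.ElementTree module; the BROKEN and FIXED strings and the first_error_line helper are names invented for this example. Notice that the parser complains on the line holding the unexpected </session> tag, not on the line where the comments element was left open.

```python
import xml.etree.ElementTree as ET

# The listing from the text, with </comments> after </session>.
BROKEN = """<session date="2001-11-19" type="running" heartrate="158">
  <duration>PT45M</duration>
  <distance units="miles">5.5</distance>
  <location>Warner Park</location>
  <comments>Mid-morning run, a little winded throughout.
</session>
</comments>"""

# The same document with the comments element closed inside session.
FIXED = """<session date="2001-11-19" type="running" heartrate="158">
  <duration>PT45M</duration>
  <distance units="miles">5.5</distance>
  <location>Warner Park</location>
  <comments>Mid-morning run, a little winded throughout.</comments>
</session>"""


def first_error_line(xml_text):
    """Return the line number of the first parse error, or None if there is none."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _column = err.position
        return line
    return None


print(first_error_line(BROKEN))
print(first_error_line(FIXED))
```

The error is reported one line below the element that actually needs fixing, which is a mild version of the confusion described above. In a longer document the gap between the real mistake and the reported location can be much wider.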

Beyond the misplaced end tag problem, most validity errors are relatively easy to track down with the help of a good validation tool. Just pay close attention to the output of the tool, and tackle the errors one at a time. With a little diligence, you can have valid documents without much work.