The What and Why of XML

With the universe expanding, human population increasing at an alarming rate across the globe, and a new boy band created every week, was it really necessary to introduce yet another web technology with yet another cryptic acronym? In the case of XML, the answer is yes. Next to HTML itself, XML is positioned to have the most widespread and long-term ramifications of any web technology to date. The interesting thing about XML is that its impact has gone and will continue to go largely unnoticed by most web users. Unlike HTML, which reveals itself in flashy text and graphics, XML is more of an under-the-hood kind of technology. If HTML is the fire engine red paint and supple leather interior of a sports car, XML is the turbocharged engine and sport suspension. Okay, maybe the sports car analogy is a bit much, but you get the idea that XML's impact on the Web is hard to see with the naked eye. However, the benefits are directly realized in all kinds of different ways. More specifically, if you've ever shopped on Amazon.com, purchased music from Apple iTunes, or read a syndicated news feed via RSS (Really Simple Syndication), you've used XML without realizing it.

By the way, you might as well get used to seeing loads of acronyms. Virtually every technology associated with XML has its own acronym, so it's impossible to learn about XML without getting to know a few dozen acronyms. Don't worry, I'll break them to you gently!

A Quick History of HTML

To understand the need for XML, at least as it applies to the Web, you have to first consider the role of HTML. In the early days of the Internet, some European physicists created HTML by simplifying another markup language known as SGML (Standard Generalized Markup Language). I won't get into the details of SGML, but let's just say it was overly complicated, at least for the purpose of sharing scientific documents on the Internet. So, pioneering physicists created a simplified version of SGML called HTML that could be used to create what we now know as web pages. The creation of HTML represented the birth of the World Wide Web, a layer of visual documents that resides on the global network known as the Internet.

HTML was great in its early days because it allowed scientists to share information over the Internet in an efficient and relatively structured manner. It wasn't until later that HTML started to become an all-encompassing formatting and display language for web pages. It didn't take long before web browsers caught on and HTML started being used to code more than scientific papers. HTML quickly went from a tidy little markup language for researchers to a full-blown online publishing language. And once it was established that HTML could be jazzed up simply by adding new tags, the creators of web browsers pretty much went crazy by adding lots of nifty features to the language. Although these new features were neat at first, they compromised the simplicity of HTML and introduced lots of inconsistencies when it came to how browsers rendered web pages. HTML had started to resemble a bad remodeling job on a house that really should've been left alone.

As with most revolutions, the birth of the Web was very chaotic, and the modifications to HTML reflected that chaos. More recently, a significant effort has been made to reel in the inconsistencies of HTML and to attempt to restore some order to the language. The problem with disorder in HTML is that web browsers have to guess at how a page is to be displayed, which is not a good thing. Ideally, a web page designer should be able to define exactly how a page is to look and have it look the same regardless of what kind of browser or operating system someone is using. This utopia is still off in the future somewhere, but XML is playing a significant role in leading us toward it, and significant progress has been made.

Getting Multilingual with XML

XML is a meta-language, which is a fancy way of saying that it is a language used to create other markup languages. I know this sounds a little strange, but it really just means that XML provides a basic structure and set of rules to which any markup language must adhere. Using XML, you can create a unique markup language to model just about any kind of information, including web page content. Knowing that XML is a language for creating other markup languages, you could create your own version of HTML using XML. You could also create a markup language called VPML (Virtual Pet Markup Language), for example, which you could use to create and manage virtual pets. The point is that XML lays the ground rules for organizing information in a consistent manner, and that information can be anything from web pages to virtual pets.

Throughout these tutorials you will learn about several of the more intriguing markup languages that are based on XML. For example, you will find out about SVG and RSS, which allow you to create vector graphics and syndicate news feeds from web sites, respectively.

You might be thinking that virtual pets don't necessarily have anything to do with the Web, so why mention them? The reason is that XML is not entirely about web pages. In fact, XML in the purest sense really has nothing to do with the Web, and can be used to represent any kind of information on any kind of computer. If you can visualize all the information whizzing around the globe between computers, mobile phones, televisions, and radios, you can start to understand why XML has much broader ramifications than just cleaning up web pages. However, one of the first applications of XML is to restore some order to the Web, which is why I've provided an explanation of XML with the Web in mind. Besides, one of the main benefits of XML is the ability to develop XML documents once and then have them viewable on a range of devices, such as desktop computers, handheld computers, mobile phones, and Internet appliances.

One of the really awesome things about XML is that it looks very familiar to anyone who has used HTML to create web pages. Going back to our virtual pet example, check out the following XML code, which reveals what a hypothetical VPML document might look like:

  <pets>
    <pet name="Maximillian" type="pot bellied pig" age="3">
      <friend name="Augustus"/>
      <friend name="Nigel"/>
    </pet>
    <pet name="Augustus" type="goat" age="2">
      <friend name="Maximillian"/>
    </pet>
    <pet name="Nigel" type="chipmunk" age="2">
      <friend name="Maximillian"/>
    </pet>
  </pets>
This XML (VPML) code includes three virtual pets: Maximillian the pot-bellied pig, Augustus the goat, and Nigel the chipmunk. If you study the code, you'll notice that tags are used to describe the virtual pets much as tags are used in HTML code to describe web pages. However, in this example the tags are unique to the VPML language. It's not too hard to understand the meaning of the code, thanks to the descriptive tags. In fact, an important design parameter of XML was for XML content to always be human-readable. By studying the VPML code for a few seconds, it becomes apparent that Maximillian is friends with both Augustus and Nigel, but Augustus and Nigel aren't friends with each other. Maybe it's because they are the same age, or maybe it's just that Maximillian is a particularly friendly pig. Either way, the code describes several pets along with the relationships between them. This is a good example of the flexibility of the XML language. Keep in mind that you could create a virtual pet application that used VPML to share information with other virtual pet owners.

Unlike HTML, which consists of a predefined set of tags such as <head>, <body>, and <p>, XML allows you to create custom markup languages with tags that are unique to a certain type of data, such as virtual pets.

The virtual pet example demonstrates how flexible XML is in solving data structuring problems. Unlike a traditional database, XML data is pure text, which means it can be processed and manipulated very easily, in addition to being readable by people. For example, you can open up any XML document in a text editor such as Windows Notepad (or TextEdit on Macintosh computers) and view or edit the code. The fact that XML is pure text also makes it very easy for applications to transfer data between one another, across networks, and also across different computing platforms such as Windows, Macintosh, and Linux. XML essentially establishes a platform-neutral means of structuring data, which is ideal for networked applications, including web-based applications.
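To give a feel for how easily an application can process XML as plain text, here is a minimal sketch in Python that reads the virtual pet data with the standard library's xml.etree.ElementTree module. The VPML text is embedded in the script with a single <pets> root element so that it is well-formed, and the load_friendships function name is just a hypothetical choice for illustration, not part of any real VPML toolkit.

```python
import xml.etree.ElementTree as ET

# Hypothetical VPML document based on the virtual pet example.
# The <pets> root element is an assumption made so the text is well-formed XML.
VPML = """
<pets>
  <pet name="Maximillian" type="pot bellied pig" age="3">
    <friend name="Augustus"/>
    <friend name="Nigel"/>
  </pet>
  <pet name="Augustus" type="goat" age="2">
    <friend name="Maximillian"/>
  </pet>
  <pet name="Nigel" type="chipmunk" age="2">
    <friend name="Maximillian"/>
  </pet>
</pets>
"""


def load_friendships(vpml_text):
    """Return a dict mapping each pet's name to a list of its friends' names."""
    root = ET.fromstring(vpml_text)
    return {
        pet.get("name"): [friend.get("name") for friend in pet.findall("friend")]
        for pet in root.findall("pet")
    }


friendships = load_friendships(VPML)
for name, friends in friendships.items():
    print(f"{name} is friends with {', '.join(friends)}")
```

Because the data is ordinary text with descriptive tags, a dozen lines of code are enough to recover the relationships between the pets; no database driver or proprietary file format is involved.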

XML isn't just for web-based applications, however. As an example, the entire Microsoft Office line of products uses XML under the hood to store and share document data.

The Convergence of HTML and XML

Just as some Americans are apprehensive about the proliferation of spoken languages other than English, some web developers initially feared XML's role in the future of the Web. Although I'm sure a few HTML purists still exist, is it valid to view XML as posing a risk to the future of HTML? And if you're currently an HTML expert and have yet to explore XML, will you have to throw all you know out the window and start anew with XML? The answer to both of these questions is a resounding no! In fact, once you fully come to terms with the relationship between XML and HTML, you'll realize that XML actually complements HTML as a web technology. Perhaps more interesting is the fact that XML is in many ways a parent to HTML, as opposed to a rival sibling (more on this relationship in a moment).

Earlier in the tutorial I mentioned that the main problem with HTML is that it got somewhat messy and unstructured, resulting in a lot of confusion surrounding the manner in which web browsers render web pages. To better understand XML and its relationship to HTML, you need to know why HTML has gotten messy. HTML was originally designed as a means of sharing written ideas among scientific researchers. I say "written ideas" because there were no graphics or images in the early versions of HTML. So, in its inception, HTML was never intended to support fancy graphics, formatting, or page-layout features. Instead, HTML was intended to focus on the meaning of information, or the content of information. It wasn't until web browser vendors got excited that HTML was expanded to address the presentation of information. In fact, HTML was in many ways changed to focus entirely on how information appears, which is what ultimately prompted the creation of XML.

You'll learn throughout these tutorials that one of the main goals of XML is to separate the meaning of information from the presentation of it. There are a variety of reasons why this is a good idea, and they all have to do with improving the organization and structure of information. Although presentation plays an important role in any web site, modern web applications have evolved to become driven by data of very specific types, such as financial transactions. HTML is a very poor markup language for representing such data. With its support for custom markup languages, XML makes it possible to carefully describe data and the relationships between pieces of data. By focusing on content, XML allows you to describe the information in web documents. More importantly, XML makes it possible to precisely describe information that is shuttled across the Net between applications. For example, Amazon.com uses XML to describe products on its site and allow developers to create applications that intelligently analyze and extract information about those products.
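One way to picture the separation of meaning from presentation is a short Python sketch in which a single piece of XML data is presented in two different ways. The <quote> element, its attribute names, and the as_text and as_html functions are all hypothetical, invented purely for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stock quote; the element and attribute names are made up.
QUOTE_XML = '<quote symbol="XYZ" price="12.50" currency="USD"/>'


def as_text(xml_text):
    """Present the quote as a plain line of text."""
    quote = ET.fromstring(xml_text)
    return f"{quote.get('symbol')}: {quote.get('price')} {quote.get('currency')}"


def as_html(xml_text):
    """Present the same quote as an HTML paragraph."""
    quote = ET.fromstring(xml_text)
    symbol = escape(quote.get("symbol"))
    price = escape(quote.get("price"))
    currency = escape(quote.get("currency"))
    return f"<p><b>{symbol}</b> {price} {currency}</p>"


print(as_text(QUOTE_XML))
print(as_html(QUOTE_XML))
```

The XML says only what the data is (a symbol, a price, a currency). How it looks is decided entirely by whichever function consumes it, so a new presentation can be added without touching the data.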

You might have noticed that I've often used the word "document" instead of "page" when referring to XML data. You can no longer think of the Web as a bunch of linked pages. Instead, you should think of it as linked documents. Although this may seem like a picky distinction, it reveals a lot about the perception of web content. A page is an inherently visual thing, whereas a document can be anything ranging from a stock quote to a virtual pet to a music CD on Amazon.com.

If XML describes data better than HTML, does it mean that XML is set to upstage HTML as the markup language of choice for the Web? Not exactly. XML is not a replacement for HTML, or even a competitor of HTML. XML's impact on HTML has to do more with cleaning up HTML than it does with dramatically altering HTML. The best way to compare XML and HTML is to remember that XML establishes a set of strict rules that any markup language must follow. HTML is a relatively unstructured markup language that could benefit from the rules of XML. The natural merger of the two technologies is to make HTML adhere to the rules and structure of XML. To accomplish this merger, a new version of HTML has been formulated that adheres to the stricter rules of XML. The new XML-compliant version of HTML is known as XHTML. You learn a great deal more about XHTML in "Adding Structure to the Web with XHTML." For now, just understand that one long-term impact XML will have on the Web has to do with cleaning up HTML.
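To make the idea of "stricter rules" a little more concrete, the following Python sketch feeds two snippets of markup to an XML parser. It checks only whether the markup is well-formed XML (every tag closed and properly nested), which is one of the rules XHTML inherits; it does not check that a document is valid XHTML. The is_well_formed function and the sample markup are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Loose HTML: the <p> and <br> elements are never closed.
loose_html = "<html><body><p>Hello<br>world</body></html>"

# XHTML style: every element is closed and properly nested.
xhtml_style = "<html><body><p>Hello<br/>world</p></body></html>"

print(is_well_formed(loose_html))   # False
print(is_well_formed(xhtml_style))  # True
```

A typical web browser will happily guess at what the first snippet means, whereas an XML parser rejects it outright. That refusal to guess is exactly the kind of order that XHTML aims to bring to web pages.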

Most standardized web technologies, such as HTML and XML, are overseen by the W3C, or the World Wide Web Consortium, which is an organizational body that helps to set standards for the Web. You can learn more about the W3C by visiting its web site at http://www.w3.org/.

XML's relationship with HTML doesn't end with XHTML, however. Although XHTML is a great idea that is already making web pages cleaner and more consistent for web browsers to display, we're a ways off from seeing a Web that consists of cleanly structured XHTML documents (pages). It's currently still too convenient to take advantage of the freewheeling flexibility of the HTML language. Where XML is making a significant immediate impact on the Web is in web-based applications that must shuttle data across the Internet. XML is an excellent medium for representing data that is transferred back and forth across the Internet as part of a complete web-based application. In this way, XML is used as a behind-the-scenes data transport language, whereas HTML is still used to display traditional web pages to the user. This is evidence that XML and HTML can coexist happily both now and into the future.
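As a rough sketch of XML in its behind-the-scenes data transport role, the Python code below turns some application data into an XML string on the sending side and rebuilds the data on the receiving side. The <order> and <item> elements, their attributes, and the encode_order and decode_order functions are hypothetical; a real web application would define its own markup and send the string over the network rather than passing it between two functions.

```python
import xml.etree.ElementTree as ET


def encode_order(order_id, items):
    """Sender side: turn plain Python data into an XML string."""
    order = ET.Element("order", id=str(order_id))
    for name, quantity in items:
        ET.SubElement(order, "item", name=name, quantity=str(quantity))
    return ET.tostring(order, encoding="unicode")


def decode_order(xml_text):
    """Receiver side: rebuild the order ID and item list from the XML string."""
    order = ET.fromstring(xml_text)
    items = [
        (item.get("name"), int(item.get("quantity")))
        for item in order.findall("item")
    ]
    return int(order.get("id")), items


# The message is plain text, so any platform or language can read it.
message = encode_order(1001, [("XML primer", 1), ("Pet food", 3)])
print(message)

order_id, items = decode_order(message)
print(order_id, items)
```

The user never sees the XML message; it simply carries the data between applications, while HTML remains responsible for whatever is eventually displayed in the browser.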