Creating Your Own Sitemap

You've already seen the nuts and bolts of the Sitemap protocol language, along with the basic template for a Sitemap XML document. It's time to take the next step and create your very own Sitemap document, or, in reality, my very own Sitemap document, because you're going to use URLs from my web site as examples. Let's get started!

A Basic Sitemap Document

Listing 20.1 contains the code for a complete Sitemap document for my web site. By complete, I mean that it meets all of the requirements of a Sitemap document, although it doesn't actually include URLs for all of the pages on my site.

Listing 20.1. A Complete Sitemap Document for My Web Site
 1: <?xml version="1.0" encoding="UTF-8"?>
 2: <urlset xmlns="http://www.google.com/schemas/sitemap/0.84"
 3:   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 4:   xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
 5:   http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
 6:   <url>
 7:     <loc>http://www.xyz.com/</loc>
 8:     <lastmod>2005-08-23</lastmod>
 9:     <changefreq>daily</changefreq>
10:     <priority>1.0</priority>
11:   </url>
12:   <url>
13:     <loc>http://www.xyz.com/mambo/index.php?
14:     option=com_content&amp;task=category&amp;sectionid=2&amp;id=11&amp;</loc>
15:     <lastmod>2005-08-20</lastmod>
16:     <changefreq>weekly</changefreq>
17:     <priority>0.8</priority>
18:   </url>
19:   <url>
20:     <loc>http://www.xyz.com/mambo/index.php?
21:     option=com_content&amp;task=blogcategory&amp;id=16&amp;Itemid=37</loc>
22:     <lastmod>2005-08-23</lastmod>
23:     <changefreq>daily</changefreq>
24:     <priority>0.8</priority>
25:   </url>
26:   <url>
27:     <loc>http://www.xyz.com/mambo/index.php?
28:     option=com_simpleboard&amp;Itemid=48&amp;func=showcat&amp;catid=37</loc>
29:     <lastmod>2005-08-18</lastmod>
30:     <changefreq>daily</changefreq>
31:     <priority>0.6</priority>
32:   </url>
33:   <url>
34:     <loc>http://www.xyz.com/mambo/index.php?
35:     option=com_simpleboard&amp;Itemid=48&amp;func=showcat&amp;catid=36</loc>
36:     <lastmod>2005-08-21</lastmod>
37:     <changefreq>daily</changefreq>
38:     <priority>0.6</priority>
39:   </url>
40: </urlset>

This code is very similar to the Sitemap code you saw earlier in the chapter, except that in this case multiple URLs are specified. Notice that I opted to use all of the optional tags in every URL; there's no good reason not to unless you just don't want to take the time to be so detailed. One thing I did skimp on a little is the <lastmod> tag for each URL, which I specified only as a date. However, because none of these pages is listed as changing more often than daily, it really isn't necessary to be more exact with the modification date.
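For a page that changes more often than daily, a fuller <lastmod> value would make more sense. The following is only a sketch: the URL http://www.xyz.com/news/ and the timestamp are made up for illustration, and it assumes the W3C date/time format with an explicit time zone offset.

```xml
<url>
  <!-- Hypothetical page that is updated several times a day -->
  <loc>http://www.xyz.com/news/</loc>
  <!-- Date plus time and UTC offset instead of a bare date (value is illustrative) -->
  <lastmod>2005-08-23T14:05:00+00:00</lastmod>
  <changefreq>hourly</changefreq>
  <priority>0.8</priority>
</url>
```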

Google requires all Sitemap documents to use UTF-8 encoding (see line 1 in the example), as well as escaped entities for the following symbols: &, ', ", >, and <.
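As a quick illustration of the escaping rule, the five symbols map to the standard XML entities &amp;, &apos;, &quot;, &gt;, and &lt;. The sketch below uses a hypothetical search URL that is not part of my site's real Sitemap; it simply shows an ampersand in a query string being escaped inside <loc>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <!-- Hypothetical raw URL: http://www.xyz.com/search.php?q=xml&sort=date -->
    <!-- Inside the element, the ampersand must be written as an entity -->
    <loc>http://www.xyz.com/search.php?q=xml&amp;sort=date</loc>
  </url>
</urlset>
```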

One other thing worth pointing out in this sample Sitemap is that I opted to use 0.6, 0.8, and 1.0 as the priority levels of the pages. Presumably, a more complete Sitemap would continue on down the range, finishing up at 0.0 for the least important pages.

A Sitemap document can only reference URLs in the same folder or a child folder of the location of the Sitemap file. In the sample in Listing 20.1, the file would need to be placed at the same level as http://www.xyz.com/ because that URL is hierarchically the highest URL in the Sitemap.
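To make the location rule concrete, here is a sketch of a Sitemap that is assumed to live at http://www.xyz.com/mambo/sitemap.xml. That file location and the child folder path are hypothetical; the first URL is borrowed from Listing 20.1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed location of this file: http://www.xyz.com/mambo/sitemap.xml -->
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <!-- Allowed: the URL is in the same folder as the Sitemap file -->
  <url>
    <loc>http://www.xyz.com/mambo/index.php?option=com_simpleboard&amp;Itemid=48&amp;func=showcat&amp;catid=37</loc>
  </url>
  <!-- Allowed: the URL is in a child folder (hypothetical path) -->
  <url>
    <loc>http://www.xyz.com/mambo/images/</loc>
  </url>
  <!-- Not allowed in this file: http://www.xyz.com/ sits above the mambo folder -->
</urlset>
```

A Sitemap stored in the mambo folder could not list the home page, which is why the file for Listing 20.1 belongs at the top level of the site.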

Breaking Your Mapping into Multiple Documents

If you're responsible for a really monstrous site with loads of pages, you may need to consider breaking up your Sitemap into multiple Sitemap documents, in which case you'll also need a Sitemap index document to pull them all together. A Sitemap index document is very similar to a normal Sitemap except that it uses the following tags:

  • <sitemapindex>: The root element of a Sitemap index document, which serves as a container for individual <sitemap> elements

  • <sitemap>: The storage unit for an individual Sitemap within a Sitemap index; serves as a container for the <loc> and <lastmod> elements

  • <loc>: The URL of a Sitemap document; this tag is required

  • <lastmod>: The date/time of the last change to the Sitemap document; this tag is optional

The first two tags work very similarly to the <urlset> and <url> tags in an individual Sitemap. However, instead of organizing URLs they organize other Sitemap documents. The remaining two tags, <loc> and <lastmod>, also work similarly to their individual Sitemap counterparts, except in this case they determine the location and last modification date of a Sitemap document, not a web page. Following is some code that demonstrates how a Sitemap index is assembled out of these tags:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">

This sample code shows how to include two Sitemap documents in a Sitemap index. If you look closely, you'll see that the two Sitemap documents in this example have been compressed using the gzip tool, which explains why their file extension is .xml.gz. Also notice that the two Sitemaps have different modification dates and times, which means that an intelligent web crawler could focus on reindexing only the URLs in the newer Sitemap.
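A complete index along the lines described here might look like the following sketch. The file names sitemap1.xml.gz and sitemap2.xml.gz and both timestamps are assumptions chosen for illustration, not values taken from my actual site.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <!-- First gzip-compressed Sitemap (hypothetical file name and timestamp) -->
  <sitemap>
    <loc>http://www.xyz.com/sitemap1.xml.gz</loc>
    <lastmod>2005-08-23T18:23:17+00:00</lastmod>
  </sitemap>
  <!-- Second gzip-compressed Sitemap, last changed earlier than the first -->
  <sitemap>
    <loc>http://www.xyz.com/sitemap2.xml.gz</loc>
    <lastmod>2005-08-20T09:41:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```

Given timestamps like these, a crawler that last visited on August 21 would only need to fetch the first Sitemap again.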

A Sitemap index document can only reference Sitemaps that are stored on the same site as the index file.

In the Sitemap index example I didn't include schema information for validating the Sitemap index. However, Google provides an XSD schema that you can use to validate Sitemap indexes, and you reference it the same way you saw an XSD schema referenced in an individual Sitemap earlier. The schema for individual Sitemaps is http://www.google.com/schemas/sitemap/0.84/sitemap.xsd, while the schema for Sitemap indexes is http://www.google.com/schemas/sitemap/0.84/siteindex.xsd.
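Following the pattern used in lines 2 through 5 of Listing 20.1, a Sitemap index that references its schema might be sketched like this. The single <sitemap> entry uses a hypothetical file name and date; only the namespace and schema URLs come from the text above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
  http://www.google.com/schemas/sitemap/0.84/siteindex.xsd">
  <!-- Hypothetical entry; a real index would list each of your Sitemap files -->
  <sitemap>
    <loc>http://www.xyz.com/sitemap1.xml.gz</loc>
    <lastmod>2005-08-23</lastmod>
  </sitemap>
</sitemapindex>
```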