Getting to Know Google Sitemaps

XML enters the picture in regard to web crawling thanks to a relatively new service offered by Google called Sitemaps. Google Sitemaps is a beta technology that makes it possible to feed the Google search engine information about your web pages to assist Google's web crawlers. In some ways Sitemaps works similarly to RSS (Really Simple Syndication), which you learn about in Tutorial 24, "Syndicating the Web with RSS News Feeds." RSS is used to notify visitors to your web site about changes to the site's content. Sitemaps works in much the same way except the recipient of a Sitemap "feed" is Google, as opposed to visitors to your web site.

By the Way

Google Sitemaps is a beta technology, which means that it is still under development and subject to significant changes as its designers continue to refine it. Keep this in mind as you use the service.

A Google Sitemap is an XML document, or a set of documents, that contains a list of URLs of pages within your web site that you want to be crawled. For the most accurate crawling, you should list the URL of every page on your site, which can potentially be quite a few pages. It's not uncommon for even relatively small web sites to have hundreds of pages, while larger sites can reach into the thousands or even tens of thousands of pages. Many sites also have dynamic content that is generated by specifying attributes (query string parameters) through a URL. Examples of such pages include Microsoft Active Server Pages (ASP) and PHP pages, both of which produce unique pages solely by passing attributes to the same document via the URL of the page. You'll want to try to include every combination of attributes for these pages in order to have your site thoroughly crawled.
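To make the idea of "every combination of attributes" concrete, here is a minimal sketch in Python that expands one dynamic page into the full list of URLs you might place in a Sitemap. The page address, the attribute names (category and page), and the helper name expand_urls are all hypothetical and chosen only for illustration.

```python
from itertools import product
from urllib.parse import urlencode


def expand_urls(base_url, params):
    """Return one URL for each combination of the given attribute values."""
    names = list(params)
    urls = []
    # product() walks through every combination of the value lists
    for values in product(*(params[name] for name in names)):
        query = urlencode(list(zip(names, values)))
        urls.append(f"{base_url}?{query}")
    return urls


# Hypothetical catalog page driven by two attributes
urls = expand_urls(
    "http://www.example.com/catalog.php",
    {"category": ["books", "music"], "page": [1, 2]},
)

for url in urls:
    print(url)
```

Two values for each of two attributes yield four URLs. The count is the product of the number of values per attribute, so it grows quickly on a real site, which is one reason Sitemaps for dynamic sites are usually generated by a script rather than typed by hand.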

In addition to the URL of a page, a Sitemap also allows you to provide some cues to the crawler about how often the page is updated. More specifically, you can include the last modification date of a page, along with how frequently the page's content changes. Google doesn't hold you to this update frequency, by the way. It's just a general estimate to help determine how often the page should ideally be crawled.

The last piece of information that you associate with a page in a Sitemap is its priority ranking, which is relative to other pages on your site, not other pages on the Web in general. Your inclination might be to flag all of your pages as having the highest priority, but all you would accomplish is giving them equal priority with respect to each other. The idea behind the page priority is to provide a mechanism for giving more important pages a higher potential for getting crawled. As an example, you might want your home page to have a higher priority than, say, your "about" page. Or maybe on a storefront site you want the product catalog pages to all have a higher priority than the company history page.
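The following sketch pulls these pieces of information together by building a small Sitemap with Python's standard ElementTree module. The element names (urlset, url, loc, lastmod, changefreq, and priority) follow the published Sitemap protocol. The namespace URI shown is the one used by the public sitemaps.org protocol and is an assumption here, since the service's own documentation determines which schema applies. The page addresses, dates, and the helper name url_entry are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the public sitemaps.org protocol; check the
# service's documentation for the schema it actually expects.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def url_entry(loc, lastmod, changefreq, priority):
    """Build one <url> element describing a single page."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc                # address of the page
    ET.SubElement(url, "lastmod").text = lastmod        # date as YYYY-MM-DD
    ET.SubElement(url, "changefreq").text = changefreq  # e.g. daily, monthly
    ET.SubElement(url, "priority").text = f"{priority:.1f}"  # 0.0 to 1.0
    return url


urlset = ET.Element("urlset", xmlns=SITEMAP_NS)

# Hypothetical pages: the home page outranks the "about" page
urlset.append(url_entry("http://www.example.com/", "2006-01-15", "daily", 1.0))
urlset.append(url_entry("http://www.example.com/about.html", "2005-11-02", "yearly", 0.3))

sitemap_xml = ET.tostring(urlset, encoding="unicode")
print(sitemap_xml)
```

Notice that the two priority values matter only in relation to each other. Setting both to 1.0 would tell the crawler nothing about which page you consider more important.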

Google Sitemaps is valuable beyond just causing your web pages to be crawled more regularly. You may have some pages that are effectively on islands and would otherwise never be crawled. For example, maybe you have some pages in a knowledge base that are only accessed via search queries. Because no permanent links exist for the pages, a normal web crawler would never find them. By placing the pages in a Sitemap, you make it possible for the knowledge base to be crawled and included in Google search results.

When you submit a Sitemap to Google, you're notifying Google of the specific URLs that make up your site, helping it along in its job of crawling your site thoroughly and accurately. When you add a new page to your site, you should update the Sitemap and resubmit it to Google so that the page is targeted for crawling right away. It's a way to assist Google so that search engine results stay as closely synchronized with your web site content as possible.

By the Way

Google wasn't the first search engine to experiment with the concept of a submitted sitemap for the purpose of assisting its crawler. Yahoo! has a Content Acquisition Program that works much like Google Sitemaps except that you have to pay to use it. I would expect Microsoft to offer a service similar to Google Sitemaps at some point in the future, seeing as how Microsoft is clearly making a run at Google with MSN Search.

After you've created a Sitemap document, you must publish it to your web site and then notify Google of its location. From there, everything else is automatic. The next section explains how to go about coding a Sitemap using XML, as well as how to submit it to Google.