MS Word

Remove Office-Specific Tags from Web Pages

The Problem:

When Word renders a document into HTML, it puts weird tags in it. How can I remove the excess tags?

The Solution:

The tags you're objecting to are the Office-specific tags that Word uses to store the information needed to re-create the document in its entirety. These tags store all kinds of information that isn't displayed in the document, such as author and editing information; menu, toolbar, and keyboard customizations in the document; and even VBA code (macros, user forms, and classes). This extra information not only makes your web pages larger than necessary but can also expose private details, such as author and editing information, to anyone who views the page source.

The process of exporting an entire document to HTML so that it can be brought back into Word without any loss is called round-tripping. Word's "Web Page" and "Single File Web Page" formats save the data for round-tripping, while the "Web Page, Filtered" format does not. Use "Web Page, Filtered" for pages you want to put on your web site, but be warned that even the filtered HTML is verbose. If you know HTML, you may prefer to save a document in "Web Page, Filtered" format, open it in a text editor or HTML editor, and strip out the unnecessary information manually before posting it to your web site.
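If you'd rather script the manual cleanup than do it by hand, the following is a minimal Python sketch of the idea. The function name strip_office_markup and every pattern in it are assumptions based on markup commonly seen in Word-generated HTML (conditional comments, <xml> blocks, namespaced tags such as <o:p>, Mso* classes, and mso-* style properties); it is not an exhaustive or official list, and regular expressions are a crude tool for HTML, so check the result before publishing.

```python
import re

# All patterns are assumptions about typical Word output, not a complete list.
# Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(r"<!--\[if\b.*?<!\[endif\]-->",
                                 re.DOTALL | re.IGNORECASE)
# Embedded <xml> ... </xml> blocks holding document properties
XML_ISLAND = re.compile(r"<xml\b.*?</xml>", re.DOTALL | re.IGNORECASE)
# Namespaced Office tags such as <o:p>, </o:p>, <w:WordDocument>, <v:shape>
NAMESPACED_TAG = re.compile(r"</?[ovwm]:[^>]*>", re.IGNORECASE)
# class attributes whose value starts with Mso (quoted or unquoted)
MSO_CLASS = re.compile(r'\s+class\s*=\s*"?Mso\w*"?', re.IGNORECASE)
# Individual mso-* declarations inside style attributes or <style> blocks
MSO_STYLE_PROP = re.compile(r"""\s*mso-[^:;"']+:[^;"']*;?""", re.IGNORECASE)
# style attributes left empty once their mso-* declarations are gone
EMPTY_STYLE = re.compile(r"""\s+style\s*=\s*(?:""|'')""", re.IGNORECASE)


def strip_office_markup(html: str) -> str:
    """Return html with common Office-specific markup removed (best effort)."""
    for pattern in (CONDITIONAL_COMMENT, XML_ISLAND, NAMESPACED_TAG,
                    MSO_CLASS, MSO_STYLE_PROP, EMPTY_STYLE):
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    sample = ('<p class="MsoNormal" style="mso-margin-top-alt:auto;color:red">'
              'Hi<o:p></o:p></p>'
              '<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->')
    print(strip_office_markup(sample))
```

Because the mso-* pattern is applied to the whole file, it would also alter body text that happens to look like an mso- declaration; a real HTML parser would be a safer choice for anything beyond a quick cleanup.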

To get rid of the Office-specific tags in Word 2003 or Word XP, choose File » Save as Web Page and then choose Web Page, Filtered in the "Save as type" drop-down list. Specify the filename, folder, and title, and click Save. When Word warns you that Office-specific tags will be removed (see Figure 2-6), click the Yes button.

Figure 2-6. When saving a Word document as a web page, you can strip out the Office-specific tags by using the "Web Page, Filtered" format.

Word 2000 doesn't offer a built-in option for stripping out Office-specific tags. Your best bet is to use a third-party utility, such as the free HTMLTidy (