Microsoft Excel

Save to SpreadsheetML and Extracting Data

Since Excel XP, Excel has included an XML export option. SpreadsheetML provides an XML representation of your spreadsheets, complete with formatting and formula information.

Although there are several ways to read Excel spreadsheet files without using Excel, one of the easiest options is to export XML files that use Microsoft's SpreadsheetML vocabulary. SpreadsheetML isn't complete-most notably, charts and VBA code are omitted-but it does represent the core components of a spreadsheet, including formulas, named ranges, and formatting.

This tutorial uses Excel features that are available only in Excel XP and Excel 2003 on Windows. Earlier versions of Excel do not support this, and neither do current or announced Macintosh versions of Excel.

The easiest way to get started with SpreadsheetML is to save a spreadsheet as XML. The spreadsheet shown in the figure includes data, formulas, named ranges and cells, and some simple formatting.

Figure. A test spreadsheet for SpreadsheetML
figs/exhk_0814.gif

If you save the spreadsheet using the XML Spreadsheet (*.xml) format, which you can access by selecting File » Save As..., you'll get a long XML document containing the markup shown in example. Key portions are highlighted in bold.

Example. A SpreadsheetML document
<?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
  <Author>XYZ</Author>
  <LastAuthor>XYZ</LastAuthor>
  <Created>2003-12-03T15:48:38Z</Created>
  <LastSaved>2004-01-26T21:04:14Z</LastSaved>
  <Company>ABC &amp; Inc</Company>
  <Version>11.5703</Version>
 </DocumentProperties>
 <ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
  <WindowHeight>6150</WindowHeight>
  <WindowWidth>8475</WindowWidth>
  <WindowTopX>120</WindowTopX>
  <WindowTopY>30</WindowTopY>
  <ProtectStructure>False</ProtectStructure>
  <ProtectWindows>False</ProtectWindows>
 </ExcelWorkbook>
 <Styles>
  <Style ss:ID="Default" ss:Name="Normal">
   <Alignment ss:Vertical="Bottom"/>
   <Borders/>
   <Font/>
   <Interior/>
   <NumberFormat/>
   <Protection/>
  </Style>
  <Style ss:ID="s21">
   <NumberFormat ss:Format="mmm\-yy"/>
  </Style>
  <Style ss:ID="s22">
   <NumberFormat ss:Format="&quot;$&quot;#,##0.00"/>
  </Style>
  <Style ss:ID="s23">
   <Font x:Family="Swiss" ss:Bold="1"/>
  </Style>
 </Styles>
 <Names>
  <NamedRange ss:Name="Critters" ss:RefersTo="=Sheet1!R4C2:R11C2"/>
  <NamedRange ss:Name="Date" ss:RefersTo="=Sheet1!R1C2"/>
  <NamedRange ss:Name="ID" ss:RefersTo="=Sheet1!R4C1:R11C1"/>
  <NamedRange ss:Name="Price" ss:RefersTo="=Sheet1!R4C3:R11C3"/>
  <NamedRange ss:Name="Quantity" ss:RefersTo="=Sheet1!R4C4:R11C4"/>
  <NamedRange ss:Name="Total" ss:RefersTo="=Sheet1!R12C5"/>
 </Names>
 <Worksheet ss:Name="Sheet1">
  <Table ss:ExpandedColumnCount="5" ss:ExpandedRowCount="12" x:FullColumns="1"
   x:FullRows="1">
   <Column ss:AutoFitWidth="0" ss:Width="73.5"/>
   <Column ss:AutoFitWidth="0" ss:Width="96.75"/>
   <Column ss:Index="5" ss:AutoFitWidth="0" ss:Width="56.25"/>
   <Row>
    <Cell ss:StyleID="s23"><Data ss:Type="String">Sales for:</Data></Cell>
    <Cell ss:StyleID="s21"><Data ss:Type="DateTime">2004-01-01T00:00:00.000</
Data><NamedCell
      ss:Name="Date"/></Cell>
   </Row>
   <Row ss:Index="3" ss:StyleID="s23">
    <Cell><Data ss:Type="String">ID Number</Data></Cell>
    <Cell><Data ss:Type="String">Critter</Data></Cell>
    <Cell><Data ss:Type="String">Price</Data></Cell>
    <Cell><Data ss:Type="String">Quantity</Data></Cell>
    <Cell><Data ss:Type="String">Total</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">4627</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Diplodocus</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">22.5</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">127</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
2857.5</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">3912</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Brontosaurus</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">17.5</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">74</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">1295</Data>
</Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">9845</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Triceratops</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">12</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">91</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
1092</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">9625</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Vulcanodon</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">19</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">108</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
2052</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">5903</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Stegosaurus</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">18.5</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">63</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
1165.5</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">1824</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Monoclonius</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">16.5</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">133</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
2194.5</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">9728</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Megalosaurus</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">23</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">128</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
2944</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">8649</Data><NamedCell ss:Name="ID"/></Cell>
    <Cell><Data ss:Type="String">Barosaurus</Data><NamedCell ss:
Name="Critters"/></Cell>
    <Cell ss:StyleID="s22"><Data ss:Type="Number">17</Data><NamedCell
      ss:Name="Price"/></Cell>
    <Cell><Data ss:Type="Number">91</Data><NamedCell ss:Name="Quantity"/></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=RC[-2]*RC[-1]"><Data ss:Type="Number">
1547</Data></Cell>
   </Row>
   <Row>
    <Cell ss:Index="4" ss:StyleID="s23"><Data ss:Type="String">Total:</Data></
Cell>
    <Cell ss:StyleID="s22" ss:Formula="=SUM(R[-8]C:R[-1]C)"><Data ss:
Type="Number">15147.5</Data><NamedCell
      ss:Name="Total"/></Cell>
   </Row>
  </Table>
  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
   <Print>
    <ValidPrinterInfo/>
    <HorizontalResolution>600</HorizontalResolution>
    <VerticalResolution>600</VerticalResolution>
   </Print>
   <Selected/>
   <Panes>
    <Pane>
     <Number>3</Number>
     <ActiveRow>11</ActiveRow>
     <ActiveCol>4</ActiveCol>
    </Pane>
   </Panes>
   <ProtectObjects>False</ProtectObjects>
   <ProtectScenarios>False</ProtectScenarios>
  </WorksheetOptions>
 </Worksheet>
 <Worksheet ss:Name="Sheet2">
  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
   <ProtectObjects>False</ProtectObjects>
   <ProtectScenarios>False</ProtectScenarios>
  </WorksheetOptions>
 </Worksheet>
 <Worksheet ss:Name="Sheet3">
  <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
   <ProtectObjects>False</ProtectObjects>
   <ProtectScenarios>False</ProtectScenarios>
  </WorksheetOptions>
 </Worksheet>
</Workbook>

The first highlighted line, <?mso-application progid="Excel.Sheet"?>, is an XML processing instruction that tells Windows (actually a component Office 2003 adds to Windows) that this XML document is, in fact, an Excel spreadsheet. When Windows displays the file, it will have an Excel logo on it, and double-clicking it will open it in Excel.

The root element of the document, Worksheet, appears immediately after the processing instruction. Its attributes define namespaces used for various pieces of SpreadsheetML. The next few lines comprise mostly metadata, window presentation, and formatting information, and it isn't until you get to the Names and Worksheet elements that there's much worth examining closely.

The Names element identifies the named ranges and cells in the document. These two NamedRange elements define the Quantity named range-which extends from row 4, column 4, to row 11, column 4-and the Total named range, which is just the cell in row 12 of column 5:

  <NamedRange ss:Name="Quantity" ss:RefersTo="=Sheet1!R4C4:R11C4"/>
  <NamedRange ss:Name="Total" ss:RefersTo="=Sheet1!R12C5"/>

The meat of the spreadsheet is in the Worksheet element. It starts out by defining how large the actual table of data is:

 <Worksheet ss:Name="Sheet1">
  <Table ss:ExpandedColumnCount="5" ss:ExpandedRowCount="12" x:
FullColumns="1"
   x:FullRows="1">

This sheet, named Sheet1, used 5 columns and 12 rows. (The x:FullColumns and x:FullRows attributes are in another namespace that Excel won't use for layout.) The actual information in the table is stored in Row and Cell elements:

   <Row>
    <Cell ss:StyleID="s23"><Data ss:Type="String">Sales for:</Data></Cell>
    <Cell ss:StyleID="s21"><Data ss:Type="DateTime">2004-01-01T00:00:00.
000</Data><NamedCell
      ss:Name="Date"/></Cell>
   </Row>
   <Row ss:Index="3" ss:StyleID="s23">
    <Cell><Data ss:Type="String">ID Number</Data></Cell>

This Row, the first in the spreadsheet, contains two Cell elements. The first, formatted as s23 (bold, in this spreadsheet) and using the datatype String, contains the text "Sales for:". The second cell is formatted as s21 (plain) and uses the datatype DateTime. Its contents are given in a verbose ISO 8601 format. This cell also is part of a named range, in this case, "Date".

Most of the other Row elements follow similar patterns, but there are a few items worth extra attention. The second Row element has an extra attribute on it, ss:Index:

   <Row ss:Index="3" ss:StyleID="s23">

Excel doesn't represent empty rows or empty columns with empty Row or Cell elements. It just adds an ss:Index attribute to the next Row or Cell with content to tell you where you are. This requires programs that process this XML to pay a little more attention when assembling their tables. The other thing to watch is formulas:

    <Cell ss:StyleID="s22" ss:Formula="=SUM(R[-8]C:R[-1]C)"><Data ss:
Type="Number">15147.5</Data><NamedCell
      ss:Name="Total"/></Cell>

In previous page, this cell had a name of Total, a value of $15,147.50, and a formula of =SUM(E4:E11). All of those parts are here. But you must assemble them from the style of s22 (defined earlier in the document as a monetary number format), the value 15147.5, and a formula that uses relative references to say "the sum of the values in the same column as this one from 8 rows up to 1 row up."

This might not seem like much fun to process, but it's actually not that hard to do once you have an XML toolkit. You can use C#, Java, Perl, Python, VB, or your favorite XML-enabled programming language to extract the information, but we'll use XSLT to demonstrate.

There are many choices of XSLT processors out there, from command-line tools to Windows applications. You might want to try Architags XRay, available from http://architag.com/xray/, or Michael Kay's SAXON, at http://saxon.sourceforge.net/. Microsoft offers various XSLT tools, including a command-line tool, at http://msdn.microsoft.com/library/default.asp?url=/downloads/list/xmlgeneral.asp.

The stylesheet in example below, run against the XML in example above, will produce the much simpler XML in next example.

Example. An XSLT stylesheet for extracting content from the SpreadsheetML from above example
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns="http://simonstl.com/ns/dinosaurs/"
  xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 >
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes" encoding="US-
ASCII"/>
<xsl:template match="/">
  <xsl:apply-templates select="ss:Workbook"/>
</xsl:template>
<xsl:template match="ss:Workbook">
  <dinosaurs>
       <xsl:apply-templates select="ss:Worksheet[@ss:Name = 'Sheet1']"/>
  </dinosaurs>
</xsl:template>
<xsl:template match="ss:Worksheet">
   <date><xsl:value-of select="ss:Table/ss:Row/ss:Cell[@ss:StyleID = 's21']" /
></date>
   <xsl:apply-templates select="ss:Table" />
</xsl:template>
<xsl:template match="ss:Table">
   <xsl:apply-templates select="ss:Row[position( ) &gt; 2]" />
<!--Note that because Excel skips the blank row, the third row is in position
2-->
</xsl:template>
<xsl:template match="ss:Row[ss:Cell[4]]">
<sale>
   <IDnum><xsl:apply-templates select="ss:Cell[1]" /></IDnum>
   <critter><xsl:apply-templates select="ss:Cell[2]" /></critter>
   <price><xsl:apply-templates select="ss:Cell[3]" /></price>
   <quantity><xsl:apply-templates select="ss:Cell[4]" /></quantity>
   <total><xsl:apply-templates select="ss:Cell[5]" /></total>
</sale>
</xsl:template>
<xsl:template match="ss:Row">
<total><xsl:apply-templates select="ss:Cell[2]" /></total>
</xsl:template>
</xsl:stylesheet>

The heart of the stylesheet is the template that matches all rows with four or more child cell elements. It extracts the information from the cells and puts it into XML elements that reflect the data, producing the result shown in example.

Example. Information extracted from SpreadsheetML to a custom XML vocabulary
<dinosaurs xmlns="http://simonstl.com/ns/dinosaurs/" xmlns:ss="urn:schemas-
microsoft-com:office:spreadsheet">
<date>2004-01-01T00:00:00.000</date>
<sale>
<IDnum>4627</IDnum>
<critter>Diplodocus</critter>
<price>22.5</price>
<quantity>127</quantity>
<total>2857.5</total>
</sale>
<sale>
<IDnum>3912</IDnum>
<critter>Brontosaurus</critter>
<price>17.5</price>
<quantity>74</quantity>
<total>1295</total>
</sale>
<sale>
<IDnum>9845</IDnum>
<critter>Triceratops</critter>
<price>12</price>
<quantity>91</quantity>
<total>1092</total>
</sale>
<sale>
<IDnum>9625</IDnum>
<critter>Vulcanodon</critter>
<price>19</price>
<quantity>108</quantity>
<total>2052</total>
</sale>
<sale>
<IDnum>5903</IDnum>
<critter>Stegosaurus</critter>
<price>18.5</price>
<quantity>63</quantity>
<total>1165.5</total>
</sale>
<sale>
<IDnum>1824</IDnum>
<critter>Monoclonius</critter>
<price>16.5</price>
<quantity>133</quantity>
<total>2194.5</total>
</sale>
<sale>
<IDnum>9728</IDnum>
<critter>Megalosaurus</critter>
<price>23</price>
<quantity>128</quantity>
<total>2944</total>
</sale>
<sale>
<IDnum>8649</IDnum>
<critter>Barosaurus</critter>
<price>17</price>
<quantity>91</quantity>
<total>1547</total>
</sale>
<total>15147.5</total>
</dinosaurs>

It's the same data, but in a very different form. The formula information has been discarded in this case, but because Excel provides the values as well as the formulas, this particular application didn't need to understand the formulas.