Notes from the Standards NZ meeting about OOXML

Hi folks, these began as my personal notes from the Standards NZ meeting. I've now fleshed them out with references and links to make them easier to follow.
Each heading is about a topic or question we had, with some background and my suggested action for Standards NZ.
They may not make sense without knowing the context of the meeting and/or what we talked about. If you have any questions please ask and I'll try to post an explanation and update the article with corrections.

Harmonizing the Formats

I think this should be the end goal, to merge the formats.

It was suggested that this would create a 3rd format. It's not about creating a 3rd format because of course this harmonized format would become the new ODF. It would be about removing unnecessary and pointless differences between ODF and OOXML and making it easier for me (or anyone) to develop with office suites.

My presentation* explained Ecma's stated reasons for believing that the formats are too different, but I hope it was clear to everyone that page breaks, table handling, and cell styles aren't significant technical problems. (Gray also mentioned the "mixed content model" as a reason why they can't be merged. As I understand it, this is a data modelling issue unrelated to any feature set, so it doesn't affect harmonizing the formats.)

Also there'd be a lot more software to choose from when it's not such a divisive market.

Technically it can be done, and many others in the XML and document community think so too. Tim Bray, the co-creator of XML itself, says so, and so do people from Microsoft such as Alan Yates. We even heard from Gray that it could take 2 years. It could easily take that amount of time to fix the existing problems in OOXML.

Office suite formats do not move as quickly as other parts of the computing industry. We still have to deal with files from Office 2003, or Office '97, for example. I'm happy with around 2-4 years to harmonize the formats considering the benefits it would bring.

Specific Issues Raised By ECMA

Please see the Presentation on Ecma Points in OpenDocument [0.5MB] or Presentation on Ecma Points in PDF [0.4MB].

  • It's of note that in the first 3 cases (page breaks, unified nested tables, and cell styles) the supposed reason why the formats can't be harmonized is that ODF supports more features than OOXML. Or in graphical terms, it's a feature set comparison that looks like this (SVG). For more information on these see this blog post from a guy who writes conversion software (who's not me!).
  • The formulas issue: ODF 1.2, which is due in October 2007, does have formulas but the current ISO ODF doesn't. OOXML does have formulas, and I comment on the quality of these formulas below in "Propagation of Historical Bugs". Fixing these bugs will take time, quite possibly enough time for ODF 1.2 with OpenFormula to arrive. Here's a quote from Wikipedia which helps clear up the timelines,
    • "Microsoft continued to protest that OpenDocument could not be used because it did not define a format for spreadsheet formulas, yet its own specification continued to omit any specification about formulas through April 2006. Finally, in May 2006, Microsoft also began defining formulas in its XML format, 15 months after the first version of OpenFormula and 3 months after OASIS posted its first official draft of its specification."
    • -- Wikipedia OpenFormula Timeline
  • The "mixed content model" difference isn't an issue at all as I understand it. See this document demonstrating that it's a data modelling issue unrelated to any feature-set.
  • I incorrectly stated in my presentation that ODF did not have custom schemas. I was wrong -- in ODF 1.2 Draft (due October 2007) there is indeed custom schema support.
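
To illustrate the "mixed content model" point above, here is a minimal Python sketch. The two XML fragments are simplified, hypothetical stand-ins (not actual ODF or OOXML markup, and namespaces are left out): the first mixes text and inline elements inside a paragraph, the second wraps every piece of text in its own run element. The function names are mine. Both fragments carry the same sentence with the same bold word, which is why I'd call this a data modelling difference rather than a feature difference.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: text and inline elements mixed together.
MIXED = "<p>Hello <b>brave</b> world</p>"

# Hypothetical, simplified markup: every piece of text sits in its own run.
RUNS = (
    "<p>"
    "<r><t>Hello </t></r>"
    "<r><rPr><b/></rPr><t>brave</t></r>"
    "<r><t> world</t></r>"
    "</p>"
)


def spans_from_mixed(xml_text):
    """Return (text, is_bold) pairs from the mixed-content form."""
    paragraph = ET.fromstring(xml_text)
    spans = []
    if paragraph.text:
        spans.append((paragraph.text, False))
    for child in paragraph:
        if child.text:
            spans.append((child.text, child.tag == "b"))
        if child.tail:
            # Text that follows the inline element belongs to the paragraph.
            spans.append((child.tail, False))
    return spans


def spans_from_runs(xml_text):
    """Return (text, is_bold) pairs from the run-based form."""
    paragraph = ET.fromstring(xml_text)
    spans = []
    for run in paragraph.findall("r"):
        text = run.findtext("t", default="")
        is_bold = run.find("rPr/b") is not None
        spans.append((text, is_bold))
    return spans


if __name__ == "__main__":
    # Both models reduce to the same list of styled text spans.
    print(spans_from_mixed(MIXED) == spans_from_runs(RUNS))
```

A converter between the two shapes only has to walk one tree and emit the other, which is a structural transformation and not a loss of features.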

Māori, Multiculturalism

OOXML and ODF support Unicode (ISO/IEC 10646) in order to render all the letters and characters needed for Māori, many Polynesian, Asian, and Aboriginal languages, English, and many more.
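
As a small illustration of what Unicode support means for a macron vowel, here is a Python sketch. It assumes UTF-8 as the encoding (a common choice for XML, not something stated in my notes) and uses only the standard library.

```python
import unicodedata

# "Māori" written with an escape so the code point is unambiguous.
word = "M\u0101ori"

# The macron vowel is a single code point in composed (NFC) form.
macron_a = word[1]
code_point = f"U+{ord(macron_a):04X}"
char_name = unicodedata.name(macron_a)

# Assumed encoding: UTF-8, where this letter takes two bytes.
utf8_bytes = word.encode("utf-8")

# The same letter can also be stored as "a" plus a combining macron (NFD).
decomposed = unicodedata.normalize("NFD", word)
recomposed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(code_point, char_name)
    print(utf8_bytes)
    print(len(word), len(decomposed), recomposed == word)
```

Either form round-trips through a Unicode-aware format, so the letters themselves aren't where the cultural problem lies.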

So characters themselves are well dealt with, but cultural expressiveness -- in the case of office suites -- would also mean allowing cultural styles, calendar holidays for the Māori new year, etc.

OOXML contains mostly American and/or Christian symbols (Christmas Trees, Easter Eggs, etc.) in a fixed and non-extensible list. This means that we can't add a border style of Matariki (Māori New Year stars), Koru Borders, Taniwhas, or other kiwi styles.

These fixed lists are detailed in sections 2.18.4 (p. 2414, "Border Styles"), 5.1.12.56 (p. 4557, "Preset Shape Types"), and 5.1.12.76 (p. 4645, "Preset Text Shape Types").

It should instead be changed to allow either arbitrary images, or arbitrary images as well as the existing fixed list, in order to allow multi-cultural styles.

(by the way -- ODF solved this through the first suggested solution of using arbitrary images)

Human Readability of XML

There was a comparison of <cell> (ODF) vs <c> (OOXML) in spreadsheets and which was faster. The goals of XML say,

  • "6. XML documents should be human-legible and reasonably clear.
    10. Terseness in XML markup is of minimal importance."
    -- W3C: Goals of XML

Choosing terse names like <c> for speed clearly contradicts these goals of XML and best practices.

The designers of XML knew what they were doing because while we can remember what "c" means in this case it becomes problematic when we get hundreds or thousands of these shorthand references. HTML, the web page language, has some shorthand references like this but then there are only around 20 things to memorize, so in practice it's not a problem. OOXML has hundreds of these cryptic names.

This kind of naming problem comes up all the time in technical documents, so it's good to see that the ISO/IEC Directives, Part 2 ("Rules for the structure and drafting of International Standards"), section 4.3, set out requirements for avoiding this type of inconsistent use of terminology.

  • "Analogous wording shall be used to express analogous provisions; identical wording shall be used to express identical provisions"

Here's an example of identical wording referring to different things in OOXML,

  • "The w:sz element is an example of major internal inconsistencies in the specifications measurements:
    • For fonts, the w:sz element specifies the size in half points (2.3.2.36, page 1013).
    • For frameset, the w:sz element has a string value that could be a relative value, a percentage, or a number of pixels (2.15.2.39, page 2136). The examples on page 2138 do not refer to w:sz at all.
    • However, as the child of rPr (3.4.11, page 2846), its value is in points."

    -- Grokdoc Objections
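
To show why the unit matters, here is a minimal Python sketch of the font case only, assuming the half-point rule quoted above. The function names are hypothetical helpers of mine, not anything defined in the spec.

```python
def half_points_to_points(sz_val):
    """Convert a font-size value stored in half points to points."""
    return int(sz_val) / 2


def points_to_half_points(points):
    """Convert a point size to half points, rejecting unrepresentable sizes."""
    half_points = points * 2
    if half_points != int(half_points):
        raise ValueError(f"{points}pt is not a whole number of half points")
    return int(half_points)


if __name__ == "__main__":
    # A stored value of "24" means 12pt text, not 24pt.
    print(half_points_to_points("24"))
    print(points_to_half_points(10.5))
```

Code that reads the same element name in a different context would need a different conversion, which is exactly the kind of trap that identical wording for different things creates.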

It should be changed to follow the goals and best practices of XML by using human-legible terms and distinct terminology.

For more, see this blog post from Open Malaysia: OOXML has poor XML Element names

Propagation of Historical Bugs

Within OOXML there are ways of dealing with some historical bugs (eg, autoSpaceLikeWord95). When a future revision of OOXML defines what autoSpaceLikeWord95 means, OOXML implementors will be able to distinguish legacy bugs from the intended behaviour. This is a good approach.

OpenFormula has a similar approach of adding additional flags to be compatible with historical bugs while preventing bug propagation in future documents. See this information on the OpenFormula CEILING function.

However, this technique is used only selectively within OOXML. For example,

  • There is a known bug in Microsoft Excel that treats the year 1900 as a leap year. Changing the Gregorian calendar is not necessary (or the best way) to achieve compatibility with spreadsheets that depend on this bug.
    A better solution is to define the spec correctly, and when converting old binary files to the new format, Microsoft Office would (for example) replace WEEKDAY() by WEEKDAY()+1 for any dates affected by this bug. Alternatively, since they have compatibility flags for several other legacy bugs, this could be handled that way as well, e.g., when importing a legacy Excel document, set a flag "LeapYearBug=true", but when creating a new OOXML document this flag would not be set and dates would be described correctly.
    -- Grokdoc: Gregorian Calendar
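
The compatibility-flag idea in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: serial 1 is taken to be 1 January 1900 and serial 60 to be the non-existent 29 February 1900, as the legacy behaviour is commonly described, and the `leap_year_bug` parameter is a hypothetical name borrowed from the "LeapYearBug" example in the quote.

```python
import calendar
from datetime import date, timedelta

# Assumption: in the legacy numbering, serial 60 is the phantom 29 Feb 1900.
PHANTOM_SERIAL = 60


def serial_to_date(serial, leap_year_bug=False):
    """Map a day serial (1 = 1 January 1900) to a real Gregorian date.

    With leap_year_bug=True the legacy numbering is emulated, and the
    phantom day (serial 60) returns None because no such date exists.
    """
    if serial < 1:
        raise ValueError("serial must be 1 or greater")
    if not leap_year_bug or serial < PHANTOM_SERIAL:
        # Correct calendar: simply count days from 31 December 1899.
        return date(1899, 12, 31) + timedelta(days=serial)
    if serial == PHANTOM_SERIAL:
        return None
    # Legacy serials after the phantom day are one too high.
    return date(1899, 12, 31) + timedelta(days=serial - 1)


if __name__ == "__main__":
    print(calendar.isleap(1900))
    print(serial_to_date(61, leap_year_bug=True))
    print(serial_to_date(61))
```

An importer would set the flag only for legacy files, so newly created documents count days against the real calendar.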

Microsoft correctly stated that the 1900 bug in Microsoft Excel was introduced to emulate a bug in Lotus 1-2-3. Correct blame is good, but we should still squash this bug now.

I talked quite a bit about techniques for achieving this, and nobody at the meeting disputed that it was possible. They should fix the 1900 leap year bug and the numerous mathematical bugs in SpreadsheetML, such as those in CEILING, AVEDEV, ZTEST, CONFIDENCE, CONVERT, and NETWORKDAYS; at least another 30 functions are affected. See these resources,

OOXML should fix the formulas and dates so that it remains compatible with legacy documents whilst not propagating their quirks to newly created documents.

Intellectual Property

The term Intellectual Property is of course not a specific thing but an umbrella term for Copyright, Patents, Trademarks, etc... Due diligence...

Copyright

The clip art mentioned in sections 2.18.4, 5.1.12.56 and 5.1.12.76 should be checked for copyright issues.

Patents

The Microsoft Open Specification Promise has this line about the patent grant, which covers

  • "Microsoft-owned or Microsoft-controlled patents that are necessary to implement only the required portions of the Covered Specification that are described in detail and not merely referenced in such Specification."

The Covenant not to Sue has similar wording to do with a required subset of features:

  • "Microsoft irrevocably covenants that it will not seek to enforce any of its patent claims necessary to conform to the technical specifications [...]".

While I Am Not A Lawyer, these plainly worded sentences suggest to me that patents are granted only for a subset of features in OOXML: the required or necessary ones. It would be great to check what licensing the non-required parts are available under, whether they meet the ISO RAND criteria, and whether my interpretation is even partly correct!

Internationally, Microsoft commissioned an analysis of the various patent grants from London legal firm Baker & McKenzie. This analysis, which tries to be a human-friendly version of the various grants, does not discuss the "required portions" distinction.

Trademarks

It would be good to see whether OOXML implementations could use trademarked terms from the spec in their advertising material, websites, etc. I don't know whether this is commonly done in ISO standards though -- is this too much to ask?

Accessibility

My main point here was in response to Chris Auld: file formats such as OOXML do affect accessibility, because although there are screen readers, there is also accessibility software that deals with files directly. An example of accessibility software dealing with files directly is "Blynx" (great name for accessibility software, that). Another example is the planned software project that analyses ODF files for accessibility problems.

A lot of accessibility benefits come from reusing existing tech (building upon existing standards which have accessibility software available). Other people have described the problems of accessibility in OOXML better than I can, so here's a link.

My Docvert software is currently used by disabled people to derive structure from poorly made word processing files, which helps them navigate documents (eg, because it can understand headings and such, they can have the section titles of the document read out and then narrow in on content without having to read the document linearly).

I was on the working group for the NZ E-government Web Guidelines by SSC as an accessibility and web standards guy. I also developed two versions of the E-government Website which of course had to include accessibility features.

It's of note that ODF 1.0 had some minor accessibility problems itself but these were addressed in ODF 1.1, and here's some info on how minor these problems were,

Proprietary Extensions and Technology

This one wasn't in my notes -- it's new information that's come to light since Thursday about Microsoft-specific technologies in OOXML.
These are some undocumented Microsoft technologies present in OOXML,

  • SSPI ("Security Support Provider Interface") which is a proprietary Microsoft-developed interface for security providers.
  • OLE ("Object Linking and Embedding") which is for embedding (eg, taking an Excel spreadsheet and putting it into a Word document). This is undefined in OOXML and only available on Microsoft Windows.
    • Update (30 August 2007): To clarify, on OSX/Mac Classic it's poorly implemented and not available outside Microsoft apps, and on Linux there have been efforts to reverse engineer it within OpenOffice.org, but as they don't have a standard to work from it's been slow work.

    I think it's important to note that on Windows many non-Microsoft office suites (such as OpenOffice, WordPerfect, etc.) have reverse engineered these, but the point is that in OOXML it's still undefined and it's still Windows-only tech that doesn't work on other platforms.

    And this one wasn't in my notes either but it's new information from Stéphane Rodriguez,

    It's a bit too emotionally written for me personally but you can't deny the evidence. Mr Rodriguez is a brilliant techy.

    Comments are appreciated and please keep it civil :)

    [*] Presentation on Ecma Points in OpenDocument [0.5MB] and Presentation on Ecma Points in PDF [0.4MB]