Subject: Re: [office-comment] shorter XML representations for the values ?


Jérôme Bouat <jerome.bouat@wanadoo.fr> wrote on 01/04/2015 05:16:28 AM:

> From: Jérôme Bouat <jerome.bouat@wanadoo.fr>

> To: office-comment@lists.oasis-open.org
> Date: 01/04/2015 05:16 AM
> Subject: [office-comment] shorter XML representations for the values ?
>
> Hello,
>
>
> Version 1.2 of the Open Document Format for Office Applications
> specifies how datatypes are represented in part 1, section 18.
>
> The current representation of values is sometimes verbose. For
> example, the string "false" represents the false boolean
> value (40 bits in UTF-8 to represent a 1-bit value). I
> understand that these representations provide human-readable values,
> but I think we could use a shorter representation that would still be
> readable by a human. For example, the characters "F" or "0" are
> still readable when they represent the false boolean value.
>
> Even if the size of the compressed "content.xml" member would not
> decrease much, the current representation means more bytes to
> read and write in memory when compressing or uncompressing, and more bytes
> for the application to parse when loading a data value into its
> internal representation.
>
>
> For example, as discussed above, a boolean is represented as follows:
> ---
> <table:table-cell table:style-name="ce1" office:value-type="boolean"
> office:boolean-value="false">
> <text:p>FAUX</text:p>
> </table:table-cell>
> ---
>
> Could we possibly use a shorter representation for the
> "office:boolean-value" attribute, such as "T" or "1" for the true value
> and "F" or "0" for the false value?
>


Hello Jérôme,

Thanks for writing.

One solution for the boolean issue would be to harmonize our office:value-type attribute with XML Schema datatypes, at least where the two sets of types overlap. XML Schema's boolean type allows four lexical forms: true, false, 1, and 0. That would allow a more compact form.
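
As a rough illustration of what accepting those four lexical forms could look like on the consumer side, here is a minimal Python sketch. The helper name `parse_xsd_boolean` is hypothetical, and the sketch shows the proposed harmonization rather than anything an existing ODF processor is known to do.

```python
# Lexical forms permitted by the XML Schema boolean datatype.
XSD_BOOLEAN_FORMS = {"true": True, "1": True, "false": False, "0": False}


def parse_xsd_boolean(text: str) -> bool:
    """Map an XML Schema boolean lexical form to a Python bool.

    Hypothetical helper: leading/trailing XML whitespace is trimmed,
    and anything outside the four allowed forms is rejected.
    """
    token = text.strip(" \t\r\n")
    try:
        return XSD_BOOLEAN_FORMS[token]
    except KeyError:
        raise ValueError(f"not an xsd:boolean lexical form: {text!r}") from None


if __name__ == "__main__":
    for form in ("true", "false", "1", "0"):
        print(form, "->", parse_xsd_boolean(form))
```

Note that "T" and "F" are not among the XML Schema forms, so this sketch rejects them.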

>
> For example, a floating point number is represented as follows:
> ---
> <table:table-cell office:value-type="float" office:value="123456789012345">
> <text:p>123456789012345</text:p>
> </table:table-cell>
> ---
>
> By using a base-62 representation (symbols in order of increasing weight:
> digits 0-9, letters a-z, letters A-Z), the value of the
> "office:value" attribute becomes "z3wBXdvb". This base-62
> representation is roughly half the size of the base-10
> representation. The application would have roughly half as many bytes to
> process in order to load the number into its internal
> representation. This could improve the performance of
> applications when reading or writing large files.
>
>
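
For reference, the base-62 scheme described in the quoted message can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, only non-negative integers are handled, and how fractional or negative values would be written is left open by the proposal.

```python
import string

# Symbols in order of increasing weight, as in the quoted proposal:
# 0-9, then a-z, then A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base-62 string (hypothetical helper)."""
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits))


def decode_base62(text: str) -> int:
    """Decode a base-62 string back to an integer (hypothetical helper)."""
    n = 0
    for ch in text:
        n = n * BASE + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    value = 123456789012345
    encoded = encode_base62(value)
    print(encoded, len(str(value)), len(encoded))
```

For the value in the example, this gives "z3wBXdvb": 8 characters instead of the 15 decimal digits.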


That would add considerable complexity to ODF processors, including byte-order concerns. The nice thing about using XML Schema datatypes is that they are well known and supported in tools. In particular, validating parsers can apply additional constraints. So a user could easily write a script, using just off-the-shelf XML tools, to confirm that all cells in an ODF spreadsheet have values between -50 and 1000. But if values are encoded like "z3wBXdvb", then it would require custom coding to make sense of that value.

One way to think of this: adherence to well-known standards provides an efficiency of its own, in terms of understanding, compatibility with existing tools, etc. But it might not be optimal in terms of run-time performance. An alternative here, which we've talked about before, would be to have a canonical binary encoding of ODF. Microsoft does something similar with Excel, having the XML-based OOXML format, but also having a specialized .xlsb format for optimized storage of very large spreadsheets.
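
To make the range-check idea concrete, here is a minimal sketch using only Python's standard XML library. The function name `out_of_range_values` and the inline sample fragment are assumptions for illustration; a real script would first extract content.xml from the ODF package.

```python
import xml.etree.ElementTree as ET

# ODF namespace URIs for the office: and table: prefixes.
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TABLE = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Hypothetical fragment standing in for the cells of a content.xml file.
SAMPLE = f"""<table:table xmlns:table="{TABLE}" xmlns:office="{OFFICE}">
  <table:table-row>
    <table:table-cell office:value-type="float" office:value="12.5"/>
    <table:table-cell office:value-type="float" office:value="123456789012345"/>
    <table:table-cell office:value-type="boolean" office:boolean-value="false"/>
  </table:table-row>
</table:table>"""


def out_of_range_values(xml_text: str, low: float = -50.0, high: float = 1000.0):
    """Return the float cell values that fall outside [low, high]."""
    root = ET.fromstring(xml_text)
    bad = []
    for cell in root.iter(f"{{{TABLE}}}table-cell"):
        if cell.get(f"{{{OFFICE}}}value-type") != "float":
            continue  # only numeric cells are checked in this sketch
        raw = cell.get(f"{{{OFFICE}}}value")
        if raw is None:
            continue  # skip cells that carry no value attribute
        value = float(raw)  # works because the value is plain decimal text
        if not low <= value <= high:
            bad.append(value)
    return bad


if __name__ == "__main__":
    print(out_of_range_values(SAMPLE))
```

The `float(raw)` call is the step that depends on the decimal lexical form; with a value such as "z3wBXdvb" it would raise an error unless a custom decoder were added.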

Regards,

-Rob

> Could the next specification possibly shorten the
> character strings that represent values?
>
> Are there any blockers that would prevent this change from being made in
> the next versions?
>
>
> Regards.