office-collab message

Subject: Re: [office-collab] Defining the Basics: The Search for Components

From: Svante Schubert <svante.schubert@gmail.com>
To: office-collab@lists.oasis-open.org
Date: Wed, 27 Mar 2013 14:55:11 +0100

Hi Oliver,

I agree, we might want to add an "in general", because (see below)...

On 27.03.2013 14:22, Oliver-Rainer Wittmann wrote:

Hi,

discussing "Component Search Criteria":
In general I agree.
The criterion "a change to a component does not change other XML" is a little bit tricky.
Some examples why I think it is tricky:
(A) A deletion of a paragraph element (<text:p> or <text:h> element) will have influence on the content of a <text:paragraph-count> element elsewhere in the document.

(B) A deletion of a paragraph element of type heading (<text:h> element) might have influence on the content of a <text:table-of-content> element elsewhere in the document.

XML does not exist only to be part of a component. There are two other reasons for XML. One is to serve as an "(aggregated) view" on the state of other components. This is the case for a table of contents or the paragraph count element.

(C) A change of the content of a paragraph which is part of a list might have influence on the content of a <text:bookmark-ref> element which is cross-referencing this paragraph.

The other reason, aside from being an "aggregated view", is to group components loosely together, as for style formatting (e.g. <text:span>) or the mentioned bookmark. These markers are not components themselves.
Someone might argue that removing a paragraph changes the state of the document as well, which is true, similar to the deletion of all its children.

(D) A change to table cell content in a spreadsheet document might have influence on the content of other table cells which reference this table cell.

The connection of cells via formula is indeed an explicit exception. Still, the modularity of a cell as a component was explicitly broken by an external formula referencing it, which is an explicit ODF mechanism.

(E) ...

I do not think that these examples will prevent us from defining a <text:p> element as the root of a component.
But I think we need to 'tune' the "Component Search Criteria" to reflect such XML changes that depend on the component's state.

Thanks for your feedback!
Svante

Mit freundlichen Grüßen / Best regards
Oliver-Rainer Wittmann

--
Advisory Software Engineer
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
Beim Strohhause 17
20097 Hamburg
Phone: +49-40-6389-1415
E-Mail: orwitt@de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH / Vorsitzende des Aufsichtsrats: Martina Koederitz
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294

From: Svante Schubert <svante.schubert@gmail.com>
To: "office-collab@lists.oasis-open.org" <office-collab@lists.oasis-open.org>,
Date: 27.03.2013 11:33
Subject: [office-collab] Defining the Basics: The Search for Components
Sent by: <office-collab@lists.oasis-open.org>

The Search for Components
Search via Document Schemas
Instead of searching every document at hand for components, the schema can be searched, as it automatically covers all possibilities. An automated reading of the schema, perhaps with a visualization in a front-end to analyze the XML, might be very helpful, as today's formats turn out to be quite complex¹. Ideally this would be a web-based application, so that the work of sorting the XML elements into components can be decentralized².

Component Search Criteria

A component is similar to a puzzle piece of a document: a logical unit that consists of one or more XML elements, which are usually connected but do not have to be (this depends on the decisions of the XML file format designer). The only rule is that a component has to be disjoint from other components. This means that if the data or the state of a component is changed, no other component’s data has to be changed (aside from, implicitly, the parent). In other words, changing the component’s existing XML (element, attribute or text), or XML that is related to it, changes the state of no component other than the containing component. The containing component changes its state because, for instance, when an image is deleted from the document, the document changes as well, but no other component, such as other images or tables at a different place, will change. Therefore, if a component is deleted, all of its XML (joint or spread over the XML file(s)) has to be deleted as a whole. Components usually have a specific XML element they start with, the “component root element”, like <text:p> for a paragraph in ODF. If the component may consist of multiple XML elements, there are “component leaf elements” as well. For instance, in ODF an image consists of the <draw:frame>, which provides the visual view size, and the <draw:image> element containing the loadable graphic, while in HTML there is only a single <img/> element.

Often there are a lot of boilerplate XML elements in a format that are not mapped to a component. For instance, the components of an ODF text document start below <office:document>/<office:body>/<office:text>.

All child elements of <office:text> are root components of the text document.
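
As a rough illustration of the statement above, the following sketch lists the children of <office:text> in a tiny, made-up ODF fragment, together with their child position. It assumes that every child counts as a root component, as stated above; the sample document and the function name top_level_components are only illustrative.

```python
import xml.etree.ElementTree as ET

# ODF namespace URIs for the office: and text: prefixes.
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Made-up sample document, reduced to the boilerplate path and three children.
SAMPLE = f"""<office:document xmlns:office="{OFFICE}" xmlns:text="{TEXT}">
  <office:body>
    <office:text>
      <text:h>Title</text:h>
      <text:p>First paragraph</text:p>
      <text:p>Second paragraph</text:p>
    </office:text>
  </office:body>
</office:document>"""


def top_level_components(xml_string):
    """Return (position, local element name) for each child of <office:text>."""
    root = ET.fromstring(xml_string)
    # Skip the boilerplate elements <office:document>/<office:body>.
    text_root = root.find(f"{{{OFFICE}}}body/{{{OFFICE}}}text")
    return [
        (position, child.tag.split("}")[1])
        for position, child in enumerate(text_root, start=1)
    ]
```

For the sample above, the heading would be listed at position 1 and the two paragraphs at positions 2 and 3.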

Similar to solving a Sudoku puzzle, it is best to solve the easy parts first and name the obvious components first. Aside from those root components, the components that users usually add via their applications are good starting points for an empirical approach.

Once a component has been found, its “component root elements” are best marked directly in the XML schema, for instance in a RelaxNG schema using annotations³. For multi-element components, either the closing “component leaf elements” should be marked as well or, if they are not easy to determine, the elements within the component should be marked as “component trunk elements”.
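
A minimal sketch of how such schema markings could be read back, assuming the marking is done with an attribute from a foreign namespace (RelaxNG permits such annotations). The annotation namespace urn:example:component-annotation, the attribute name role, its values "root" and "leaf", and the tiny schema are all hypothetical and not taken from the ODF schema.

```python
import xml.etree.ElementTree as ET

RNG = "http://relaxng.org/ns/structure/1.0"   # RelaxNG namespace
ANN = "urn:example:component-annotation"       # hypothetical annotation namespace

# Hypothetical, heavily reduced schema; not the real ODF schema.
SCHEMA = f"""<grammar xmlns="{RNG}" xmlns:c="{ANN}"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0">
  <define name="text-p">
    <element name="text:p" c:role="root"><text/></element>
  </define>
  <define name="draw-frame">
    <element name="draw:frame" c:role="root">
      <element name="draw:image" c:role="leaf"><empty/></element>
    </element>
  </define>
  <define name="text-span">
    <element name="text:span"><text/></element>
  </define>
</grammar>"""


def annotated_elements(schema_xml):
    """Map element names to their (hypothetical) component role annotation."""
    roles = {}
    for element in ET.fromstring(schema_xml).iter(f"{{{RNG}}}element"):
        role = element.get(f"{{{ANN}}}role")
        if role is not None:          # unannotated elements are not components
            roles[element.get("name")] = role
    return roles
```

Here <text:span> carries no annotation and is therefore not reported, matching the idea that such markers are not components themselves.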

Referencing Components

A component within the component tree is referenced by its position. As with a URL, position and identification should be the same. Components of all types (table, paragraph or character) should be handled equally when referenced by their position to allow easy generic access. The root of the document is “/” in the serialized string representing the position. All its children are counted in document order and represented by their child position as an integer. For instance, the first component, being a paragraph, would be accessed via “/1”. The third character within this child paragraph would be accessed via “/1/3”. If there is a table after the paragraph, the fifth paragraph within the 4th cell of the 3rd row would be accessed via “/2/3/4/5”.

Every component position can be mapped to its XML position.
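
The position scheme described above can be sketched as follows. The Component class, the helper names and the sample document (one paragraph followed by a table of 3 rows with 4 cells each) are assumptions made for illustration only; they are not part of any proposal text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    kind: str                      # e.g. "paragraph", "table", "character"
    value: str = ""                # only used for characters in this sketch
    children: List["Component"] = field(default_factory=list)


def paragraph(text):
    """Build a paragraph whose children are its characters."""
    return Component("paragraph", children=[Component("character", value=c) for c in text])


def parse_position(position):
    """Turn "/2/3/4/5" into [2, 3, 4, 5]; the root "/" becomes []."""
    if not position.startswith("/"):
        raise ValueError("a position must start with '/'")
    return [int(step) for step in position.split("/") if step]


def format_position(steps):
    """Inverse of parse_position."""
    return "/" + "/".join(str(step) for step in steps)


def resolve(root, position):
    """Walk the component tree; child positions are 1-based, in document order."""
    node = root
    for step in parse_position(position):
        if step < 1 or step > len(node.children):
            raise IndexError(f"no child {step} below a {node.kind}")
        node = node.children[step - 1]
    return node


# Sample document: a paragraph, then a table with 3 rows x 4 cells x 5 paragraphs.
table = Component("table", children=[
    Component("row", children=[
        Component("cell", children=[paragraph(f"r{r}c{c}p{p}") for p in range(1, 6)])
        for c in range(1, 5)])
    for r in range(1, 4)])
document = Component("document", children=[paragraph("Hello"), table])
```

With this sample, resolve(document, "/1/3") yields the third character of the first paragraph, and resolve(document, "/2/3/4/5") yields the fifth paragraph in the 4th cell of the 3rd row. All component types are reached through the same generic walk, which is the point of treating them equally.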

Programming Guidance:
The creation of a specific component tree can be easily accomplished during the load of an XML document by implementing/overriding the SAX ContentHandler interface. By overriding the startElement, endElement and characters methods, all XML elements that are component root elements, component delimiters and text can be gathered and mapped to operation calls (only sequential adding, i.e. no deletion, merge or split, while loading a document).
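
A minimal sketch of this idea using Python's xml.sax module. It assumes that only <text:p> and <text:h> are component roots, matches them by their prefixed name instead of by namespace, and treats every character inside a component as a child component. The operation names (add_paragraph, add_heading, add_character) and the sample fragment are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Assumed component root elements, matched by prefixed name for simplicity.
COMPONENT_ROOTS = {"text:p": "paragraph", "text:h": "heading"}

SAMPLE = (
    b'<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    b' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    b"<text:h>Hi</text:h><text:p>a<text:span>b</text:span></text:p>"
    b"</office:text>"
)


class ComponentHandler(ContentHandler):
    """Maps SAX events to a sequential list of 'add' operations."""

    def __init__(self):
        super().__init__()
        self.operations = []
        self.path = []           # position of the innermost open component
        self.counters = [0]      # children added so far, per open component level
        self.is_component = []   # one flag per open XML element

    def _next_position(self):
        self.counters[-1] += 1
        return self.path + [self.counters[-1]]

    @staticmethod
    def _format(position):
        return "/" + "/".join(str(step) for step in position)

    def startElement(self, name, attrs):
        kind = COMPONENT_ROOTS.get(name)
        self.is_component.append(kind is not None)
        if kind is not None:
            position = self._next_position()
            self.operations.append((f"add_{kind}", self._format(position)))
            self.path = position
            self.counters.append(0)

    def endElement(self, name):
        if self.is_component.pop():      # closing a component root element
            self.path = self.path[:-1]
            self.counters.pop()

    def characters(self, content):
        if not self.path:                # text outside any component is ignored
            return
        for char in content:
            position = self._next_position()
            self.operations.append(("add_character", self._format(position), char))


def load_operations(xml_bytes):
    handler = ComponentHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.operations
```

Calling load_operations(SAMPLE) produces only sequential add operations, in document order. The <text:span> does not open a new level, so the character "b" is added directly below its paragraph.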

1 The document formats are very complex. The ODF 1.2 part 1 for instance counts about 600 XML elements and about 1300 XML attributes, not to mention the different attribute values possible, e.g. to express styles. See http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html

2 See footnote 1.

3 http://relaxng.org/tutorial-20011203.html#IDA1OZR

Follow-Ups:
- RE: [office-collab] Defining the Basics: The Search for Components
  - From: John Haug <johnhaug@exchange.microsoft.com>

References:
- Defining the Basics: The Search for Components
  - From: Svante Schubert <svante.schubert@gmail.com>
- Re: [office-collab] Defining the Basics: The Search for Components
  - From: Oliver-Rainer Wittmann <ORWITT@de.ibm.com>