Defining the Basics: The Search for Components

Instead of searching every available document for components, one can search the schema, since the schema covers all possibilities by definition. Automated reading of the schema, perhaps with a front-end visualization for analyzing the XML, would be very helpful, as today's formats are quite complex¹. Ideally this would be a web-based application, so that the work of sorting the XML elements into components can be decentralized².
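As a minimal sketch of such automated schema reading, the following Python snippet lists every element name declared in a RELAX NG schema, giving a first list of candidates to sort into components. The schema fragment and the function name list_schema_elements are made up for illustration; this is not the real ODF schema, and a real tool would also have to handle name classes and included grammars.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Tiny hypothetical schema fragment for illustration (not the ODF schema).
SCHEMA = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="doc"/></start>
  <define name="doc">
    <element name="doc">
      <zeroOrMore><ref name="para"/></zeroOrMore>
    </element>
  </define>
  <define name="para">
    <element name="p">
      <optional><attribute name="style"/></optional>
      <text/>
    </element>
  </define>
</grammar>
"""


def list_schema_elements(schema_text):
    """Return the sorted names of all elements the schema declares."""
    root = ET.fromstring(schema_text)
    names = {
        element.get("name")
        for element in root.iter(f"{{{RNG_NS}}}element")
        if element.get("name")  # skip elements declared via name classes
    }
    return sorted(names)


print(list_schema_elements(SCHEMA))  # ['doc', 'p']
```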

Component Search Criteria

A component is like a puzzle piece of a document: a logical unit consisting of one or more XML elements, which are usually connected but do not have to be (this depends on the decisions of the XML file format designer). The only rule is that a component has to be disjoint from other components. This means that if the data or state of a component changes, no other component’s data has to change (aside from, implicitly, the parent). In other words, changing a component’s own XML (element, attribute or text), or XML related to it, changes the state of no component other than the containing component. The containing component does change state: if, for instance, an image is deleted from the document, the document changes as well, but no other component, such as other images or tables elsewhere, will change. Therefore, if a component is deleted, all of its XML (whether contiguous or spread over the XML file(s)) has to be deleted as a whole. Components usually start with a specific XML element, the “component root element”, such as <text:p> for a paragraph in ODF. If a component may consist of multiple XML elements, there are also “component leaf elements”. For instance, in ODF an image consists of the <draw:frame> element, which provides the visible size, and the <draw:image> element, which contains the loadable graphic, while in HTML there is only a single <img/> element.

Often a format contains many boilerplate XML elements that are not mapped to any component. For instance, the components of an ODF text document only start beneath <office:document>/<office:body>/<office:text>.

As with solving a Sudoku puzzle, it is best to start with the easy parts and name the obvious components first. Aside from those root components, the components that users typically add via their applications are good starting points for an empirical approach.

Once a component has been found, its “component root elements” are best marked directly in the XML schema. For multi-element components, also mark either the closing “component leaf elements” or, if these are not easy to determine, the elements inside the component as “component trunk elements”. In a RELAX NG schema, for instance, this can be done using annotations³.

Referencing Components

A component within the component tree is referenced by its position. As with a URL, position and identification should be the same. Components of all types (table, paragraph or character) should be handled equally when referenced by their position, to allow easy generic access. The root of the document is “/” in the serialized string representing the position. All children are counted in document order and represented by their child position as an integer. For instance, the first component, a paragraph, would be accessed via “/1”. The third character within this paragraph would be accessed via “/1/3”. If there is a table after the paragraph, the fifth paragraph within the 4th cell of the 3rd row would be accessed via “/2/3/4/5”.
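The addressing scheme can be sketched in a few lines of Python. The Component class and the functions parse_position and resolve are hypothetical names chosen for this illustration; the only things taken from the text are the “/” root and the child positions counted from 1 in document order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A node of the component tree; 'kind' is an illustrative label."""
    kind: str
    children: List["Component"] = field(default_factory=list)


def parse_position(position: str) -> List[int]:
    """Turn a serialized position such as "/2/3/4/5" into [2, 3, 4, 5]."""
    if not position.startswith("/"):
        raise ValueError("a position must start with '/'")
    return [int(step) for step in position.split("/") if step]


def resolve(root: Component, position: str) -> Component:
    """Walk from the document root to the component at 'position'."""
    node = root
    for step in parse_position(position):
        if not 1 <= step <= len(node.children):
            raise IndexError(f"no child {step} below a {node.kind}")
        node = node.children[step - 1]  # positions count from 1
    return node


def paragraph(text: str) -> Component:
    """A paragraph whose children are its individual characters."""
    return Component("paragraph", [Component("character") for _ in text])


# The example from the text: a paragraph followed by a table with
# 3 rows, 4 cells per row and 5 paragraphs per cell.
table = Component("table", [
    Component("row", [
        Component("cell", [paragraph("x") for _ in range(5)])
        for _ in range(4)])
    for _ in range(3)])
document = Component("document", [paragraph("abc"), table])

print(resolve(document, "/1/3").kind)      # character
print(resolve(document, "/2/3/4/5").kind)  # paragraph
```

Because every component type is just a child index, the same resolve function serves tables, paragraphs and characters alike.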

Programming Guidance:
A specific component tree can easily be created while loading an XML document by implementing the SAX ContentHandler interface (or overriding a default handler). By implementing the startElement, endElement and characters methods, all XML elements that are component root elements, component delimiters and text can be gathered and mapped to operation calls. Loading a document requires only sequential adding (no deletion, merge or split).
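The following Python sketch illustrates the idea with the standard xml.sax module, whose ContentHandler offers the same three callbacks. The COMPONENT_ROOTS table, the ("add", ...) and ("text", ...) operation tuples and the function load_operations are assumptions made for this example, not part of any specification. For brevity, elements are matched by their prefixed name rather than by namespace, and component leaf elements are not handled.

```python
import xml.sax

# Hypothetical mapping of component root elements to component kinds.
COMPONENT_ROOTS = {
    "text:p": "paragraph",
    "table:table": "table",
    "table:table-row": "row",
    "table:table-cell": "cell",
}


class ComponentHandler(xml.sax.ContentHandler):
    """Maps SAX events to sequential 'add' and 'text' operations."""

    def __init__(self):
        super().__init__()
        self.operations = []
        self._child_counts = [0]   # children seen so far, per open component
        self._path = []            # position steps of the open components
        self._kinds = []           # kinds of the open components
        self._is_component = []    # one flag per open XML element

    def _position(self):
        return "/" + "/".join(str(step) for step in self._path)

    def startElement(self, name, attrs):
        kind = COMPONENT_ROOTS.get(name)
        self._is_component.append(kind is not None)
        if kind is None:
            return  # boilerplate element, not mapped to a component
        self._child_counts[-1] += 1
        self._path.append(self._child_counts[-1])
        self._child_counts.append(0)
        self._kinds.append(kind)
        self.operations.append(("add", kind, self._position()))

    def endElement(self, name):
        if self._is_component.pop():
            self._path.pop()
            self._child_counts.pop()
            self._kinds.pop()

    def characters(self, content):
        # Text only counts inside a paragraph; each character is one child.
        if not self._kinds or self._kinds[-1] != "paragraph":
            return
        first = self._child_counts[-1] + 1
        self._child_counts[-1] += len(content)
        self.operations.append(("text", f"{self._position()}/{first}", content))


def load_operations(xml_text):
    handler = ComponentHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.operations


SAMPLE = """\
<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
  <office:body><office:text>
    <text:p>Hi</text:p>
    <table:table><table:table-row><table:table-cell>
      <text:p>Cell</text:p>
    </table:table-cell></table:table-row></table:table>
  </office:text></office:body>
</office:document>
"""

for operation in load_operations(SAMPLE):
    print(operation)
```

Note that a SAX parser may deliver one run of text in several characters calls, so consumers of the "text" operations should be prepared to receive a paragraph's text in more than one piece.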

1 The document formats are very complex. The ODF 1.2 part 1 for instance counts about 600 XML elements and about 1300 XML attributes, not to mention the different attribute values possible, e.g. to express styles. See http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html

2 The document formats are very complex. The ODF 1.2 part 1 for instance counts about 600 XML elements and about 1300 XML attributes, not to mention the different attribute values possible, e.g. to express styles. See http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
