Best regards
Oliver-Rainer Wittmann
--
Advisory Software Engineer
From: Svante Schubert <svante.schubert@gmail.com>
To: "office-collab@lists.oasis-open.org" <office-collab@lists.oasis-open.org>
Date: 27.03.2013 11:33
Subject: [office-collab] Defining the Basics: The Search for Components
Sent by: <office-collab@lists.oasis-open.org>
The Search for Components
Search via Document Schemas
Instead of searching every document at hand for components, the schema
can be searched, as it automatically covers all given possibilities. An
automated reading of the schema (perhaps with a visualization in a
front-end to analyze the XML) might be very helpful, as today's formats
turn out to be quite complex [1]. Ideally this would be a web-based
application, so that the work of sorting the XML elements into
components can be decentralized [2].
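As a rough illustration of such an automated reading, the sketch below lists every element name declared in a RELAX NG schema, which gives a first list of candidates to sort into components. The schema fragment and the function name `element_names` are made up for this example; it is not the real ODF schema, and a real tool would also have to follow includes and name classes.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Tiny hypothetical schema fragment for illustration, NOT the real ODF schema.
SCHEMA = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="text-p"/></start>
  <define name="text-p">
    <element name="text:p">
      <zeroOrMore><ref name="draw-frame"/></zeroOrMore>
    </element>
  </define>
  <define name="draw-frame">
    <element name="draw:frame">
      <element name="draw:image"><empty/></element>
    </element>
  </define>
</grammar>
"""


def element_names(schema_text):
    """Return the sorted, de-duplicated names of all <element> patterns."""
    root = ET.fromstring(schema_text)
    names = {
        el.get("name")
        for el in root.iter(f"{{{RNG_NS}}}element")
        if el.get("name")  # skip elements declared via a name class
    }
    return sorted(names)


print(element_names(SCHEMA))
```

Each name in the resulting list could then be tagged by hand as a component root element, part of a multi-element component, or boilerplate.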
Component Search Criteria
A component is like a puzzle piece of a document: a logical unit that
consists of one or more XML elements, which are usually connected but
do not have to be (this depends on the decisions of the XML file format
designer). The only rule is that a component has to be disjoint from
other components. This means that if the data or the state of a
component is changed, no other component’s data has to be changed
(aside, implicitly, from the parent). In other words, changing a
component’s existing XML (element, attribute or text), or XML that is
related to it, changes the state of no component other than the
containing one. The containing component changes its state because, for
instance, when an image is deleted from the document, the document
changes as well, but no other component, such as other images or tables
at a different place, will change. Therefore, if a component is
deleted, all of its XML (whether adjacent or spread over the XML
file(s)) has to be deleted as a whole.
Components usually start with a specific XML element, the “component
root element”, such as <text:p> for a paragraph in ODF. If a component
may consist of multiple XML elements, there are “component leaf
elements” as well. For instance, in ODF an image consists of the
<draw:frame>, which provides the visual view size, and the <draw:image>
element containing the loadable graphic, while in HTML there is only a
single <img/> element.
Often a format contains a lot of boilerplate XML elements that are not
mapped to any component. For instance, the components of an ODF text
document start below
<office:document>/<office:body>/<office:text>
All child elements of <office:text> are root components of the text
document.
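A minimal sketch of skipping that boilerplate: it walks down to <office:text> and lists its children as root components, following the rule stated above. The sample document and the helper name `root_components` are assumptions for this example; the namespace URIs are the ODF ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
}

# Hypothetical, heavily reduced flat ODF text document.
DOCUMENT = """\
<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
  <office:body>
    <office:text>
      <text:p>Hello</text:p>
      <table:table/>
      <text:p>World</text:p>
    </office:text>
  </office:body>
</office:document>
"""


def root_components(document_text):
    """Return the local names of the children of <office:text>, in order."""
    root = ET.fromstring(document_text)  # root is <office:document>
    office_text = root.find("office:body/office:text", NS)
    if office_text is None:
        return []
    # ElementTree stores tags as "{namespace}local"; keep the local part.
    return [child.tag.split("}", 1)[1] for child in office_text]


print(root_components(DOCUMENT))
```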
As with solving a Sudoku puzzle, it is best to solve the easy parts
first and name the obvious components. Aside from those root
components, the components that users usually add via their
applications are good starting points for an empirical approach.
When a component has been found, its “component root elements” are best
marked directly in the XML schema. For multi-element components, either
the ending “component leaf elements” are marked as well or, if those
are not easy to determine, the elements within the component are marked
as “component trunk elements”. In a RELAX NG schema, for instance, this
can be done using annotations [3].
Referencing Components
A component within the component tree is referenced by its position. As
with a URL, position and identification should be the same. Components
of all types (table, paragraph or character) should be handled equally
when referenced by their position, to allow easy generic access. The
root of the document is “/” in the serialized string representing the
position. Its children are counted in document order and represented by
their child position as an integer. For instance, the first component,
being a paragraph, is accessed via “/1”. The third character within
this paragraph is accessed via “/1/3”. If there is a table after the
paragraph, the fifth paragraph within the 4th cell of the 3rd row is
accessed via “/2/3/4/5”.
Every component position can be mapped to its XML position.
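The addressing scheme can be sketched as a small lookup over an in-memory component tree. The `Component` class, its fields and the `resolve` function are hypothetical names for this example only; the positions are 1-based, as in the examples above.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical tree node; type and field names are illustrative."""
    kind: str
    value: str = ""
    children: list = field(default_factory=list)


def resolve(root, position):
    """Follow a 1-based position string such as "/2/3/4/5" from the root."""
    if not position.startswith("/"):
        raise ValueError("a position must start with '/'")
    node = root
    for step in position.split("/"):
        if not step:
            continue  # skip the empty strings produced by the leading "/"
        index = int(step)
        if index < 1 or index > len(node.children):
            raise IndexError(f"no child {index} below a {node.kind}")
        node = node.children[index - 1]
    return node


def paragraph(text):
    """Build a paragraph whose children are single-character components."""
    return Component("paragraph", children=[Component("character", ch) for ch in text])


def text_of(component):
    """Concatenate the character children of a paragraph."""
    return "".join(ch.value for ch in component.children)


# A paragraph followed by a table of 3 rows x 4 cells x 5 paragraphs.
rows = [
    Component("row", children=[
        Component("cell", children=[paragraph(f"r{r}c{c}p{p}") for p in range(1, 6)])
        for c in range(1, 5)
    ])
    for r in range(1, 4)
]
document = Component("document", children=[paragraph("Hello"), Component("table", children=rows)])

print(resolve(document, "/1/3").value)
print(text_of(resolve(document, "/2/3/4/5")))
```

Because every component type is reached through the same `children` list, the lookup needs no special cases for tables, paragraphs or characters.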
Programming Guidance:
A specific component tree can easily be created while loading an XML
document by implementing/overriding the SAX ContentHandler interface.
By overriding the startElement, endElement and characters methods, all
XML elements that are component root elements or component delimiters,
as well as text, can be gathered and mapped to operation calls. Loading
a document only ever adds components sequentially (e.g. no deletion,
merge or split).
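The following is a minimal sketch of that approach using Python's `xml.sax`. The mapping in `COMPONENT_ROOTS`, the shape of the recorded operations and the sample input are assumptions made for this example, not part of any specification; each character inside a paragraph is counted as its own component, as in the position examples above.

```python
import xml.sax

# Assumption: which qualified names count as component root elements.
COMPONENT_ROOTS = {
    "text:p": "paragraph",
    "table:table": "table",
    "table:table-row": "row",
    "table:table-cell": "cell",
}


class ComponentHandler(xml.sax.ContentHandler):
    """Maps SAX events to sequential ("add", kind, position) operations."""

    def __init__(self):
        super().__init__()
        self.operations = []
        self.position = []       # position of the innermost open component
        self.child_counts = [0]  # children added so far, per open component
        self.kinds = []          # kinds of the open components
        self.opened = []         # per open XML element: is it a component?

    def _add(self, kind):
        self.child_counts[-1] += 1
        position = self.position + [self.child_counts[-1]]
        path = "/" + "/".join(str(step) for step in position)
        self.operations.append(("add", kind, path))
        return position

    def startElement(self, name, attrs):
        kind = COMPONENT_ROOTS.get(name)
        self.opened.append(kind is not None)
        if kind is not None:
            self.position = self._add(kind)
            self.child_counts.append(0)
            self.kinds.append(kind)

    def endElement(self, name):
        if self.opened.pop():
            self.position = self.position[:-1]
            self.child_counts.pop()
            self.kinds.pop()

    def characters(self, content):
        # Only text inside a paragraph is content; the rest is indentation.
        if self.kinds and self.kinds[-1] == "paragraph":
            for _ in content:
                self._add("character")


# Hypothetical, heavily reduced input.
SAMPLE = b"""\
<office:text
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">
  <text:p>Hi!</text:p>
  <table:table>
    <table:table-row>
      <table:table-cell><text:p>A</text:p></table:table-cell>
    </table:table-row>
  </table:table>
</office:text>
"""

handler = ComponentHandler()
xml.sax.parseString(SAMPLE, handler)
for operation in handler.operations:
    print(operation)
```

Elements that are not listed in `COMPONENT_ROOTS`, such as <office:text>, are treated as boilerplate here and produce no operation.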
[1] The document formats are very complex. ODF 1.2 part 1, for
instance, counts about 600 XML elements and about 1300 XML attributes,
not to mention the different attribute values possible, e.g. to express
styles. See
http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
[2] See note 1 (same text and source):
http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part1.html
[3] http://relaxng.org/tutorial-20011203.html#IDA1OZR