docstandards-interop-discuss message

Subject: RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?

From: Michael Priestley <mpriestl@ca.ibm.com>
To: "David RR Webber (XML)" <david@drrw.info>
Date: Tue, 10 Apr 2007 10:57:47 -0400

If we just used a presentation format for interchange, how would you preserve semantics, and how would you get a different look and feel?

For example, if I pull a DocBook procedure into a DITA web, I'd like to be able to:
a) identify it as an equivalent to a DITA task, so I can do things like sort related links appropriately, and
b) apply my own look-and-feel, including fonts, generated headings, headers/footers/standard navigation elements, etc.
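
To make point a) concrete, here is a minimal sketch of mapping a DocBook procedure onto a DITA task using Python's standard library. It assumes a heavily simplified, namespace-free DocBook input (a title plus steps containing paragraphs) and flattens each step to plain text; the function name `procedure_to_task` is hypothetical and this is not an existing converter.

```python
import xml.etree.ElementTree as ET

def procedure_to_task(docbook_xml, task_id):
    """Map a simplified DocBook <procedure> to a simplified DITA <task>.

    Assumption: no namespaces, and each step is reduced to its text.
    """
    proc = ET.fromstring(docbook_xml)
    task = ET.Element("task", id=task_id)
    ET.SubElement(task, "title").text = proc.findtext("title", default="")
    steps = ET.SubElement(ET.SubElement(task, "taskbody"), "steps")
    for step in proc.findall("step"):
        # A DITA step carries its instruction in a <cmd> element.
        cmd = ET.SubElement(ET.SubElement(steps, "step"), "cmd")
        cmd.text = "".join(step.itertext()).strip()
    return ET.tostring(task, encoding="unicode")
```

Once the content is identified as a task, the receiving system can apply its own sorting and styling, which is what a presentation format would not allow.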

There are several degrees of interoperability:
1) sharing content: pull the content into my deliverable, applying my own look and feel, navigation etc. - this is relatively simple, but already requires more than PDF as source
2) sharing semantics: pull the content into my production system, including specialized semantic processing for specialized elements - like treating task steps in a different way from generic list items
3) sharing constraints: provide equivalent constraints on both sides of the interchange, so that you can get robust integration of processes, and not break down every time someone feeds you a supposed "DITA task" that breaks the processing expectations by e.g. allowing multiple lists of steps under a single title, or more than one level of step nesting.
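
The two constraint violations named in 3) could be checked roughly as follows. This is only a sketch under assumptions: the element names (`task`, `steps`, `step`) form a simplified model rather than the real DITA content model, and `check_task_constraints` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def check_task_constraints(xml_text):
    """Return a list of constraint violations for a simplified task."""
    root = ET.fromstring(xml_text)
    problems = []

    # Constraint 1: only one list of steps directly under the task.
    if len(root.findall("steps")) > 1:
        problems.append("multiple lists of steps under a single title")

    # Constraint 2: at most one level of step nesting (a step inside a step).
    def max_step_depth(elem, depth=0):
        if elem.tag == "step":
            depth += 1
        return max([depth] + [max_step_depth(child, depth) for child in elem])

    if max_step_depth(root) > 2:
        problems.append("more than one level of step nesting")
    return problems
```

An empty result means the document meets both expectations; anything else can be rejected before it reaches the downstream processing.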

One of the proposals currently in place, including an argument for using an XML hub format for interchange with preservation of semantics, is here:
http://flatironssolutions.com/Downloads/DITA2007West.pdf - it provides a potential solution for 1) and 2); for 3), DITA has mechanisms for creating specialized content types that can match other existing standards while still processing as DITA content, which gives a potential solution for some cases at least.

Michael Priestley
IBM DITA Architect and Classification Schema PDT Lead
mpriestl@ca.ibm.com
http://dita.xml.org/blog/25

"David RR Webber (XML)" <david@drrw.info>

04/10/2007 10:41 AM

To	Michael Priestley/Toronto/IBM@IBMCA
cc	docstandards-interop-discuss@lists.oasis-open.org
Subject	RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?

Michael,

OK - then I believe the focus should be one level up. I'd postulate that content sharing has to be able to support document formats in a neutral way - a framework - rather than dictating one uber format or specific format - and then requiring transformation. From the human/business perspective - so long as the content can be presented consistently for human viewing / searching - the underlying machine level stuff is immaterial.

What I had been talking to Adobe about is creating XML scripting for handling PDF attachments. Now that PDF is an ISO submission, this opens up the way for that here.

The use case is from eGov - and the PDF is processed in several ways:

1) Checked to be valid PDF
- there are hundreds of "flavours" of PDF - so check that it's one you allow - e.g. reject if locked, not printable, editable, embedded graphics, wrong page size, no signature, wrong type of embedded notes, etc.
- make sure it's not corrupted and the CRC etc. is OK.

2) Check PDF for content required items
- simple text headings and other content
- required bookmarks and links OK
- if using embedded XML for metacontent - make sure those are there
- graphics items
- page counts - total pages

3) Post-processing
- text extraction for knowledge mining
- re-packaging for review - combining with bookmarks, ToC, adding review pages, etc.
- add or remove XML metacontent, notes, other flags
- re-size and rotate graphics and content pages to make them standard orientation and sizes
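
The most basic part of step 1 (is this a PDF at all, and did it arrive intact) can be sketched with the Python standard library alone. This does not use iText, and the flavour checks (locked, page size, signature and so on) would need a real PDF library. The function name `basic_pdf_checks` and the `expected_crc32` parameter are hypothetical.

```python
import zlib

def basic_pdf_checks(data, expected_crc32=None):
    """Return a list of reasons to reject the received bytes."""
    reasons = []
    # A PDF file normally begins with a "%PDF-<version>" header.
    if not data.startswith(b"%PDF-"):
        reasons.append("missing %PDF- header")
    # The file should end with an %%EOF marker (look only near the end).
    if b"%%EOF" not in data[-1024:]:
        reasons.append("missing %%EOF marker")
    # Optional integrity check against a checksum agreed with the sender.
    if expected_crc32 is not None and zlib.crc32(data) != expected_crc32:
        reasons.append("CRC mismatch")
    return reasons
```

An empty list means the attachment passes these minimal checks and can move on to the content checks in step 2.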

Attached is a sample of this XML.

While all this is specific to PDF - and targeted at the iText OSS implementation initially - you can create the "iText" functional toolset to work against any target document format - Word, ODF, etc. I would therefore suggest that it would make sense for the framework to be these three items:

1) Guidelines for document exchange - provides means to capture the who and the what - MoU / CPA level agreements
- can be both XML layout and / or document template.

2) Formal ability to express scripts that describe the content items, validations and checks and re-packaging occurring:
- sample for XML scripting to drive PDF receipt processing
- reverse scripting - template for generating document that will be filled in.

3) Formal set of document handling primitives to work with 2) that can be implemented for various document formats
- iText library good starting point for creating function set
- function set would be only a subset of these functions - aimed at exchange use case only

What this does therefore is allow exchanges to occur in a variety of document formats, both now, and into the future - but provides a common means to handle these, build them, and fill them in - regardless of the underlying syntax of the documents themselves.
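
As a rough sketch of how items 2) and 3) fit together: one set of primitives, implemented once per document format, with a format-neutral script run against whichever implementation applies. Everything here is hypothetical (the `DocumentHandler` interface, the `HANDLERS` registry, `run_script`), and the plain-text handler is only a toy stand-in for a real PDF, ODF or Word implementation.

```python
from abc import ABC, abstractmethod

class DocumentHandler(ABC):
    """Hypothetical primitives that each document format would implement."""

    @abstractmethod
    def is_valid(self, data: bytes) -> bool: ...

    @abstractmethod
    def extract_text(self, data: bytes) -> str: ...

class PlainTextHandler(DocumentHandler):
    """Toy stand-in for a real PDF/ODF/Word implementation."""

    def is_valid(self, data):
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def extract_text(self, data):
        return data.decode("utf-8")

# One handler per supported format; more formats are added here over time.
HANDLERS = {"text": PlainTextHandler()}

def run_script(fmt, data, required_headings):
    """Apply the same checks regardless of the underlying format."""
    handler = HANDLERS[fmt]
    if not handler.is_valid(data):
        return ["invalid document"]
    text = handler.extract_text(data)
    return [f"missing heading: {h}" for h in required_headings if h not in text]
```

The script only talks to the primitives, so adding a new format means adding a handler and leaving the exchange agreement untouched.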

Now of course this is a MUCH bigger elephant! How much work does the TC want to chew off?

Conversely - you could view it the other way around - the PDF / XML approach is "low hanging fruit" - the OSS implementation exists with a large and active community - providing the XML handler there would be quick - and an implementation to support it simple.

Once that PDF use case is in place - then extend it out to ODF and Word next....by implementing the iText functional set for those formats too. This would then enable the third piece of course - transformation - by proxy! I could open a PDF in iText - call the ODF Java functions to save it to ODF - but then that's getting ahead of ourselves....

Thanks, DW

"The way to be is to do" - Confucius (551-479 B.C.)

-------- Original Message --------

Specifically we want to formalize mechanisms for exchanging content between organizations or applications that are using different XML document standards - so not PDF per se, but ODF, DITA, and DocBook, for a start, and hopefully others as we progress.

pdfGenXML-sample.xml

Follow-Ups:
- Re: [docstandards-interop-discuss] Clarifications / Scope of the intended work?
  - From: "Dave Pawson" <dave.pawson@gmail.com>

References:
- RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?
  - From: "David RR Webber (XML)" <david@drrw.info>