docstandards-interop-discuss message

Subject: RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?

From: Michael Priestley <mpriestl@ca.ibm.com>
To: "David RR Webber (XML)" <david@drrw.info>
Date: Tue, 10 Apr 2007 11:20:29 -0400

Hi David,

So effectively what you're advocating is a hub format interchange, just like the Flatirons proposal, but with PDF with embedded XML metatagging, instead of XHTML with extra values in existing attributes.

What are the advantages of using PDF over XHTML?

Michael Priestley
IBM DITA Architect and Classification Schema PDT Lead
mpriestl@ca.ibm.com
http://dita.xml.org/blog/25

"David RR Webber (XML)" <david@drrw.info>

04/10/2007 11:10 AM

To	Michael Priestley/Toronto/IBM@IBMCA
cc	docstandards-interop-discuss@lists.oasis-open.org
Subject	RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?

Michael,

OK - I believe the function library approach easily handles the use case here. There are two XML scripts - one is for new content assembly - your use case below - and the other is for document post-processing and validation - my use case.

So - in your use case - repeated repurposing of content - the function library allows you to identify the content items - regardless of source (PDF, ODF, et al).

Here's how I'd see this working:

Step 1 - receive new content item / or retrieve content from repository
Step 2 - run validation script to verify that it is OK - has parts needed
Step 3 - run content creation script - extracts the parts from the docs - then applies the new layout etc.
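
A minimal Python sketch of these three steps, under stated assumptions: `ContentItem`, `REQUIRED_PARTS`, `validate`, `assemble` and `process` are hypothetical names invented for illustration, and the documents are assumed to have already been reduced to named text parts by whatever format-specific library is in use. It is not the XML scripting discussed in this thread, only the control flow.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """Hypothetical format-neutral view of a received or retrieved document."""
    source_format: str                          # e.g. "PDF", "ODF"
    parts: dict = field(default_factory=dict)   # part name -> extracted text


# Assumed list of parts every item must carry (illustrative only).
REQUIRED_PARTS = ("title", "body")


def validate(item, required=REQUIRED_PARTS):
    """Step 2: return the names of required parts that are missing or empty."""
    return [name for name in required if not item.parts.get(name)]


def assemble(items, wanted, layout):
    """Step 3: pull the wanted parts out of each item, then apply a layout."""
    extracted = [
        item.parts[name]
        for item in items
        for name in wanted
        if name in item.parts
    ]
    return layout(extracted)


def process(items, wanted, layout):
    """Steps 1-3: accept items, validate each one, then build new content."""
    for item in items:
        missing = validate(item)
        if missing:
            raise ValueError(
                f"{item.source_format} item is missing parts: {missing}"
            )
    return assemble(items, wanted, layout)
```

Here `layout` is any callable that turns the extracted parts into the new presentation; passing `"\n".join` simply concatenates them, which keeps the source format out of the assembly step.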

BTW - iText has all that "new layout and embedding" stuff in spades too - way more than is in the original PDF document - as you indicate - it is trivial to embed XML as well for DITA metatagging and so on into PDF docs that you generate. No surprises there - PDF is an extremely rich and mature syntax. You can cram as much DITA as you like into a PDF using the meta XML support it has!

DW

"The way to be is to do" - Confucius (551-472 B.C.)

-------- Original Message --------
Subject: RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?
From: Michael Priestley <mpriestl@ca.ibm.com>
Date: Tue, April 10, 2007 10:57 am
To: "David RR Webber (XML)" <david@drrw.info>
Cc: docstandards-interop-discuss@lists.oasis-open.org

If we just used a presentation format for interchange, how would you preserve semantics, and how would you get a different look and feel?

For example, if I pull a DocBook procedure into a DITA web, I'd like to be able to:
a) identify it as an equivalent to a DITA task, so I can do things like sort related links appropriately, and
b) apply my own look-and-feel, including fonts, generated headings, headers/footers/standard navigation elements, etc.

There are several degrees of interoperability:
1) sharing content: pull the content into my deliverable, applying my own look and feel, navigation etc. - this is relatively simple, but already requires more than PDF as source
2) sharing semantics: pull the content into my production system, including specialized semantic processing for specialized elements - like treating task steps in a different way from generic list items
3) sharing constraints: provide equivalent constraints on both sides of the interchange, so that you can get robust integration of processes, and not break down every time someone feeds you a supposed "DITA task" that breaks the processing expectations by e.g. allowing multiple lists of steps under a single title, or more than one level of step nesting.
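
To make degrees 2) and 3) concrete, here is a small Python sketch that maps a DocBook procedure onto a DITA task and rejects input that exceeds a nesting limit. It is illustrative only: the function names are hypothetical, the DocBook input is assumed to be un-namespaced (DocBook 4 style), only flat step lists are converted, and the `max_depth` parameter is an arbitrary example constraint rather than a statement of what either standard permits.

```python
import xml.etree.ElementTree as ET


def nesting_depth(procedure):
    """Return 0 for no steps, 1 for flat steps, 2 if a step has substeps, etc."""
    def depth(step):
        children = step.findall("substeps/step")
        return 1 + max((depth(child) for child in children), default=0)

    return max((depth(step) for step in procedure.findall("step")), default=0)


def procedure_to_task(procedure, task_id, max_depth=1):
    """Map <procedure> to <task>, enforcing an illustrative nesting constraint."""
    if nesting_depth(procedure) > max_depth:
        raise ValueError("step nesting exceeds the agreed constraint")

    task = ET.Element("task", id=task_id)
    ET.SubElement(task, "title").text = procedure.findtext("title", default="")
    steps = ET.SubElement(ET.SubElement(task, "taskbody"), "steps")

    for step in procedure.findall("step"):
        para = step.find("para")
        cmd = ET.SubElement(ET.SubElement(steps, "step"), "cmd")
        # Flatten any inline markup in the paragraph to plain text.
        cmd.text = "".join(para.itertext()).strip() if para is not None else ""
    return task
```

Because the result is a `task` element rather than a generic list, downstream processing can treat its steps differently from ordinary list items, which is the point of sharing semantics rather than just content.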

One of the proposals currently in place, including an argument for using an XML hub format for interchange with preservation of semantics, is here:
http://flatironssolutions.com/Downloads/DITA2007West.pdf - it provides a potential solution for 1) and 2); for 3), DITA has mechanisms for creating specialized content types that can match other existing standards while still processing as DITA content, which gives a potential solution for some cases at least.

Michael Priestley
IBM DITA Architect and Classification Schema PDT Lead
mpriestl@ca.ibm.com
http://dita.xml.org/blog/25

"David RR Webber (XML)" <david@drrw.info>
04/10/2007 10:41 AM

To	Michael Priestley/Toronto/IBM@IBMCA
cc	docstandards-interop-discuss@lists.oasis-open.org
Subject	RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?

Michael,

OK - then I believe the focus should be one level up. I'd postulate that content sharing has to be able to support document formats in a neutral way - a framework - rather than dictating one uber format or specific format - and then requiring transformation. From the human/business perspective - so long as the content can be presented consistently for human viewing / searching - the underlying machine level stuff is immaterial.

What I had been talking to Adobe about is creating XML scripting for handling PDF attachments. Now that PDF is an ISO submission - this opens up the way for that here.

The use case is from eGov - and the PDF is processed in several ways:

1) Checked to be valid PDF
- there are hundreds of "flavours" of PDF - so check that it's one you allow - e.g. reject if locked, not printable, editable, embedded graphics, wrong page size, no signature, wrong type of embedded notes, etc.
- make sure it's not corrupted and the CRC etc. are OK.

2) Check PDF for content required items
- simple text headings and other content
- required bookmarks and links OK
- if using embedded XML for metacontent - make sure those are there
- graphics items
- page counts - total pages

3) Post-processing
- text extraction for knowledge mining
- re-packaging for review - combining with bookmarks, ToC, adding review pages, etc.
- add or remove XML metacontent, notes, other flags
- re-size and rotate graphics and content pages to make them standard orientation and sizes

Attached is a sample of this XML.
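
Independently of that XML sample, the checks in 1) and 2) can be sketched in Python as a policy applied to a summary of the PDF. This is a rough illustration under stated assumptions: `PdfSummary`, `Policy` and `check` are hypothetical names, the default values are invented, and the summary is assumed to have been filled in beforehand by a PDF library such as iText. No real library calls are shown.

```python
from dataclasses import dataclass


@dataclass
class PdfSummary:
    """Hypothetical facts about a PDF, gathered beforehand by a PDF library."""
    locked: bool
    printable: bool
    signed: bool
    page_size: str
    page_count: int
    headings: tuple = ()
    bookmarks: tuple = ()
    has_embedded_xml: bool = False


@dataclass
class Policy:
    """Assumed acceptance rules agreed between the exchange partners."""
    allowed_page_sizes: tuple = ("A4", "Letter")
    require_signature: bool = True
    require_embedded_xml: bool = False
    required_headings: tuple = ()
    required_bookmarks: tuple = ()
    max_pages: int = 200


def check(pdf, policy):
    """Return a list of problems; an empty list means the PDF is accepted."""
    problems = []
    if pdf.locked:
        problems.append("document is locked")
    if not pdf.printable:
        problems.append("document is not printable")
    if policy.require_signature and not pdf.signed:
        problems.append("signature is missing")
    if pdf.page_size not in policy.allowed_page_sizes:
        problems.append(f"page size {pdf.page_size} is not allowed")
    if pdf.page_count > policy.max_pages:
        problems.append(f"too many pages: {pdf.page_count}")
    if policy.require_embedded_xml and not pdf.has_embedded_xml:
        problems.append("embedded XML metacontent is missing")
    for heading in policy.required_headings:
        if heading not in pdf.headings:
            problems.append(f"required heading is missing: {heading}")
    for bookmark in policy.required_bookmarks:
        if bookmark not in pdf.bookmarks:
            problems.append(f"required bookmark is missing: {bookmark}")
    return problems
```

Returning every problem at once, rather than stopping at the first, lets the receiving side send back one complete rejection notice.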

While all this is specific to PDF - and targeted at the iText OSS implementation initially - you can create the "iText" functional toolset to work against any target document format - Word, ODF, etc. I would therefore suggest that it makes sense for the framework to be three items:

1) Guidelines for document exchange - provides means to capture the who and the what - MoU / CPA level agreements
- can be both XML layout and / or document template.

2) Formal ability to express scripts that describe the content items, the validations and checks, and the re-packaging that occurs:
- sample for XML scripting to drive PDF receipt processing
- reverse scripting - template for generating document that will be filled in.

3) Formal set of document handling primitives to work with 2) that can be implemented for various document formats
- iText library good starting point for creating function set
- function set would be only a subset of these functions - aimed at exchange use case only

What this does therefore is allow exchanges to occur in a variety of document formats, both now, and into the future - but provides a common means to handle these, build them, and fill them in - regardless of the underlying syntax of the documents themselves.
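
One way to picture item 3) is a small abstract interface with one implementation per document format. The Python sketch below is an assumption-laden illustration: `DocumentHandler`, `register`, `handler_for` and the "toy" format are all hypothetical, and the three methods are an arbitrary subset chosen for the example, not the iText function set.

```python
from abc import ABC, abstractmethod


class DocumentHandler(ABC):
    """Hypothetical exchange-oriented primitives, implemented once per format."""

    @abstractmethod
    def extract_text(self, data: bytes) -> str: ...

    @abstractmethod
    def read_metadata(self, data: bytes) -> dict: ...

    @abstractmethod
    def write_metadata(self, data: bytes, metadata: dict) -> bytes: ...


HANDLERS = {}  # format name -> handler instance


def register(fmt):
    """Class decorator that registers a handler under a format name."""
    def wrap(cls):
        HANDLERS[fmt] = cls()
        return cls
    return wrap


@register("toy")
class ToyHandler(DocumentHandler):
    """Stand-in format: 'key=value' header lines, a blank line, then the body."""

    def _split(self, data):
        head, _, body = data.decode("utf-8").partition("\n\n")
        return head, body

    def extract_text(self, data):
        return self._split(data)[1]

    def read_metadata(self, data):
        head, _ = self._split(data)
        return dict(line.split("=", 1) for line in head.splitlines() if "=" in line)

    def write_metadata(self, data, metadata):
        body = self.extract_text(data)
        head = "\n".join(f"{key}={value}" for key, value in metadata.items())
        return f"{head}\n\n{body}".encode("utf-8")


def handler_for(fmt):
    """Look up the handler for a format, failing clearly if none is registered."""
    try:
        return HANDLERS[fmt]
    except KeyError:
        raise ValueError(f"no handler registered for {fmt!r}") from None
```

A script written against `DocumentHandler` never names a concrete format, so adding PDF, ODF or Word support would mean registering another handler rather than changing the scripts.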

Now of course this is a MUCH bigger elephant! How much work does the TC want to chew off?

Conversely - you could view it the other way around - the PDF / XML approach is "low hanging fruit" - the OSS implementation exists with a large and active community - providing the XML handler there would be quick - and an implementation to support it simple.

Once that PDF use case is in place - then extend it out to ODF and Word next... by implementing the iText functional set for those formats too. This would then enable the third piece of course - transformation - by proxy! I could open a PDF in iText - call the ODF Java functions to save it to ODF - but that's getting ahead of ourselves....

Thanks, DW

"The way to be is to do" - Confucius (551-472 B.C.)

-------- Original Message --------

Specifically we want to formalize mechanisms for exchanging content between organizations or applications that are using different XML document standards - so not PDF per se, but ODF, DITA, and DocBook, for a start, and hopefully others as we progress.

--------------------------------------------------------------------- To unsubscribe, e-mail: docstandards-interop-discuss-unsubscribe@lists.oasis-open.org For additional commands, e-mail: docstandards-interop-discuss-help@lists.oasis-open.org
