docstandards-interop-discuss message

Subject: RE: [docstandards-interop-discuss] Clarifications / Scope of the intended work?

From: "David RR Webber (XML)" <david@drrw.info>
To: Michael Priestley <mpriestl@ca.ibm.com>
Date: Tue, 10 Apr 2007 07:41:39 -0700

Michael,

OK - then I believe the focus should be one level up. I'd postulate that content sharing has to support document formats in a neutral way - through a framework - rather than dictating one uber-format or specific format and then requiring transformation. From the human/business perspective, so long as the content can be presented consistently for human viewing and searching, the underlying machine-level stuff is immaterial.

What I had been talking to Adobe about is creating XML scripting for handling PDF attachments. Now that PDF is an ISO submission, this opens up the way for that here.

The use case is from eGov - and the PDF is processed in several ways:

1) Checked to be valid PDF

- there are hundreds of "flavours" of PDF - so check that it's one you allow - e.g. reject if locked, not printable, editable, embedded graphics, wrong page size, no signature, wrong type of embedded notes, etc.

- make sure it's not corrupted and that the CRC etc. are OK.

2) Check PDF for required content items

- simple text headings and other content

- required bookmarks and links OK

- if using embedded XML for metacontent - make sure those are there

- graphics items

- page counts - total pages

3) Post-processing

- text extraction for knowledge mining

- re-packaging for review - combining with bookmarks, ToC, adding review pages, etc.

- add or remove XML metacontent, notes, other flags

- re-size and rotate graphics and content pages to make them standard orientation and sizes
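
As a rough illustration of the cheapest part of step 1, here is a minimal Python sketch. It assumes only that the PDF bytes are already in memory; the function name basic_pdf_sanity is hypothetical, and the "/Encrypt" scan is a crude heuristic rather than a real parse. Checks such as page size, signatures or embedded notes would need a proper PDF library such as iText.

```python
import re


def basic_pdf_sanity(data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means these cheap checks passed."""
    problems = []
    # A PDF file begins with a header such as "%PDF-1.4".
    if re.match(rb"%PDF-\d\.\d", data) is None:
        problems.append("missing %PDF- header")
    # A complete file ends with an %%EOF marker; look in the last 1024 bytes.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF trailer marker")
    # Heuristic only: encrypted ("locked") files carry an /Encrypt entry.
    if b"/Encrypt" in data:
        problems.append("document appears to be encrypted")
    return problems
```

A receiving application could reject the attachment outright when the list is non-empty, before handing it to the heavier content checks in step 2.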

Attached is a sample of this XML.

While all this is specific to PDF - and targeted at the iText OSS implementation initially - you can create the "iText" functional toolset to work against any target document format - Word, ODF, etc. I would therefore suggest that it makes sense for the framework to be three items:

1) Guidelines for document exchange - provides means to capture the who and the what - MoU / CPA level agreements

- can be both XML layout and / or document template.

2) Formal ability to express scripts that describe the content items, the validations and checks, and the re-packaging that occurs:

- sample for XML scripting to drive PDF receipt processing

- reverse scripting - template for generating document that will be filled in.

3) Formal set of document handling primitives to work with 2) that can be implemented for various document formats

- iText library good starting point for creating function set

- function set would be only a subset of these functions - aimed at exchange use case only
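
To make item 3) a little more concrete, here is a minimal Python sketch of what a format-neutral primitive set might look like. Every name in it (DocumentHandler, PlainTextHandler, open_document) is hypothetical and not taken from iText; the plain-text backend is a toy stand-in for a real PDF, ODF or Word implementation.

```python
from abc import ABC, abstractmethod


class DocumentHandler(ABC):
    """Hypothetical format-neutral primitives; names are illustrative only."""

    @abstractmethod
    def page_count(self) -> int: ...

    @abstractmethod
    def extract_text(self) -> str: ...

    def contains(self, phrase: str) -> bool:
        # Shared check built on a primitive, so it works for any format.
        return phrase in self.extract_text()


class PlainTextHandler(DocumentHandler):
    """Toy backend: pages are separated by form-feed characters."""

    def __init__(self, raw: str):
        self._pages = raw.split("\f")

    def page_count(self) -> int:
        return len(self._pages)

    def extract_text(self) -> str:
        return "\n".join(self._pages)


# A real registry would add entries such as "PDF", "ODF" or "Word".
HANDLERS = {"text": PlainTextHandler}


def open_document(fmt: str, raw: str) -> DocumentHandler:
    """Pick the backend for the declared format."""
    return HANDLERS[fmt](raw)
```

The point of the design is that scripts from item 2) only ever call the abstract primitives, so adding a new document format means adding one backend rather than rewriting the scripts.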

What this does, therefore, is allow exchanges to occur in a variety of document formats, both now and in the future, while providing a common means to handle these, build them, and fill them in - regardless of the underlying syntax of the documents themselves.

Now of course this is a MUCH bigger elephant! How much work does the TC want to bite off?

Conversely - you could view it the other way around - the PDF / XML approach is "low hanging fruit" - the OSS implementation exists with a large and active community - providing the XML handler there would be quick - and an implementation to support it simple.

Once that PDF use case is in place, then extend it out to ODF and Word next, by implementing the iText functional set for those formats too. This would then enable the third piece of course - transformation - by proxy! I could open a PDF in iText and call the ODF Java functions to save it to ODF - but then that's getting ahead of ourselves....

Thanks, DW

"The way to be is to do" - Confucius (551-479 B.C.)

-------- Original Message --------

Specifically we want to formalize mechanisms for exchanging content between organizations or applications that are using different XML document standards - so not PDF per se, but ODF, DITA, and DocBook, for a start, and hopefully others as we progress.

<?xml version="1.0" encoding="UTF-8"?>
<pdfGenXML xmlns:xmp="http://www.adobe.com/xmp">
  <pdfHeader>
    <!-- This allows setting of various properties for the PDF document -->
    <pdfSettings>
     <pdfSet property="PageSize" value="8.5x11"/>
     <pdfSet property="DPI" value="72"/>
    </pdfSettings>
    <!-- Also XMP metadata tags -->
   <pdfXMP>
      <xmp:Stuff/>
   </pdfXMP>
  </pdfHeader>
  <pdfContent sourceURL='c:\samples\content\docs1'>
     <pdfDefaults>
       <pdfPgHdr lines="2" text="A sample generated PDF //@date()//" align="middle"/>
       <pdfPgFtr lines="2" text="Copyright //@char('#1234')// OASIS pdfGen TC - Page //@page()//" align="left"/>
     </pdfDefaults>
     <pdfPage>
        <pdfSuppress Hdr="true" Ftr="true"/>
        <pdfText syntax="HTML" font="Times Roman" size="3">
          <br/><h1>Our Sample Document</h1>
          <br/><br/><h3>Generated using pdfGen XML scripting</h3>
        </pdfText>
        <pdfBarCode style="3of9" rotated="no" position="10,5" startvalue="120055544"/>
     </pdfPage>
     <pdfPage>
       <pdfTOC style="default">
         <pdfBookMark name="Chapter 1"/>
         <pdfBookMark name="Chapter 2"/>
         <pdfBookMark name="Chapter 3"/>
         <pdfBookMark name="Chapter 4"/>
       </pdfTOC>
     </pdfPage>
     <pdfPage>
         <pdfSetBookMark name="Chapter 1"/>
         <pdfInsert type="PDF" sourceDOC='..\mydoc1.pdf' scaleContent="false"/>
         <pdfValidation>
           <pdfCheck condition="pageCount" max="1" severity="warn">WARNING: //@sourceDOC()// page count more than one.</pdfCheck>
           <pdfCheck condition="required" severity="error">ERROR: //@sourceDOC()// missing.</pdfCheck>
         </pdfValidation>
     </pdfPage>
     <pdfPage>
         <pdfSetBookMark name="Chapter 2"/>
          <pdfText space="preserve" syntax="text" font="Times Roman" size="3">
  Sample Image of Blocked Pipe

          </pdfText>
         <pdfInsert type="JPG" sourceDOC="..\mypic1.jpg" scaleContent="fitToPage"/>
     </pdfPage>
     <pdfPage>
         <pdfInsert type="PDF" sourceDOC="..\mydoc2.pdf" scaleContent="false" editable="flatten" preserveNotes="true"/>
     </pdfPage>
     <pdfPage>
         <pdfSetBookMark name="Chapter 3"/>
         <pdfInsert type="PDF" sourceDOC="..\mydoc3.pdf" scaleContent="adjustPageSize" landscape="rotate"/>
         <pdfValidation>
           <pdfCheck condition="contains" value="Introduction to PDF handling">WARNING: //@sourceDOC()// missing topic - 'Introduction to PDF handling'.</pdfCheck>
           <pdfCheck condition="required" severity="error">ERROR: //@sourceDOC()// missing.</pdfCheck>
         </pdfValidation>
     </pdfPage>
     <pdfPage>
         <pdfSetBookMark name="Chapter 4"/>
         <pdfInsert type="XFO" sourceDOC="..\mydoc3.xml" stylesheet="..\page-layout.xsl"/>
     </pdfPage>
  </pdfContent>
  <pdfOnError>
    <pdfIfError>
     <pdfPage>
        <pdfText space="preserve" syntax="HTML" font="Times Roman" size="3">
          <br/><h1>ERROR OCCURRED:</h1>
          <br/><br/><h3>Generation failed - reason:</h3><br/>
          &lt;pre&gt;
        </pdfText>
        <pdfErrorText/>
        <pdfText space="preserve" syntax="HTML" font="Times Roman" size="3">
          &lt;/pre&gt;
        </pdfText>
     </pdfPage>
    </pdfIfError>
     <pdfReport method="REST" targetURL="my.webservice.com:8044/catchit" syntax="http">
       <pdfPage>
        <pdfIfError>
         <pdfText space="preserve" syntax="HTML" font="Times Roman" size="3">
          <br/><h1>ERROR OCCURRED:</h1>
          <br/><br/><h3>Generation failed - reason:</h3><br/>
          &lt;pre&gt;
         </pdfText>
        </pdfIfError>

        <pdfIfWarn>
         <pdfText space="preserve" syntax="HTML" font="Times Roman" size="3">
          <br/><h1>Warning:</h1>
          <br/><br/><h3>Invalid content - reason:</h3><br/>
          &lt;pre&gt;
         </pdfText>
        </pdfIfWarn>

        <pdfErrorText/>
        <pdfText space="preserve" syntax="HTML" font="Times Roman" size="3">
          &lt;/pre&gt;
        </pdfText>
       </pdfPage>
     </pdfReport>
  </pdfOnError>
</pdfGenXML>
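
The pdfValidation blocks in the sample could be interpreted along these lines. This is only a minimal Python sketch under stated assumptions: the function run_checks is hypothetical, the page count and extracted text are passed in rather than read from a real PDF, a check with no severity attribute is assumed to be a warning, and the "required" condition is not modelled. The embedded script is an abbreviated fragment in the style of the sample, not a copy of it.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment in the style of the sample script above.
SCRIPT = """
<pdfPage>
  <pdfInsert type="PDF" sourceDOC="mydoc1.pdf" scaleContent="false"/>
  <pdfValidation>
    <pdfCheck condition="pageCount" max="1" severity="warn">page count more than one</pdfCheck>
    <pdfCheck condition="contains" value="Introduction" severity="error">missing topic</pdfCheck>
  </pdfValidation>
</pdfPage>
"""


def run_checks(page_xml: str, page_count: int, text: str) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for every check that fails."""
    failures = []
    for check in ET.fromstring(page_xml).iter("pdfCheck"):
        condition = check.get("condition")
        # Assumption: a missing severity attribute means "warn".
        severity = check.get("severity", "warn")
        if condition == "pageCount":
            failed = page_count > int(check.get("max"))
        elif condition == "contains":
            failed = check.get("value") not in text
        else:
            continue  # conditions this sketch does not model, e.g. "required"
        if failed:
            failures.append((severity, (check.text or "").strip()))
    return failures
```

A processor could then route "error" results to the pdfIfError branch and "warn" results to the pdfIfWarn branch of pdfOnError.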
