docbook-apps message

Subject: RE: DOCBOOK-APPS: docbookXML conversion utility - looking for betasites

Hi Trevor,

Thanks for your note.

We are actually providing both 1) a set of "precise conversion" rules -
which imply that the document uses specific styles, and 2) a set of
"imprecise conversion" rules - which imply that you don't know what styles
or formatting the document is using (but that there are some).  We then use
"clues" to home in on the elements - including font size, positioning, etc.

We find that though many Word users don't use styles, they still follow
certain conventions - chapter headings are bigger than body paragraphs,
captions, etc.  This applies to PDF files and to other files.
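To illustrate the kind of "clue" meant here, the following is a minimal
Python sketch of a font-size heuristic. It is not the tool's actual rule
set: the function name classify_blocks, the 1.5x threshold, and the
element labels are all assumptions made up for this example. It takes
the most common font size to be body text and guesses the role of every
other block relative to that.

```python
from collections import Counter

def classify_blocks(blocks):
    """Guess a role for each (text, font_size) block.

    Hypothetical heuristic: the most frequent font size is assumed to be
    body text; larger text is a heading, smaller text is a caption.
    """
    # Most common size across blocks is taken as the body paragraph size.
    body_size = Counter(size for _, size in blocks).most_common(1)[0][0]
    result = []
    for text, size in blocks:
        if size >= body_size * 1.5:      # assumed threshold for chapter titles
            guess = "chapter-title"
        elif size > body_size:
            guess = "section-title"
        elif size < body_size:
            guess = "caption"
        else:
            guess = "para"
        result.append((guess, text))
    return result

# Made-up sample input: (text, font size in points)
sample = [
    ("Introduction", 24.0),
    ("DocBook is an XML vocabulary.", 11.0),
    ("Background", 14.0),
    ("It is widely used for manuals.", 11.0),
    ("Figure 1: toolchain", 9.0),
    ("More body text.", 11.0),
]

for guess, text in classify_blocks(sample):
    print(f"{guess:14} {text}")
```

A rule like this will misfire on documents that break the convention,
which is why it can only ever be one clue among several (positioning,
spacing, numbering and so on).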

Of course, the precise rules will be more effective for a single document,
but the imprecise rules can be applied across many different types of
documents from many different sources and still give you a reasonable
accuracy (though not necessarily 100%, and of course there are always
documents that are very difficult).  The tool does 70%-80% of the
conversion automatically, and provides an easy-to-use UI to do the rest.

As I said, right now we're looking for sample users with real content (PDF,
HTML, Word, etc.)  to test the software out on for our beta cycle.


Riz Virk, (617) 905-3518
riz@xyztechnologies.com, riz@alum.mit.edu

-----Original Message-----
From: Trevor Jenkins [mailto:Trevor.Jenkins@suneidesis.com]
Sent: Friday, June 21, 2002 10:15 AM
To: Rizwan Virk
Cc: DocBook applications
Subject: Re: DOCBOOK-APPS: docbookXML conversion utility - looking for
beta sites

On Fri, 21 Jun 2002, Rizwan Virk wrote:

> While scanning the DocBook world, it seems that there is a lot of
> attention paid to how to take content in docbook and to transform it into other
> formats - text, HTML, PDF, etc., but not so much the other way around.

HTML might be amenable to such transformations, it is after all an
application of SGML/XML. But PDF is not so easy. Someone intent upon
preventing this backwards conversion could construct a pathological case
such that the text cannot be recovered in any sensible fashion.
Obfuscating the text flow by rearranging the drawing primitives so that
the text is not rendered sequentially will defeat all but the human eye.
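As a rough illustration of the problem, here is a minimal Python sketch
that tries to rebuild reading order from text fragments drawn in an
arbitrary sequence. Everything in it is an assumption for the example:
the (x, y, text) fragment format, the reading_order name, and the
line_tolerance value. It assumes PDF-style coordinates where y grows
upward and a simple single-column page.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Rebuild lines of text from (x, y, text) fragments in any draw order.

    Assumes y grows upward (PDF-style) and a single-column layout.
    """
    # Top of the page first, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines = []
    for x, y, text in ordered:
        # Fragments whose baselines nearly coincide share a line.
        if lines and abs(lines[-1]["y"] - y) <= line_tolerance:
            lines[-1]["parts"].append((x, text))
        else:
            lines.append({"y": y, "parts": [(x, text)]})
    # Within each line, order the parts by x position.
    return [" ".join(t for _, t in sorted(line["parts"])) for line in lines]

# Made-up fragments, deliberately out of sequence.
shuffled = [
    (120, 700, "world"),
    (72, 686, "second"),
    (72, 700.5, "hello"),
    (110, 686, "line"),
]

for line in reading_order(shuffled):
    print(line)
```

Sorting by position copes with simple reordering like this, but it is
easy to see how multi-column layouts, per-character placement, or
overlapping text would break it, which is the pathological case
described above.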

> We're looking for people who want to convert existing content (PDF files,
> word files, text files, HTML files) into DocBook but don't want to do it
> manually and who don't already have appropriate tags on the content.

There are parallel problems with Word files. Users who do not use
stylesheets can construct documents that are difficult (maybe
impossible) to convert to SGML/XML. Where style information has been used
consistently, it is easier.

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living
Support the campaign for formal recognition by the British government now!
Details at http://www.fdp.org.uk/

