Subject: RE: DOCBOOK-APPS: docbookXML conversion utility - looking for beta sites
Hi Trevor,

Thanks for your note. We are actually providing both 1) a set of "precise conversion" rules, which assume that the document uses specific styles, and 2) a set of "imprecise conversion" rules, which assume that you don't know what styles or formatting the document is using (but that there are some). We then use "clues" to home in on the elements, including font size, positioning, etc. We find that though many Word users don't use styles, they still follow certain conventions: chapter headings are bigger than body paragraphs, captions, etc. This applies to PDF files and to other files.

Of course, the precise rules will be more effective for a single document, but the imprecise rules can be applied across many different types of documents from many different sources and still give you reasonable accuracy (though not necessarily 100%, and of course there are always documents that are very difficult). The tool does 70%-80% of the work, and provides an easy-to-use UI to do the rest.

As I said, right now we're looking for sample users with real content (PDF, HTML, Word, etc.) to test the software out on for our beta cycle.

Thanks,
Riz

------------------------------
Riz Virk, (617) 905-3518
firstname.lastname@example.org, email@example.com
http://www.xyztechnologies.com

-----Original Message-----
From: Trevor Jenkins [mailto:Trevor.Jenkins@suneidesis.com]
Sent: Friday, June 21, 2002 10:15 AM
To: Rizwan Virk
Cc: DocBook applications
Subject: Re: DOCBOOK-APPS: docbookXML conversion utility - looking for beta sites

On Fri, 21 Jun 2002, Rizwan Virk wrote:

> While scanning the DocBook world, it seems that there is a lot of attention
> paid to how to take content in docbook and to transform it into other
> formats - text, HTML, PDF, etc., but not so much the other way around.

HTML might be amenable to such transformations; it is, after all, an application of SGML/XML. But PDF is not so easy. Someone intent upon preventing this backwards conversion could construct a pathological case such that the text cannot be recovered in any sensible fashion. Obfuscating the text flow by rearranging the drawing primitives so that the text is not rendered sequentially will defeat all but the human eye.

> We're looking for people who want to convert existing content (PDF files,
> word files, text files, HTML files) into DocBook but don't want to do it
> manually and who don't already have appropriate tags on the content.

There are parallel problems with Word files. Users who do not use stylesheets can construct documents that are difficult (maybe impossible) to convert to SGML/XML. Where style information has been used consistently, it is easier.

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living language. Support the campaign for formal recognition by the British government now! Details at http://www.fdp.org.uk/

--
<>< Re: deemed!
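
The font-size "clue" mentioned in the reply above (headings are usually set larger than body paragraphs) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's actual rules: the `Block` input model, the `heading_ratio` threshold of 1.3, and the use of the median size as the "body" size are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import median
from xml.sax.saxutils import escape


@dataclass
class Block:
    """A run of text plus the font size it was rendered at (hypothetical model)."""
    text: str
    font_size: float


def classify(blocks, heading_ratio=1.3):
    """Label each block 'title' or 'para' by comparing it to the median size.

    heading_ratio is an assumed threshold, not a value taken from the thread.
    """
    if not blocks:
        return []
    body_size = median(b.font_size for b in blocks)
    return [
        ("title" if b.font_size >= body_size * heading_ratio else "para", b.text)
        for b in blocks
    ]


def to_docbook(blocks):
    """Emit a flat DocBook-style fragment; each title opens a new <section>."""
    out = []
    open_section = False
    for tag, text in classify(blocks):
        if tag == "title":
            if open_section:
                out.append("</section>")
            out.append("<section>")
            open_section = True
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    if open_section:
        out.append("</section>")
    return "\n".join(out)


sample = [
    Block("Introduction", 18),
    Block("Body text one.", 11),
    Block("Body two & more.", 11),
    Block("Details", 16),
    Block("Body three.", 11),
]
print(to_docbook(sample))
```

A heuristic like this works only when body text dominates the document, so that the median reflects the body size; a page made up mostly of headings would be misclassified, which fits the point above that the imprecise rules cannot be expected to reach 100%.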