docbook-apps message

Subject: RE: DOCBOOK-APPS: From RTF to DocBook
From: Jeff Beal <jeff.beal@ansys.com>
To: docbook-apps@lists.oasis-open.org
Date: Fri, 12 Jul 2002 08:55:04 -0400
Using Tidy, you can convert "dirty HTML" into "clean XHTML" and then run XSLT
transformations on the XHTML.  There's an example stylesheet on the DocBook
Wiki for converting XHTML into DocBook, which can serve as a starting point.
As Petr said, there's always a lot of handwork involved when converting from
visual markup into DocBook.
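
To give a feel for what such an XHTML-to-DocBook mapping does, here is a
minimal sketch in Python using only the standard library.  It is not the
stylesheet from the DocBook Wiki, and it stands in for XSLT only for the sake
of a self-contained example.  The element table (TAG_MAP) and the function
names are assumptions made for illustration; a real conversion needs many more
rules, and headings are deliberately left unmapped because building nested
sections out of flat headings is exactly where the handwork comes in.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from XHTML element names to DocBook element names.
# It is illustrative only and far from complete.
TAG_MAP = {
    "p": "para",
    "em": "emphasis",
    "ul": "itemizedlist",
    "ol": "orderedlist",
    "code": "literal",
    "pre": "programlisting",
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[1] if tag.startswith("{") else tag


def convert(node):
    """Recursively copy an XHTML element into a DocBook-named element."""
    name = local_name(node.tag)
    if name == "li":
        # DocBook list items hold block elements, so wrap the text in a para.
        result = ET.Element("listitem")
        out = ET.SubElement(result, "para")
    else:
        # Unmapped elements keep their HTML name so they are easy to find
        # and fix by hand afterwards.
        result = out = ET.Element(TAG_MAP.get(name, name))
    out.text = node.text
    for child in node:
        converted = convert(child)
        converted.tail = child.tail
        out.append(converted)
    return result


def xhtml_to_docbook(xhtml_text):
    """Convert the <body> of an XHTML document into a DocBook <article>."""
    root = ET.fromstring(xhtml_text)
    body = root.find(XHTML_NS + "body")
    if body is None:
        raise ValueError("no XHTML <body> element found")
    article = ET.Element("article")
    for child in body:
        article.append(convert(child))
    return ET.tostring(article, encoding="unicode")
```

Calling xhtml_to_docbook() on Tidy's output returns the DocBook markup as a
string; anything still carrying an HTML element name in the result marks a
spot that needs manual attention.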

Jeff

-----Original Message-----
From: Prikryl,Petr [mailto:PRIKRYLP@skil.cz]
Sent: Friday, July 12, 2002 8:23 AM
To: kangoo@tiscali.fr; docbook-apps@lists.oasis-open.org
Subject: RE: DOCBOOK-APPS: From RTF to DocBook


Sebastien wrote...

> I am looking for a good tool able to convert from Doc/RTF
> to DocBook if possible. I investigated and found Majix,
> which converts to Simplified DocBook, but the conversion
> does not seem to be very good and no support has been
> given to the product since 1999. Therefore there is no
> RTF 1.6 support.
> 
> I found UpCast, which converts to an intermediary format,
> after which I have to go through XSLT transformations. The
> conversion suits me, but I would like to know if there is
> a good product (preferably Open Source) able to convert
> from RTF to DocBook without big losses of information.
> [...] 

I doubt that there is a really good tool that produces good
DocBook sources from general Doc/RTF documents. The problem
is that Doc/RTF is rather visual-markup oriented while
DocBook is very structural-markup oriented. The conversion
from visual to structural can only ever be a guess (unless
there are some very strict rules for the visual markup).

For that reason I think that there will always be a lot of
handwork when cleaning up the source (pick a good editor
with regular expressions).
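
As a rough illustration of that kind of regular-expression
cleanup, here is a small Python sketch. The class names
("Heading1", "Filename") are hypothetical; the real ones
depend on the styles used in the Word document, and each
rule is only as reliable as the visual markup was
consistent.

```python
import re

# Each rule pairs a pattern for one piece of visual markup with its
# structural replacement.  The class names are made up for illustration.
RULES = [
    # Paragraphs styled as headings become real headings.
    (re.compile(r'<p class="Heading1">(.*?)</p>', re.S), r'<h1>\1</h1>'),
    # Spans styled as file names become <code> elements.
    (re.compile(r'<span class="Filename">(.*?)</span>', re.S), r'<code>\1</code>'),
    # Paragraphs used only for vertical spacing are dropped.
    (re.compile(r'<p>(?:\s|&nbsp;)*</p>\n?'), ''),
]


def apply_rules(html):
    """Apply every cleanup rule in order and return the rewritten HTML."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return html
```

The same search-and-replace pairs can be run interactively
in an editor; keeping them in a script just makes the
cleanup repeatable when the source document changes.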

For the first transformation of Doc/RTF, I tried exporting
to HTML (directly from MS Word).  The HTML produced is
extremely ugly and really crippled.  But there is the "tidy"
utility (mentioned on the W3C main page, with its home at
http://tidy.sourceforge.net/).  Tidy is able to convert
the crippled HTML into excellent HTML, with CSS classes used
instead of all the <FONT ...> etc.  Then, using a good
editor, I would clean up the HTML, turning visual markup
into structural markup.  Then I would convert the HTML into
XML (DocBook).

A wild guess: if I remember correctly, xsltproc is able to
read HTML, so you could do some XSLT transformations of the
cleaned HTML (I have no experience with this).
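
A minimal sketch of how the two tools might be chained,
written in Python so that the command lines are explicit.
The file names and function names are placeholders, and the
flags are the ones documented for current releases of HTML
Tidy and xsltproc, so check "tidy -help" and "xsltproc"
(which prints its usage) on your own installation before
relying on them. The "--html" switch asks xsltproc to parse
its input as HTML, which is what the guess above refers to.

```python
import subprocess


def tidy_command(src, dst):
    """Build a tidy command line that turns exported HTML into XHTML."""
    # -asxhtml: write well-formed XHTML
    # -clean:   replace presentational tags such as <font> with CSS rules
    # -numeric: write numeric rather than named character entities
    return ["tidy", "-quiet", "-asxhtml", "-clean", "-numeric", "-o", dst, src]


def xsltproc_command(stylesheet, src, dst, html_input=False):
    """Build an xsltproc command line; html_input uses the HTML parser."""
    cmd = ["xsltproc"]
    if html_input:
        cmd.append("--html")
    cmd += ["-o", dst, stylesheet, src]
    return cmd


def run_pipeline(word_html, stylesheet, docbook_out, xhtml_tmp="tidied.xhtml"):
    """Run tidy, then apply the stylesheet to the tidied XHTML."""
    tidy = subprocess.run(tidy_command(word_html, xhtml_tmp))
    # tidy exits with 1 when it only had warnings and 2 when it hit errors.
    if tidy.returncode > 1:
        raise RuntimeError("tidy reported errors in " + word_html)
    subprocess.run(xsltproc_command(stylesheet, xhtml_tmp, docbook_out),
                   check=True)
```

Going through tidy first and feeding xsltproc well-formed
XHTML is the more conservative route; passing
html_input=True skips that step at the cost of depending on
how the HTML parser repairs the markup.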

Regards,
  Petr