docbook-apps message

Subject: Re: [docbook-apps] Preserving entities during SGML to XML transformation

From: Michael Smith <smith@xml-doc.org>
To: Bernd Groh <bgroh@redhat.com>
Date: Wed, 15 Jun 2005 21:46:11 +0900

Bernd Groh <bgroh@redhat.com> writes:

> "hiding" the internal entities only may indeed be a good work-around, 
> though I'd need to implement a way that will survive a sgml2xml 
> conversion.

I just took some time to test with both sgml2xml (James Clark's
original "sx" from the SP suite) and osx (OpenSP version of sx).

[Aside: Based on the limited testing I just did with it, osx seems
to me to be massively broken in a wide variety of ways. I
personally wouldn't trust anything to it.]

I found that both sgml2xml and osx are broken in their handling of
comments, which is what cloak was using to hide things. They don't
always preserve comments where they should, even when I use -xcomment.

So I just changed cloak so that it uses PIs to hide stuff instead.
Updated version (v1.3) is at:

  http://docbook.sourceforge.net/outgoing/cloak

It's also now showing up in ViewCVS:

  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/cloak/

Though there will be a lag before v1.3 shows up there.

Anyway, output from cloak should now survive an sgml2xml
conversion without you needing to do any additional work.
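
As a rough illustration of the idea (not the actual cloak code), hiding entity references in PIs and restoring them afterwards can be sketched with two sed filters. The PI target "x-entity" and the function names are made up for this example; the real cloak may use a different PI format, and this sketch cloaks every named entity reference rather than only the internal ones.

```shell
# Hypothetical sketch: hide named entity references inside processing
# instructions, then restore them. "x-entity" is an invented PI target.

# &name;  ->  <?x-entity name?>
cloak_entities() {
  sed 's/&\([A-Za-z_][A-Za-z0-9._-]*\);/<?x-entity \1?>/g'
}

# <?x-entity name?>  ->  &name;
uncloak_entities() {
  sed 's/<?x-entity \([A-Za-z_][A-Za-z0-9._-]*\)?>/\&\1;/g'
}

# Round trip with no converter in between:
printf '%s\n' 'See the &prodname; manual.' | cloak_entities | uncloak_entities
```

Numeric character references such as &#160; are left alone by these patterns, since "#" cannot start a name.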

cloak is just a simple STDIN/STDOUT filter, so you can run it like
this if you want:

  $ cat foo.xml | cloak | sgml2xml | cloak

On my system, I made a symlink to cloak named "uncloak" and I use
that for the uncloaking stage.

(If cloak is called with the name "uncloak", it does the
uncloaking. Otherwise, it decides whether to cloak or uncloak by
checking for a comment that it adds to the end of the file during
the cloaking stage.)
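
The mode selection described above can be sketched roughly as follows. This is only an assumption-laden illustration: the marker text and the function name are invented, and because it takes a file argument it sidesteps the buffering that a real STDIN filter would need in order to look at the end of its input.

```shell
# Hypothetical marker; the comment that cloak really appends is not shown here.
CLOAK_MARKER='<!-- cloaked -->'

# choose_mode NAME FILE
# Prints "uncloak" if NAME (the name the script was invoked as) is "uncloak",
# or if FILE already contains the marker; prints "cloak" otherwise.
choose_mode() {
  invoked_as=$(basename "$1")
  if [ "$invoked_as" = "uncloak" ]; then
    echo uncloak
  elif grep -qF -- "$CLOAK_MARKER" "$2"; then
    echo uncloak
  else
    echo cloak
  fi
}
```

Inside a script this would be called as choose_mode "$0" "$input_file", so a symlink made with "ln -s cloak uncloak" selects the uncloaking mode by name alone.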

> I guess I'll be doing that now though. To be honest though, 
> I'm really close at attempting to write my own sgml2xml converter, since 
> I can hardly stand the resulting format. Neither can some of the tools. 
> Some of the output needs to be post-processed, in order to be useable 
> for, for example, po2xml. Then again, that may just be a bug in po2xml? 

Yeah, probably. As far as I know, po2xml does not use a real XML
parser (e.g., expat, libxml2, xerces). It uses some kind of ad-hoc
parser instead. So it may choke on some things that are actually
well-formed XML but that it just can't handle correctly.

As far as I know, output from sgml2xml is well-formed XML.

> Is there any way to keep the format "as is", i.e. using the same 
> indentation and the same newlines at the same places, not putting, apart 
> from the nl-in-tag, everything in one (or a few) line(s)? Or is there 
> any other sgml to xml converter that would do that?

There is no other free-software sgml to xml converter that I know
of. And as far as I know, there is no way to prevent sgml2xml from
changing your indentation and doing that nl-in-tag thing it does.

Anyway, many (if not most) XML tools do not preserve newlines and
indenting outside of text nodes. As far as the XML spec is
concerned, they are not required to.

So I'd suggest that you might want to add a post-processing stage
to fix -- or at least "normalize" -- indenting and wrapping in
your output.

The tools I use for that are xmllint (with --format) and/or Paul
DuBois's xmlformat:

  http://xmlhack.com/read.php?item=2154

So I would run the following to get "normalized" output of a
transform through sgml2xml:

  $ cat foo.xml | cloak | sgml2xml | uncloak \
    | xmllint --format - | xmlformat

xmlformat uses a config file that contains rules for how to handle
specific elements. I have posted the DocBook-specific
xmlformat.conf config file I use here:

  http://docbook.sourceforge.net/outgoing/xmlformat.conf

I also checked it into the DocBook project contrib/tools area:

  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/
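
To give a sense of what such rules look like, here is a small sketch that writes a minimal config file. It is not the posted xmlformat.conf; the element groupings and the wrap length are assumptions chosen only to show the syntax as I understand it (element names at the start of a line, indented options underneath).

```shell
# Write a small, illustrative xmlformat config to a temporary directory.
# The element choices below are assumptions, not the posted xmlformat.conf.
conf_dir=$(mktemp -d)
conf="$conf_dir/xmlformat.conf"
cat > "$conf" <<'EOF'
# Inline elements stay inside running text.
emphasis literal filename command
  format = inline

# Paragraphs: collapse whitespace and rewrap.
para
  format = block
  normalize = yes
  wrap-length = 72

# Whitespace-sensitive elements are left exactly as they are.
programlisting screen literallayout
  format = verbatim
EOF
echo "$conf"
```

xmlformat can then be pointed at the file with its --config-file (-f) option, for example: xmlformat -f "$conf" foo.xml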

One possible limitation in xmlformat is that it uses a funky
REX-based[1] parser instead of a "real" XML parser (see comment
about po2xml above).

  [1] http://www.cs.sfu.ca/~cameron/REX.html

But despite that, I have never had problems with it choking on any
documents or doing something unexpected to them. For me at least,
it always works exactly as I'd expect. So I highly recommend at
least trying it. It's the only tool I have ever found that does a
complete & correct job of normalizing/pretty-printing XML content.

   --Mike

