docbook-apps message

Subject: Re: [docbook-apps] Preserving entities during SGML to XML transformation

From: Bernd Groh <bgroh@redhat.com>
To: Michael Smith <smith@xml-doc.org>
Date: Thu, 16 Jun 2005 08:23:19 +1000

Michael Smith wrote:

>Bernd Groh <bgroh@redhat.com> writes:
>
>>"Hiding" only the internal entities may indeed be a good workaround,
>>though I'd need to implement it in a way that survives an sgml2xml
>>conversion.
>
>I just took some time to test with both sgml2xml (James Clark's
>original "sx" from the SP suite) and osx (OpenSP version of sx).
>
>[Aside: Based on the limited testing I just did with it, osx seems
>to me to be massively broken in a wide variety of ways. I
>personally wouldn't trust anything to it.]
>
>I found that both sgml2xml and osx are broken in their handling of
>comments, which is what cloak was using to hide things. They don't
>always preserve comments where they should, even when I use -xcomment.
>
>So I just changed cloak so that it uses PIs to hide stuff instead.
>Updated version (v1.3) is at:
>
>  http://docbook.sourceforge.net/outgoing/cloak
>
>It's also now showing up in ViewCVS:
>
>  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/cloak/
>
>Though there will be a lag before v1.3 shows up there.
>
>Anyway, output from cloak should now survive an sgml2xml
>conversion without you needing to do any additional work.
>
>cloak is just a simple STDIN/STDOUT filter, so you can run it like
>this if you want:
>
>  $ cat foo.xml | cloak | sgml2xml | cloak
>
>On my system, I made a symlink to cloak named "uncloak" and I use
>that for the uncloaking stage.
>
>(If cloak is called with the name "uncloak", it will do
>uncloaking. Otherwise, the way that cloak decides to do uncloaking
>instead of cloaking is it checks for a comment that it adds to the
>end of the file during the cloaking stage.)

Thanks. I quickly wrote a Perl script yesterday that does its own cloaking
of all non-external entities. I found that this avoids certain
"character data not allowed here" errors. Now all remaining errors are
undefined references.
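
As an illustration of the approach (not the actual Perl script), here is a
minimal Python sketch that hides references to non-external entities as
processing instructions and restores them afterwards. The PI target name
"cloaked-entity" and the function names are made up for the example. It
assumes that external entities can be recognized by a SYSTEM or PUBLIC
keyword in their declaration, and that the PIs come out of sgml2xml
unchanged.

```python
import re

# Hypothetical PI target; the real cloak tool may use a different marker.
PI_TARGET = "cloaked-entity"

NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
# External general entities are declared with SYSTEM or PUBLIC.
EXTERNAL_DECL = re.compile(
    r"<!ENTITY\s+(" + NAME + r")\s+(?:SYSTEM|PUBLIC)\b", re.IGNORECASE)
# General entity references such as &prod; (character references like
# &#169; do not match because '#' cannot start a name).
ENTITY_REF = re.compile(r"&(" + NAME + r");")
CLOAKED_REF = re.compile(r"<\?" + PI_TARGET + r"\s+(" + NAME + r")\?>")


def external_entity_names(text):
    """Collect the names of entities declared as external."""
    return set(EXTERNAL_DECL.findall(text))


def cloak(text):
    """Replace references to non-external entities with PIs."""
    keep = external_entity_names(text)

    def hide(match):
        name = match.group(1)
        if name in keep:
            return match.group(0)  # leave external references alone
        return "<?%s %s?>" % (PI_TARGET, name)

    return ENTITY_REF.sub(hide, text)


def uncloak(text):
    """Turn the PIs written by cloak() back into entity references."""
    return CLOAKED_REF.sub(lambda m: "&%s;" % m.group(1), text)
```

Note that this sketch also rewrites references that appear inside entity
declarations in the internal subset, which a real script would probably
want to skip.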

>>I guess I'll be doing that now, though. To be honest,
>>I'm really close to attempting to write my own sgml2xml converter, since
>>I can hardly stand the resulting format. Neither can some of the tools.
>>Some of the output needs to be post-processed in order to be usable
>>by, for example, po2xml. Then again, that may just be a bug in po2xml?
>
>Yeah, probably. As far as I know, po2xml does not use a real XML
>parser (e.g., expat, libxml2, xerces). It uses some kind of ad-hoc
>parser instead. So it may choke on some things that are actually
>XML compliant but that it just can't handle correctly.

Seems like it. Some really good ones:

- <tag .......></tag> doesn't seem to work; only <tag ......./> will do.
- Entities that reference files, e.g. &includedfile;, need to be on a
  separate line; <para>&includedfile; will cause an error.
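
A post-processing step to work around the two po2xml quirks above might
look roughly like the following Python sketch. The function names are
invented for the example, the list of file-referencing entity names has to
be supplied by the caller, and the regular expressions only handle the
simple cases shown here (for instance, no ">" inside attribute values).

```python
import re

NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"
# An element with nothing between its start and end tag: <tag ...></tag>
EMPTY_PAIR = re.compile(r"<(" + NAME + r")(\s[^<>]*?)?\s*></\1\s*>")


def collapse_empty_elements(text):
    """Rewrite <tag ...></tag> as <tag .../>."""
    return EMPTY_PAIR.sub(
        lambda m: "<%s%s/>" % (m.group(1), m.group(2) or ""), text)


def isolate_entity_refs(text, names):
    """Put each reference to one of the named entities on its own line."""
    if not names:
        return text
    alternation = "|".join(re.escape(n) for n in sorted(names))
    # Swallow surrounding blanks and at most one newline on each side, so
    # that running the function twice gives the same result.
    pattern = re.compile(
        r"[ \t]*\n?[ \t]*(&(?:%s);)[ \t]*\n?" % alternation)
    return pattern.sub(lambda m: "\n" + m.group(1) + "\n", text)
```

For example, isolate_entity_refs("<para>&includedfile;</para>",
["includedfile"]) moves the reference onto a line of its own between the
<para> and </para> tags.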

>As far as I know, output from sgml2xml is XML compliant.
>
>>Is there any way to keep the format "as is", i.e. with the same
>>indentation and the same newlines in the same places, rather than
>>putting everything (apart from the nl-in-tag) on one or a few lines?
>>Or is there any other SGML-to-XML converter that would do that?
>
>There is no other free-software sgml to xml converter that I know
>of. And as far as I know, there is no way to prevent sgml2xml from
>changing your indentation and doing that nl-in-tag thing it does.
>
>Anyway, many (if not most) XML tools do not preserve newlines and
>indenting outside of text nodes. As far as the XML spec is
>concerned, they are not required to.
>
>So I'd suggest that you might want to add a post-processing stage
>to fix -- or at least "normalize" -- indenting and wrapping in
>your output.
>
>The tools I use for that are xmllint (with --format) and/or Paul
>DuBois's xmlformat:
>
>  http://xmlhack.com/read.php?item=2154
>
>So I would run the following to get "normalized" output of a
>transform through sgml2xml:
>
>  $ cat foo.xml | cloak | sgml2xml | uncloak \
>    | xmllint --format - | xmlformat
>
>xmlformat uses a config file that contains rules for how to handle
>specific elements. I have posted the DocBook-specific
>xmlformat.conf config file I use here:
>
>  http://docbook.sourceforge.net/outgoing/xmlformat.conf
>
>I also checked it into the DocBook project contrib/tools area:
>
>  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/
>
>One possible limitation in xmlformat is that it uses a funky
>REX-based[1] parser instead of a "real" XML parser (see comment
>about po2xml above).
>
>  [1] http://www.cs.sfu.ca/~cameron/REX.html
>
>But despite that, I have never had problems with it choking on any
>documents or doing something unexpected to them. For me at least,
>it always works exactly as I'd expect. So I highly recommend at
>least trying it. It's the only tool I have ever found that does a
>complete & correct job of normalizing/pretty-printing XML content.

Thanks a lot for all that info!!
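
For scripting it, the normalization pipeline quoted above could be driven
from Python along these lines. This is only a sketch: run_pipeline and
NORMALIZE_STAGES are names invented here, all five tools are assumed to be
on the PATH, and xmllint is given "-" so that it reads standard input.

```python
import subprocess


def run_pipeline(stages, data, check=True):
    """Feed data (bytes) through each command in turn, like a shell pipe.

    With check=True a non-zero exit status raises CalledProcessError;
    pass check=False if a stage reports recoverable errors that way.
    """
    for argv in stages:
        result = subprocess.run(
            argv, input=data, stdout=subprocess.PIPE, check=check)
        data = result.stdout
    return data


# Stages mirroring: cloak | sgml2xml | uncloak | xmllint --format - | xmlformat
NORMALIZE_STAGES = [
    ["cloak"],
    ["sgml2xml"],
    ["uncloak"],
    ["xmllint", "--format", "-"],
    ["xmlformat"],
]
```

Usage would be something like
run_pipeline(NORMALIZE_STAGES, open("foo.xml", "rb").read()).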

Bernd
