Subject: Re: [docbook-apps] Preserving entities during SGML to XML transformation
Bernd Groh <bgroh@redhat.com> writes:

> "hiding" the internal entities only may indeed be a good work-around,
> though I'd need to implement a way that will survive a sgml2xml
> conversion.

I just took some time to test with both sgml2xml (James Clark's
original "sx" from the SP suite) and osx (the OpenSP version of sx).

[Aside: Based on the limited testing I just did with it, osx seems to
me to be massively broken in a wide variety of ways. I personally
wouldn't trust anything to it.]

I found that both sgml2xml and osx are broken in their handling of
comments, which is what cloak was using to hide things. They don't
always preserve comments where they should, even when I use -xcomment.

So I just changed cloak so that it uses PIs to hide stuff instead.
Updated version (v1.3) is at:

  http://docbook.sourceforge.net/outgoing/cloak

It's also now showing up in ViewCVS:

  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/cloak/

Though there will be a lag before v1.3 shows up there.

Anyway, output from cloak should now survive an sgml2xml conversion
without you needing to do any additional work.

cloak is just a simple STDIN/STDOUT filter, so you can run it like
this if you want:

  $ cat foo.xml | cloak | sgml2xml | cloak

On my system, I made a symlink to cloak named "uncloak" and I use that
for the uncloaking stage. (If cloak is called with the name "uncloak",
it will do uncloaking. Otherwise, cloak decides whether to uncloak
instead of cloak by checking for a comment that it adds to the end of
the file during the cloaking stage.)

> I guess I'll be doing that now though. To be honest though,
> I'm really close at attempting to write my own sgml2xml converter, since
> I can hardly stand the resulting format. Neither can some of the tools.
> Some of the output needs to be post-processed, in order to be useable
> for, for example, po2xml. Then again, that may just be a bug in po2xml?

Yeah, probably.
As far as I know, po2xml does not use a real XML parser
(e.g., expat, libxml2, xerces). It uses some kind of ad-hoc parser
instead. So it may choke on some things that are actually XML
compliant but that it just can't handle correctly. As far as I know,
output from sgml2xml is XML compliant.

> Is there any way to keep the format "as is", i.e. using the same
> indentation and the same newlines at the same places, not putting, apart
> from the nl-in-tag, everything in one (or a few) line(s)? Or is there
> any other sgml to xml converter that would do that?

There is no other free-software sgml to xml converter that I know of.
And as far as I know, there is no way to prevent sgml2xml from
changing your indentation and doing that nl-in-tag thing it does.

Anyway, many (if not most) XML tools do not preserve newlines and
indenting outside of text nodes. As far as the XML spec is concerned,
they are not required to.

So I'd suggest that you might want to add a post-processing stage to
fix, or at least "normalize", indenting and wrapping in your output.
The tools I use for that are xmllint (with --format) and/or Paul
DuBois's xmlformat:

  http://xmlhack.com/read.php?item=2154

So I would run the following to get "normalized" output of a transform
through sgml2xml:

  $ cat foo.xml | cloak | sgml2xml | uncloak \
      | xmllint --format | xmlformat

xmlformat uses a config file that contains rules for how to handle
specific elements. I have posted the DocBook-specific xmlformat.conf
config file I use here:

  http://docbook.sourceforge.net/outgoing/xmlformat.conf

I also checked it into the DocBook project contrib/tools area:

  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/

One possible limitation in xmlformat is that it uses a funky
REX-based[1] parser instead of a "real" XML parser (see comment about
po2xml above).

  [1] http://www.cs.sfu.ca/~cameron/REX.html

But despite that, I have never had problems with it choking on any
documents or doing something unexpected to them.
For me at least, it always works exactly as I'd expect. So I highly
recommend at least trying it. It's the only tool I have ever found
that does a complete & correct job of normalizing/pretty-printing XML
content.

  --Mike
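
For anyone who wants to see the general idea of hiding entity
references in PIs and restoring them afterwards, here is a rough
sketch in plain shell and sed. It is NOT the actual cloak script: the
function names (cloak_entities, uncloak_entities) and the PI target
"cloak-entity" are made up for illustration, and the format the real
tool writes is not shown in this message. It also uses XML-style
"?>" PI syntax, which may not be what an SGML parser expects, so treat
it only as a picture of the round trip.

```shell
#!/bin/sh
# Hypothetical sketch only; not the real cloak script.
# The PI target "cloak-entity" is an assumption made for illustration.

# Replace each named general entity reference (&name;) with a
# processing instruction so a converter cannot expand it.
# Numeric character references such as &#160; are left untouched,
# because "#" cannot start the name matched here.
cloak_entities() {
  sed -E 's/&([A-Za-z_][A-Za-z0-9._-]*);/<?cloak-entity \1?>/g'
}

# Reverse step: turn each placeholder PI back into &name;.
# "\&" in the sed replacement is a literal ampersand.
uncloak_entities() {
  sed -E 's/<\?cloak-entity ([A-Za-z_][A-Za-z0-9._-]*)\?>/\&\1;/g'
}
```

With both functions sourced into a shell, something like
"cloak_entities < foo.xml | some-converter | uncloak_entities" would
mirror the cloak | sgml2xml | uncloak pipeline described above, where
"some-converter" stands in for whatever tool would otherwise expand
the entities.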