Subject: Re: [docbook-apps] Preserving entities during SGML to XML transformation
Michael Smith wrote:

>Bernd Groh <bgroh@redhat.com> writes:
>
>>"hiding" the internal entities only may indeed be a good work-around,
>>though I'd need to implement a way that will survive a sgml2xml
>>conversion.
>
>I just took some time to test with both sgml2xml (James Clark's
>original "sx" from the SP suite) and osx (OpenSP version of sx).
>
>[Aside: Based on the limited testing I just did with it, osx seems
>to me to be massively broken in a wide variety of ways. I
>personally wouldn't trust anything to it.]
>
>I found that both sgml2xml and osx are broken in their handling of
>comments, which is what cloak was using to hide things. They don't
>always preserve comments where they should, even when I use -xcomment.
>
>So I just changed cloak so that it uses PIs to hide stuff instead.
>Updated version (v1.3) is at:
>
>  http://docbook.sourceforge.net/outgoing/cloak
>
>It's also now showing up in ViewCVS:
>
>  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/cloak/
>
>Though there will be a lag before v1.3 shows up there.
>
>Anyway, output from cloak should now survive an sgml2xml
>conversion without you needing to do any additional work.
>
>cloak is just a simple STDIN/STDOUT filter, so you can run it like
>this if you want:
>
>  $ cat foo.xml | cloak | sgml2xml | cloak
>
>On my system, I made a symlink to cloak named "uncloak" and I use
>that for the uncloaking stage.
>
>(If cloak is called with the name "uncloak", it will do
>uncloaking. Otherwise, the way that cloak decides to do uncloaking
>instead of cloaking is it checks for a comment that it adds to the
>end of the file during the cloaking stage.)

Thanks. I quickly wrote a Perl script yesterday that does its own
cloaking of all non-external entities. I found that this avoids running
into certain "character data not allowed here" errors. Now all remaining
errors are undefined references.

>>I guess I'll be doing that now though.
>>To be honest though,
>>I'm really close to attempting to write my own sgml2xml converter, since
>>I can hardly stand the resulting format. Neither can some of the tools.
>>Some of the output needs to be post-processed in order to be usable
>>by, for example, po2xml. Then again, that may just be a bug in po2xml?
>
>Yeah, probably. As far as I know, po2xml does not use a real XML
>parser (e.g., expat, libxml2, xerces). It uses some kind of ad-hoc
>parser instead. So it may choke on some things that are actually
>XML compliant but that it just can't handle correctly.

Seems like it. Some really good ones:

<tag .......></tag> doesn't seem to work; only <tag ......./> will do.

Entities that reference files, e.g. &includedfile;, need to be on a
separate line. <para>&includedfile; will cause an error.

>As far as I know, output from sgml2xml is XML compliant.
>
>>Is there any way to keep the format "as is", i.e. using the same
>>indentation and the same newlines at the same places, not putting, apart
>>from the nl-in-tag, everything in one (or a few) line(s)? Or is there
>>any other sgml to xml converter that would do that?
>
>There is no other free-software sgml to xml converter that I know
>of. And as far as I know, there is no way to prevent sgml2xml from
>changing your indentation and doing that nl-in-tag thing it does.
>
>Anyway, many (if not most) XML tools do not preserve newlines and
>indenting outside of text nodes. As far as the XML spec is
>concerned, they are not required to.
>
>So I'd suggest that you might want to add a post-processing stage
>to fix -- or at least "normalize" -- indenting and wrapping in
>your output.
>
>The tools I use for that are xmllint (with --format) and/or Paul
>DuBois's xmlformat:
>
>  http://xmlhack.com/read.php?item=2154
>
>So I would run the following to get "normalized" output of a
>transform through sgml2xml:
>
>  $ cat foo.xml | cloak | sgml2xml | uncloak \
>    | xmllint --format - | xmlformat
>
>xmlformat uses a config file that contains rules for how to handle
>specific elements. I have posted the DocBook-specific
>xmlformat.conf config file I use here:
>
>  http://docbook.sourceforge.net/outgoing/xmlformat.conf
>
>I also checked it into the DocBook project contrib/tools area:
>
>  http://cvs.sourceforge.net/viewcvs.py/docbook/contrib/tools/
>
>One possible limitation in xmlformat is that it uses a funky
>REX-based[1] parser instead of a "real" XML parser (see comment
>about po2xml above).
>
>  [1] http://www.cs.sfu.ca/~cameron/REX.html
>
>But despite that, I have never had problems with it choking on any
>documents or doing something unexpected to them. For me at least,
>it always works exactly as I'd expect. So I highly recommend at
>least trying it. It's the only tool I have ever found that does a
>complete & correct job of normalizing/pretty-printing XML content.

Thanks a lot for all that info!!

Bernd
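
For anyone following along, here is a minimal sketch of the PI-based
cloaking idea discussed above. It is not the actual cloak tool or the
Perl script mentioned earlier: the PI target name "cloaked-entity", the
function names, and the "keep" parameter are all made up for
illustration, and the real cloak marker format is not shown in this
thread. It simply swaps named entity references for processing
instructions before conversion and swaps them back afterwards.

```python
import re

# Hypothetical PI target; the real cloak tool's marker format is an unknown here.
PI_TARGET = "cloaked-entity"

# Named entity references such as &product; (numeric references like &#38; do not match).
ENTITY_RE = re.compile(r"&([A-Za-z_][A-Za-z0-9._-]*);")
PI_RE = re.compile(
    r"<\?" + re.escape(PI_TARGET) + r" ([A-Za-z_][A-Za-z0-9._-]*)\?>"
)


def cloak(text, keep=()):
    """Replace named entity references with PIs, except names listed in keep."""
    def hide(match):
        name = match.group(1)
        if name in keep:
            # Leave this reference alone (e.g. an external entity).
            return match.group(0)
        return "<?%s %s?>" % (PI_TARGET, name)
    return ENTITY_RE.sub(hide, text)


def uncloak(text):
    """Turn the PIs written by cloak() back into entity references."""
    return PI_RE.sub(lambda m: "&%s;" % m.group(1), text)


if __name__ == "__main__":
    sample = "<para>&product; is described in &includedfile;</para>"
    hidden = cloak(sample, keep={"includedfile"})
    print(hidden)
    print(uncloak(hidden))
```

Note that this is a plain regex pass, so it would also rewrite entity
references inside attribute values, where a PI is not allowed. A real
tool would need to skip those or handle them some other way.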