Subject: DOCBOOK-APPS: Character encoding problems in text files included with Saxon extensions
Hi,

newer releases of the XSL stylesheets contain extensions for Saxon which can include external text files, generate callouts, and so on. I make heavy use of the external file inclusion mechanism, because my documents contain a lot of examples in XML, XSLT, and other languages which are modified or updated quite often. The inclusion feature has been available in the DSSSL stylesheets for a long time; before Norm published the Saxon extensions, I used my own extension for XT.

When I switched to the Saxon extension, I found one problem with the current implementation. The file reading extension (class Text.java) reads files as UTF-8, because it uses DataInputStream. If your included files are in UTF-8 or ASCII, everything works fine. But I must use accented characters in my documents, and it is more convenient for me to use single-byte encodings like iso-8859-2 and windows-1250 rather than UTF-8. The current implementation interprets non-ASCII characters incorrectly, because their byte values differ between single-byte encodings and UTF-8.

From my point of view, most users store their files in the system default encoding, so it would be more appropriate to assume that included files are in this system encoding rather than in UTF-8. I would like to know in what encoding most DocBook users store their externally included files. If the majority of them use the system encoding (probably ASCII or ISO Latin 1 for most English-speaking authors) rather than UTF-8, it would be useful to use InputStreamReader instead of DataInputStream. InputStreamReader automatically converts the content of a file from the system encoding to Java Unicode characters.

In addition to defaulting to the system encoding, we could provide some mechanism for specifying the encoding of an included file. InputStreamReader can convert files from many encodings, so adding some attribute, notation, or parameter to the DocBook source would be quite easy. E.g.
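To make the idea concrete, here is a minimal sketch of the suggested approach. This is not Norm's actual Text.java; the class name, method names, and the ";charset=" fileref convention are illustrative assumptions. The point is simply that InputStreamReader takes an explicit encoding name and decodes the bytes to Java's internal Unicode characters, while falling back to the platform default encoding when no charset is given.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Illustrative sketch only -- not the real Saxon extension class.
public class TextSketch {

    // Split "example.java;charset=iso-8859-5" into { path, charset }.
    // Without a ";charset=" suffix, fall back to the platform
    // default encoding, as proposed above.
    public static String[] splitFileref(String fileref) {
        int idx = fileref.indexOf(";charset=");
        if (idx < 0) {
            return new String[] { fileref, System.getProperty("file.encoding") };
        }
        return new String[] {
            fileref.substring(0, idx),
            fileref.substring(idx + ";charset=".length())
        };
    }

    // Read the file, decoding bytes with the requested encoding
    // instead of assuming UTF-8.
    public static String read(String fileref) throws IOException {
        String[] parts = splitFileref(fileref);
        Reader reader = new InputStreamReader(new FileInputStream(parts[0]), parts[1]);
        try {
            StringBuffer sb = new StringBuffer();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) > 0) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }
}
```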
  <inlinegraphic format="linespecific" fileref="example_with_russian_comments.java;charset=iso-8859-5"/>

or

  <inlinegraphic format="linespecific" fileref="example_with_russian_comments.java" role="charset=iso-8859-5"/>

or

  <inlinegraphic format="linespecific;charset=iso-8859-5" fileref="example_with_russian_comments.java"/>

For now, I'm using a modified version of Norm's extension. Of course, it would be more convenient for me to use a standard version which can deal with files in encodings other than UTF-8.

So what is your opinion?

Jirka

-----------------------------------------------------------------
Jirka Kosek	e-mail: jirka@kosek.cz
http://www.kosek.cz