Subject: [relax-ng] Java-style Unicode escapes
I have looked into how hard it would be to support Java-style Unicode escapes in a parser implementation based on JavaCC, and I think it is doable. In Java, \u is followed by exactly 4 hex digits. Given the XML character model, we would need 6 hex digits, but I would suggest that it would be better to avoid a fixed number of hex digits and instead use an explicit delimiter. In the regexp syntax of XML Schema Part 2 (following the Perl regexp syntax), there are two escape sequences that take a variable-length argument, \p and \P, and these use the syntax \p{arg} and \P{arg}. I think this is quite readable and would suggest we use the syntax \u{N}, where N is one or more hex digits.

As in Java, these escape sequences would be allowed anywhere and would be handled (conceptually at least) by a separate preprocessing phase that results in a sequence of Unicode characters, which is then parsed with respect to the grammar for the compact syntax. Note that the mapping of CR/LF and CR into LF will have to take place during or before this phase, since we wouldn't want \u{D} to be mapped into \u{A} within a string.

I would propose that only "\u{" be recognized as the start of the escape sequence. In other words, a "\" followed by something other than "u", or a "\u" followed by something other than "{", would not be an error but would be passed through unchanged by this phase. However, it would be an error if there were a "\u{" that was not followed by one or more hex digits and a "}".

It would be an error to use \u{N} if N is the code point of a character that is not allowed by XML, i.e. if XML says that &#xN; is an error, then \u{N} should also be an error. (Otherwise we lose translatability of the non-XML syntax to the XML syntax.)

Java allows multiple "u" characters following the "\". The idea is to allow a reversible translation into ASCII: the translation replaces every non-ASCII character by the \u equivalent, and adds an additional "u" to each existing \u sequence.
This seems like a useful capability, and I would propose that we keep it.

Java also has a feature that a "\u" is recognized as starting a Unicode escape only if the number of "\"s immediately preceding the "\u" is even. I think this only makes sense given Java's treatment of "\\" in string literals, which we do not have, so I would propose not to keep this feature.

Note that although we are already using \ to prevent a name being treated as a keyword, the only possibility of conflict is if somebody writes something like:

    element \u{ empty }

both unnecessarily quoting the name "u" and omitting whitespace between the "\u" and the "{". I don't think this is going to cause problems in practice, but if people are concerned by this double usage we could consider either using a different character for escaping keywords or using a different character to introduce the Unicode escape.

Note that this doesn't affect the syntax for getting a double/single-quote into a string delimited by double/single-quotes, which is to double the quote.

James
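
For illustration only, the preprocessing phase proposed above might be sketched in Java roughly as follows. This is not taken from any existing implementation: the class name UnicodeEscapes, the method names, and the choice of IllegalArgumentException for errors are all hypothetical. The sketch assumes that newline normalization runs first, that any number of "u" characters may precede the "{", that no even-backslash rule applies, and that "allowed by XML" means the Char production of XML 1.0.

```java
public final class UnicodeEscapes {
    private UnicodeEscapes() {}

    /** Expands \u{N} escapes after normalizing CR/LF and CR to LF. */
    public static String decode(String input) {
        // Normalize newlines first, so that a character produced by
        // \u{D} is not later rewritten to LF.
        String text = input.replace("\r\n", "\n").replace('\r', '\n');
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (text.charAt(i) == '\\') {
                int j = i + 1;
                // Assumption: one or more 'u' characters are accepted.
                while (j < text.length() && text.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && j < text.length() && text.charAt(j) == '{') {
                    int k = j + 1;
                    int codePoint = 0;
                    while (k < text.length() && hexValue(text.charAt(k)) >= 0) {
                        codePoint = codePoint * 16 + hexValue(text.charAt(k));
                        if (codePoint > 0x10FFFF) {
                            throw new IllegalArgumentException("code point out of range at offset " + i);
                        }
                        k++;
                    }
                    // Require at least one hex digit and a closing brace.
                    if (k == j + 1 || k >= text.length() || text.charAt(k) != '}') {
                        throw new IllegalArgumentException("malformed \\u{...} escape at offset " + i);
                    }
                    if (!isXmlChar(codePoint)) {
                        throw new IllegalArgumentException("character not allowed by XML at offset " + i);
                    }
                    out.appendCodePoint(codePoint);
                    i = k + 1;
                    continue;
                }
            }
            // Anything that does not start with "\u{" passes through unchanged.
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }

    /** Value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** The Char production of XML 1.0. */
    private static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Under these assumptions, the input \u{41} and the input \uu{41} both decode to "A", a "\" not followed by "u{" is copied through as-is, and the conflicting example above (a quoted name "u" immediately followed by "{ empty }") is reported as a malformed escape rather than silently accepted.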