
Subject: [relax-ng] Java-style Unicode escapes

From: James Clark <jjc@jclark.com>
To: RELAX NG Mailing List <relax-ng@lists.oasis-open.org>
Date: Thu, 04 Apr 2002 17:36:59 +0700

I have looked into how hard it would be to support Java-style Unicode
escapes in a parser implementation based on JavaCC, and I think it is
doable.

In Java, \u is followed by exactly 4 hex digits.  Given the XML
character model, which goes up to #x10FFFF, we would need 6 hex digits,
but I would suggest that it would be better to avoid a fixed number of
hex digits and instead use an explicit delimiter.  In the regexp syntax
of XML Schema Part 2 (following the Perl regexp syntax), there are two
escape sequences that take a variable-length argument, \p and \P, and
these use the syntax \p{arg} and \P{arg}.  I think this is quite
readable and would suggest we use the syntax \u{N}, where N is one or
more hex digits.

As in Java, these escape sequences would be allowed anywhere and would
be handled (conceptually at least) by a separate preprocessing phase
that results in a sequence of Unicode characters which is then parsed
with respect to the grammar for the compact syntax.  Note that the
mapping of CR/LF and CR into LF will have to take place during or
before this phase, since we wouldn't want \u{D} to be mapped into
\u{A} within a string.
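
To illustrate the ordering constraint, here is a minimal Java sketch of
the newline mapping run before any escape expansion.  The class and
method names are hypothetical, and it assumes the whole input is
already available as a String.

```java
public final class NewlineNormalizer {
    private NewlineNormalizer() {}

    /** Maps each CR LF pair and each lone CR to a single LF. */
    public static String normalize(String text) {
        // Replace pairs first so that a CR LF does not become two LFs.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        // Literal line ends, followed by the six characters of an
        // escape that names a carriage return.
        String raw = "a\r\nb\rc \\u{D}";
        String normalized = normalize(raw);
        // The literal CRs are gone; the escape text is untouched, so a
        // later expansion phase can still turn it into a real CR.
        System.out.println(normalized.equals("a\nb\nc \\u{D}")); // prints "true"
    }
}
```

Because the escape is still plain text at this point, the carriage
return it denotes is never seen by the newline mapping.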

I would propose that only "\u{" be recognized as the start of the
escape sequence.  In other words, a "\" followed by something other
than "u" or a "\u" followed by something other than "{" would not be
an error but would be passed through unchanged by this phase.
However, it would be an error if a "\u{" were not followed by one or
more hex digits and a "}".
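
The recognition rule above could be sketched in Java roughly as
follows.  This is only an illustration, not the JavaCC implementation:
the class name, the method names and the choice of
IllegalArgumentException are assumptions, as is accepting hex digits in
either case.  It checks only the Unicode upper bound 10FFFF, not
whether XML allows the character.

```java
public final class UnicodeEscapes {
    private UnicodeEscapes() {}

    /** Returns the value of an ASCII hex digit, or -1 for anything else. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Expands each escape; every other character passes through. */
    public static String expand(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // Only backslash, "u", "{" in a row starts an escape.
            if (!in.startsWith("\\u{", i)) {
                out.append(in.charAt(i));
                i++;
                continue;
            }
            int j = i + 3;
            int value = 0;
            int digits = 0;
            while (j < in.length() && hexValue(in.charAt(j)) >= 0) {
                value = value * 16 + hexValue(in.charAt(j));
                if (value > 0x10FFFF) {
                    throw new IllegalArgumentException(
                        "code point too large at offset " + i);
                }
                digits++;
                j++;
            }
            // The opener must be followed by 1+ hex digits and "}".
            if (digits == 0 || j == in.length() || in.charAt(j) != '}') {
                throw new IllegalArgumentException(
                    "malformed escape at offset " + i);
            }
            out.appendCodePoint(value);
            i = j + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A backslash followed by "y" is not an escape and is kept.
        System.out.println(expand("x\\u{41}\\y").equals("xA\\y")); // prints "true"
    }
}
```

The escapes appear only inside Java string literals with a doubled
backslash, since javac applies its own \uXXXX processing to the source
text, comments included.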

It would be an error to use \u{N} if N is the code of a character that
is not allowed by XML, i.e. if XML says that &#xN; is an error, then
\u{N} should also be an error.  (Otherwise we lose translatability of
the non-XML syntax to the XML syntax.)
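
As a sketch of that check, assuming "allowed by XML" means the Char
production of XML 1.0 (which is what constrains character references),
a validity test might look like this in Java.  The class and method
names are hypothetical.

```java
public final class XmlChars {
    private XmlChars() {}

    /**
     * True if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
     * | [#x10000-#x10FFFF]
     */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0xD));    // prints "true"
        System.out.println(isXmlChar(0x0));    // prints "false"
        System.out.println(isXmlChar(0xD800)); // prints "false" (surrogate)
        System.out.println(isXmlChar(0xFFFE)); // prints "false"
    }
}
```

An escape processor would call such a test on N before emitting the
character and report an error when it returns false.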

Java allows multiple "u" characters following the "\". The idea is to
allow a reversible translation into ASCII: the translation replaces
every non-ASCII character by the \u equivalent, and adds an additional
"u" to each existing \u sequence.  This seems like a useful
capability, and I would propose that we keep it.
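
A rough Java sketch of such a round trip, adapted to the brace syntax,
is given below.  It is an illustration under assumptions: the class and
method names are made up, an "existing sequence" is taken to be a
backslash, one or more "u" and a "{", and the reverse direction decodes
only single-"u" escapes with 1 to 6 hex digits, leaving anything else
as it is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class AsciiRoundTrip {
    // Backslash, one or more "u", then an opening brace.
    private static final Pattern OPENER = Pattern.compile("\\\\(u+)\\{");
    // Either a multi-"u" opener, or a complete single-"u" escape.
    private static final Pattern ENCODED =
        Pattern.compile("\\\\u(u+)\\{|\\\\u\\{([0-9A-Fa-f]{1,6})\\}");

    private AsciiRoundTrip() {}

    public static String toAscii(String text) {
        // Step 1: add one "u" to every existing escape opener.
        String marked = OPENER.matcher(text).replaceAll("\\\\u$1{");
        // Step 2: replace every non-ASCII code point by an escape.
        StringBuilder out = new StringBuilder();
        marked.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("\\u{")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append('}');
            }
        });
        return out.toString();
    }

    public static String fromAscii(String ascii) {
        Matcher m = ENCODED.matcher(ascii);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(ascii, last, m.start());
            if (m.group(1) != null) {
                // Multi-"u" opener: drop one "u", keep the rest.
                out.append('\\').append(m.group(1)).append('{');
            } else {
                // Single-"u" escape: emit the character itself.
                // appendCodePoint rejects values above 10FFFF.
                out.appendCodePoint(Integer.parseInt(m.group(2), 16));
            }
            last = m.end();
        }
        out.append(ascii, last, ascii.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // An e-acute followed by text that already contains an escape.
        String original = "caf\u00E9 \\u{41}";
        String ascii = toAscii(original);
        System.out.println(ascii.equals("caf\\u{E9} \\uu{41}")); // prints "true"
        System.out.println(fromAscii(ascii).equals(original));   // prints "true"
    }
}
```

The extra "u" is added before the non-ASCII characters are replaced, so
that the newly generated escapes are not marked as pre-existing ones.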

Java also has a feature that a "\u" is recognized as starting a
Unicode escape only if the number of "\"s immediately preceding the
"\u" is even.  I think this only makes sense given Java's treatment of
"\\" in string literals, which we do not have, so I would propose not
to keep this feature.

Note that although we are already using \ to prevent a name being
treated as a keyword, the only possibility of conflict is if somebody
writes something like:

  element \u{ empty }

both unnecessarily quoting the name "u" and omitting whitespace
between the "\u" and the "{".  I don't think this is going to cause
problems in practice, but if people are concerned by this double usage
we could consider either using a different character for escaping
keywords or using a different character to introduce the Unicode
escape.

Note that this doesn't affect the syntax for getting a
double/single-quote into a string delimited by double/single-quotes,
which is to double the quote.

James
