
Subject: Spec Clarification Issue: Characters Allowed in Key and Key Scope Names

From: Eliot Kimber <ekimber@contrext.com>
To: dita <dita@lists.oasis-open.org>
Date: Fri, 26 Feb 2016 08:56:52 -0600

Jarno Elovirta has raised the question of what specific characters are
actually allowed by the DITA specification for key names (and, by
extension, key scope names).

The DITA 1.2 specification says:

* Key names consist of characters that are legal in a URI. The case of key
names is significant.
* The following characters are prohibited in key names: "{", "}", "[",
"]", "/", "#", "?", and whitespace characters.
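
For illustration, the explicit prohibition above could be checked with
something like the following Python sketch. The function name is made up
for this example, and using str.isspace() for "whitespace characters" is
an assumption (it covers more than the XML whitespace characters); the
spec does not define either.

```python
# Characters the DITA 1.2 text explicitly prohibits in key names.
PROHIBITED = set("{}[]/#?")


def has_prohibited_char(key_name: str) -> bool:
    """Return True if key_name contains a prohibited character.

    Whitespace is detected with str.isspace(), which is an assumption:
    the spec only says "whitespace characters".
    """
    return any(ch in PROHIBITED or ch.isspace() for ch in key_name)


if __name__ == "__main__":
    for name in ("product-name", "a/b", "two words", "caf\u00e9"):
        print(repr(name), has_prohibited_char(name))
```

Note that this only covers the second bullet. It says nothing about
which characters count as "legal in a URI".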



This statement is unchanged in DITA 1.3 (but moved to the reference entry
for the @keys attribute).


The problem here is that "characters that are legal in a URI" is not as
precise as perhaps we thought it was.

In particular, by "legal" do we mean characters that are allowed in the
URI *string* before the URI is processed to resolve any escaped non-ASCII
characters, or do we mean any character that may be used in a URI,
including characters that must be escaped in the ASCII encoding of a URI?
I suspect we intended the latter meaning, but Jarno has interpreted it as
the former, more restrictive meaning.
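
To make the two readings concrete, here is a minimal Python sketch. The
function names are made up for illustration. The "strict" set is taken to
be the RFC 3986 reserved and unreserved characters, which is my
assumption about what the restrictive reading amounts to, and
percent-encoded triplets are not modelled.

```python
import string

# RFC 3986 character sets (ASCII only).
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@!$&'()*+,;=")

# Characters the DITA text explicitly prohibits in key names.
DITA_PROHIBITED = set("{}[]/#?")


def legal_strict(key: str) -> bool:
    """Restrictive reading: only characters that may appear literally
    in a URI string, minus the DITA-prohibited ones.
    Percent-encoded triplets ("%C3%A9") are not modelled here."""
    allowed = (UNRESERVED | RESERVED) - DITA_PROHIBITED
    return bool(key) and all(ch in allowed for ch in key)


def legal_liberal(key: str) -> bool:
    """Permissive reading: any character that could be carried in a URI
    once escaped, minus the DITA-prohibited ones and whitespace."""
    return bool(key) and not any(
        ch in DITA_PROHIBITED or ch.isspace() for ch in key
    )


if __name__ == "__main__":
    for key in ("user@host=1", "caf\u00e9", "a/b"):
        print(repr(key), legal_strict(key), legal_liberal(key))
```

Under this sketch a key such as "café" is rejected by the strict reading
and accepted by the permissive one, which is exactly the disagreement.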

There is definitely value in allowing a wide range of characters as keys,
e.g., accented characters, characters from Asian and Middle Eastern
writing systems, etc.

The primary practical concern is string matching--processors have to be
able to reliably compare two key names to determine whether they are the
same. Once you allow non-ASCII characters generally, you run into Unicode
normalization issues: some characters can be represented in several
different ways per the Unicode spec, e.g., an accented letter as a single
precomposed code point or as a base letter followed by a combining
diacritical mark. The XPath specification has lots of language and
infrastructure around this issue (and might provide a short path to a
solution if we need one).
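
As a sketch of the matching problem, the following Python compares the
same visible key name in two different Unicode representations. The
choice of NFC as the normalization form and the function name same_key
are assumptions for illustration only; the DITA spec does not currently
say anything about normalization.

```python
import unicodedata


def same_key(a: str, b: str) -> bool:
    """Compare two key names after normalizing both to NFC.

    Case is left alone because the spec says the case of key names
    is significant.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    composed = "caf\u00e9"     # e-acute as one precomposed code point
    decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

    print(composed == decomposed)          # False: code points differ
    print(same_key(composed, decomposed))  # True: equal after NFC
    print(same_key("Key", "key"))          # False: case is significant
```

Without a step like this, two keys that look identical to an author can
fail to match in a processor.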

So we need to clarify what the rules for key names are and publish that
clarification in some appropriate way.

A good option might be to use the XML NMTOKEN definition as the basis for
key names, as that already allows pretty much every useful Unicode
character and disallows characters we already don't want
(https://www.w3.org/TR/REC-xml/#sec-common-syn). The main problem I see
with NMTOKEN is that it disallows characters that are not explicitly
disallowed by the current definition and that are allowed by the
conservative interpretation of "legal in a URI", for example, "@" and "="
are allowed for URIs but disallowed in NMTOKEN. So that could be a deal
breaker.

Of course, we could define key in terms of NMTOKEN plus additional
characters. 
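
A rough Python sketch of that idea follows. The character ranges are the
NameStartChar and NameChar productions from the XML 1.0 Recommendation
(Fifth Edition). The EXTRA set containing "@" and "=" is purely a
placeholder for whatever additions we might agree on, and the names
NMTOKEN and KEY_NAME are made up for this example.

```python
import re

# NameStartChar ranges from XML 1.0 (Fifth Edition).
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, and a few more characters and ranges.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Nmtoken ::= (NameChar)+
NMTOKEN = re.compile("[" + NAME_CHAR + "]+")

# Hypothetical additions on top of NMTOKEN (placeholder, not agreed).
EXTRA = "@="
KEY_NAME = re.compile("[" + NAME_CHAR + re.escape(EXTRA) + "]+")


def is_nmtoken(s: str) -> bool:
    return NMTOKEN.fullmatch(s) is not None


def is_key_name(s: str) -> bool:
    return KEY_NAME.fullmatch(s) is not None


if __name__ == "__main__":
    for s in ("product-name", "caf\u00e9", "a@b=c", "a/b", "two words"):
        print(repr(s), is_nmtoken(s), is_key_name(s))
```

With these definitions "a@b=c" is not an NMTOKEN but is accepted once the
extra characters are added, while "a/b" and anything containing
whitespace are rejected either way.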

The thing Jarno is asking for is a precise definition of the characters
allowed, that is, an explicit list of characters and character ranges. It
looks to me like NMTOKEN with additions is our fastest route to a precise
definition. 

Cheers,

Eliot


----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com

Follow-Ups:
- RE: Spec Clarification Issue: Characters Allowed in Key and Key Scope Names
  - From: Scott C Hudson <scott.hudson@jeppesen.com>