
entity-resolution message



Subject: minutes from ER TC 20010611


Present: David, Lauren, Paul, Norm, John, Tony

Normalizing system IDs:

The open question is what to do about normalization of characters; some are 
not allowed in URIs by RFC 2396. There are reserved characters and unwise 
characters.

What happens if I put a reserved character in a URI? (E.g., $) It can appear in an 
attribute value in an XML document. In the system ID in the DOCTYPE, they may or 
may not %-escape the $. We may have a $ in the catalog; the user may decide they 
should %-escape it in some place or another. If they don't %-escape it in both 
places, should it match anyway?

If we say no, there is a potential interoperability issue if one parser does the 
escaping for you. The catalog then needs at least 2 different entries to match 
both the escaped and unescaped versions. The unwise characters are even worse.

What should we do? Norm's first proposal:
Catalog processors must reduce the %-escaped characters to the equivalent octet 
before comparison. This is better than turning octets into %-escaped characters. 
These characters are just for comparison, not for sending over the wire. We are 
only looking for the first match. One consequence would be that there are problems 
if there is a % in there but it isn't a %-escaped character. There may be 2 URIs 
which differ in that one has an escaped character (e.g. for a slash) which should 
not be turned back into the decoded character because the difference in semantics 
means they are different URIs. So this proposal doesn't work.
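
A minimal sketch of why decode-before-compare fails, using Python's standard urllib.parse.unquote as a stand-in for "reduce %-escaped characters to the equivalent octet". The example.com URIs and the function name are hypothetical, chosen only for illustration.

```python
from urllib.parse import unquote

# Two hypothetical system IDs. In the first, %2F is an escaped slash that is
# part of a single path segment; in the second, the slash separates segments.
escaped_slash = "http://example.com/docs/a%2Fb.dtd"
literal_slash = "http://example.com/docs/a/b.dtd"

def decode_then_compare(left: str, right: str) -> bool:
    """Compare two identifiers after reducing %-escapes to raw characters."""
    return unquote(left) == unquote(right)

# As plain strings the two identifiers differ ...
print(escaped_slash == literal_slash)
# ... but after decoding they collapse into the same string, so a catalog
# using this rule could no longer tell them apart.
print(decode_then_compare(escaped_slash, literal_slash))
```

The two identifiers are distinct as strings, yet the decoded comparison reports a match, which is the loss of distinction described above.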

Characters which are not allowed to appear in URIs must be %-escaped. Unreserved 
characters can be escaped but don't have to be.

When presented with a system ID or URI reference, we should %-escape any character 
which is not unreserved. XML 1.0 2nd ed. does specify some of this; we need to 
decide whether the catalog processor is part of the XML processor or the 
application. If this escaping has been done once, what happens if it's done again? 
The % character is not %-escaped in section 4.2.2 in XML 1.0 2nd ed. So this 
escaping would basically be a no-op. The XML processor must escape disallowed 
characters. The XML processor doesn't know what attribute values are URIs so won't 
%-escape everything anyway. So one proposal is that the catalog processor should 
apply the same %-escaping as in XML to all URIs.
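
The escaping proposed here could look roughly like the sketch below. It assumes one reading of XML 1.0 2nd ed. section 4.2.2: non-ASCII characters, control characters, space, and the RFC 2396 excluded characters other than # and % are converted to UTF-8 and each byte is written as %HH. The function name and the exact character set are assumptions for illustration, not text from the draft.

```python
# ASCII characters escaped under this sketch's assumed reading of section 4.2.2:
# space, the delimiters < > ", and the "unwise" characters except [ and ].
# The characters # and % are deliberately left alone.
_EXCLUDED_ASCII = set('<>"{}|\\^`') | {" "}

def xml_escape_system_id(value: str) -> str:
    """%-escape disallowed characters; leave everything else untouched."""
    out = []
    for ch in value:
        code = ord(ch)
        is_control = code <= 0x1F or code == 0x7F
        is_non_ascii = code > 0x7F
        if is_control or is_non_ascii or ch in _EXCLUDED_ASCII:
            # Escape each UTF-8 byte of the character as %HH.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)

once = xml_escape_system_id("http://example.com/my file.dtd")
twice = xml_escape_system_id(once)
print(once)
print(once == twice)
```

Because % itself is never escaped, applying the function to already-escaped input changes nothing, which matches the no-op observation above. A reserved character such as $ also passes through unchanged.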

After the %-encoding, we treat things as strings. A system ID can't have a 
fragment ID. In our URI matching of IDs, we currently say they are URI references, 
which may have fragment identifiers. So we really just take them as strings. If 
there are a lot of URI references to the same document, then you need an entry in 
the catalog for each fragment ID in use.
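
To illustrate the consequence of pure string matching, here is a small sketch with a hypothetical catalog held in a dictionary. The URIs, local paths, and the lookup function are invented for the example.

```python
# Hypothetical catalog: each key is matched as an exact string, fragment included.
catalog = {
    "http://example.com/spec.xml#intro": "local/spec.xml#intro",
    "http://example.com/spec.xml#terms": "local/spec.xml#terms",
}

def lookup(uri_reference: str):
    """Return the mapped value for an exact string match, else None."""
    return catalog.get(uri_reference)

print(lookup("http://example.com/spec.xml#intro"))  # has its own entry
print(lookup("http://example.com/spec.xml#scope"))  # same document, no entry
```

The second lookup finds nothing even though the document itself is in the catalog, because the fragment makes it a different string.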

Consensus: the catalog processor does the %-escaping, as per XML 1.0 2nd ed.

Do we need a mechanism to match URIs starting with foo to URIs starting with bar? 
Or a mechanism to make the processor aware of the #?

Proposal: new type of catalog entry which is used during URI lookup. It maps URIs 
beginning with prefix1 to URIs beginning with prefix2, keeping the suffixes the 
same. Useful for mirror sites, fragment IDs, mapping absolute URIs to relative URIs.

Useful for system IDs; nobody wants it for public IDs.

We need a name for this: rewrite (comes from Apache). Precedence should be after 
direct match and before delegate. No objections to the name or the precedence.
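
A rough sketch of how the proposed rewrite entry and its precedence might behave. The data structures, the function name, and the choice of the longest matching prefix when several rewrite or delegate entries apply are assumptions made for this example; the minutes only fix the order direct match, then rewrite, then delegate.

```python
def resolve(uri, direct, rewrites, delegates):
    """Resolve a URI against hypothetical catalog data.

    direct:    dict mapping exact URI strings to replacement URIs
    rewrites:  list of (prefix1, prefix2) pairs
    delegates: list of (prefix, catalog_name) pairs
    Returns a (kind, value) tuple, or None when nothing matches.
    """
    # 1. A direct match wins over everything else.
    if uri in direct:
        return ("direct", direct[uri])

    # 2. Rewrite: swap prefix1 for prefix2 and keep the suffix unchanged.
    #    Picking the longest matching prefix is an assumption of this sketch.
    matching = [pair for pair in rewrites if uri.startswith(pair[0])]
    if matching:
        prefix, replacement = max(matching, key=lambda pair: len(pair[0]))
        return ("rewrite", replacement + uri[len(prefix):])

    # 3. Delegate only when neither of the above applied.
    candidates = [pair for pair in delegates if uri.startswith(pair[0])]
    if candidates:
        _, catalog_name = max(candidates, key=lambda pair: len(pair[0]))
        return ("delegate", catalog_name)

    return None

# Hypothetical entries: one mirrored document tree and one delegated tree.
direct = {"http://example.com/dtd/a.dtd": "file:///local/a.dtd"}
rewrites = [("http://example.com/dtd/", "http://mirror.example.org/dtd/")]
delegates = [("http://example.com/", "other-catalog.xml")]

print(resolve("http://example.com/dtd/a.dtd", direct, rewrites, delegates))
print(resolve("http://example.com/dtd/b.dtd#sec2", direct, rewrites, delegates))
print(resolve("http://example.com/schemas/c.xsd", direct, rewrites, delegates))
```

Since the suffix is carried over as is, a single rewrite entry covers every fragment ID under the prefix, which is one of the uses named in the proposal.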

June 8th spec has the fixes for URI and URI reference.

Is a catalog allowed by XML 1.0 2nd ed? We think so.

Norm will get the draft out this week. We will try next week to flush out the last 
remaining issues. Pay particular attention to the URI/URI reference wordings. Do 
we ever talk of URIs? Only in baseURI. Other than that, they are URI references.

Lauren

-----------

Lauren Wood, Director of Product Technology, SoftQuad Software
Chair, XML 2001 - Call for presentations now open at www.xmlconference.org


