

Subject: Re: [xri] URI normalization and comparison (was Minutes: XRI TC Telecon 2-3PM PT Thursday 2009-07-16)



On Jul 23, 2009, at 4:26 PM, Scott Cantor wrote:

> Eran Hammer-Lahav wrote on 2009-07-23:
>> How about <Subject set="beginswith">?
>
> Will has some other questions about this, but I was wondering...is there a
> URL normalization step here to deal with port 80/443 being present or not in
> the matching string and case, and so forth?
>
> In other words, is http://FOO.COM/thing the same as
> http://foo.com:80/thing?
>
> I was rather in favor of the original simple idea of breaking the fields out
> into XML and making it easy to express the normalization/matching.


I'm happy with whatever works, but we don't really have a strong
proposal for fields. Here are the inclusive field matchers used by
POWDER[1]:
* schemes
* hosts
* ports
* exactpaths
* pathcontains
* pathstartswith
* pathendswith
* querycontains
* iripattern
* regex
* resources

Can we make it work with fewer than that? It seems like
schemes+hosts+ports+pathstartswith should do the trick. iripattern[2]
looks like it might be a solid replacement for schemes+hosts+ports, but
it is a bit complicated to implement.
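
To make the reduced field set concrete, here is a rough Python sketch of a matcher built from only schemes, hosts, ports and pathstartswith. The class name UriMatcher and its field layout are hypothetical, not taken from POWDER or any XRI draft. It assumes that an empty field means "no constraint", that a host matches when it equals a listed host or is a subdomain of it, and that a missing port is treated as the scheme's default.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Assumption: only http and https defaults are needed for this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


@dataclass
class UriMatcher:
    """Hypothetical matcher; an empty field places no constraint on the URI."""
    schemes: frozenset = frozenset()
    hosts: frozenset = frozenset()
    ports: frozenset = frozenset()
    pathstartswith: tuple = ()

    def matches(self, uri: str) -> bool:
        parts = urlsplit(uri)
        # urlsplit lowercases the scheme; .hostname lowercases the host.
        host = (parts.hostname or "").rstrip(".")
        # A missing port is treated as the default port for the scheme.
        port = parts.port or DEFAULT_PORTS.get(parts.scheme)
        path = parts.path or "/"

        if self.schemes and parts.scheme not in self.schemes:
            return False
        # Assumption: a listed host also covers its subdomains.
        if self.hosts and not any(
            host == h or host.endswith("." + h) for h in self.hosts
        ):
            return False
        if self.ports and port not in self.ports:
            return False
        if self.pathstartswith and not path.startswith(tuple(self.pathstartswith)):
            return False
        return True


matcher = UriMatcher(
    schemes=frozenset({"http"}),
    hosts=frozenset({"foo.com"}),
    ports=frozenset({80}),
    pathstartswith=("/thing",),
)
print(matcher.matches("http://FOO.COM/thing"))       # case and implicit port
print(matcher.matches("http://foo.com:80/thing/x"))  # explicit default port
print(matcher.matches("https://foo.com/thing"))      # wrong scheme
```

Under these assumptions, Scott's two example URIs both match the same rule, because the host is compared in lowercase and an absent port is read as the default.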


Regardless of the URI matcher syntax, there needs to be some sort of  
normalization for the URIs that are being matched against. Unless  
there's a reason not to, I'd recommend the same IRI canonicalization  
as POWDER[3]. This amounts to
* Convert to Unicode
* Convert percent-encoded triples to literals, leaving spaces and
reserved chars encoded
* Convert to Unicode Normalization Form C (NFC)

* Use a scheme of 'http' if one is not defined or empty
* Use a path of '/' if one is not defined
* Remove any trailing '.' characters from the host
* Convert a host with non-ASCII characters using the ToASCII
operation, with the UseSTD3ASCIIRules flag unset and the
AllowUnassigned flag set
* Convert scheme to lowercase
* Convert host to lowercase
* Remove the port if it is the default port
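
The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the POWDER reference algorithm: the function name canonicalize is made up, only http and https default ports are known, the input is assumed to be a hierarchical URI with an authority, and the ToASCII step is approximated with Python's built-in "idna" codec, which does not expose the UseSTD3ASCIIRules or AllowUnassigned flags.

```python
import re
import unicodedata
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}   # assumption: only these two
RESERVED = set(":/?#[]@!$&'()*+,;=")         # RFC 3986 gen-delims + sub-delims
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    """Decode a run of %XX triples, keeping reserved chars and spaces encoded."""
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8: leave the run untouched
    out = []
    for ch in text:
        if ch in RESERVED or ch in " %" or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append(f"%{ord(ch):02X}")  # all of these are ASCII
        else:
            out.append(ch)
    return "".join(out)


def canonicalize(uri: str) -> str:
    # Input is already a Unicode str; decode triples, then apply NFC.
    uri = unicodedata.normalize("NFC", _PCT_RUN.sub(_decode_run, uri.strip()))
    if "://" not in uri:
        uri = "http://" + uri              # default scheme
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").rstrip(".")  # .hostname is already lowercase
    if not host.isascii():
        host = host.encode("idna").decode("ascii")  # approximation of ToASCII
    if ":" in host:
        host = f"[{host}]"                 # restore brackets on IPv6 literals
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"         # keep only non-default ports
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, parts.fragment))


print(canonicalize("http://FOO.COM/thing"))
print(canonicalize("http://foo.com:80/thing"))
```

With this sketch, the two URIs from Scott's question canonicalize to the same string, so a plain string comparison after canonicalization treats them as equal.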

I'm pretty sure that this is just standard IRI->URI mapping with sane  
scheme- and protocol-based normalizations.

If you really want to investigate other normalization options, see
* RFC 3986 URI Syntax, section 6 Normalization and Comparison, http://tools.ietf.org/html/rfc3986#section-6
* RFC 3987 IRIs, section 5 Normalization and Comparison, http://tools.ietf.org/html/rfc3987#section-5
* URISpace 1.0, http://www.w3.org/TR/urispace

Joseph Holsten


P.S. Just in case you were wondering, URI matching is half the reason
that POWDER is a gigantic standard for the little task it accomplishes.
I'm glad we aren't trying to tackle RDF serialization, which is the
other half of their pain.


1: http://www.w3.org/TR/powder-grouping/#appA
2: http://www.w3.org/TR/powder-grouping/#wild
3: http://www.w3.org/TR/powder-grouping/#canon

