Subject: Re: [xri] URI normalization and comparison (was Minutes: XRI TC Telecon 2-3PM PT Thursday 2009-07-16)
On Jul 23, 2009, at 4:26 PM, Scott Cantor wrote:

> Eran Hammer-Lahav wrote on 2009-07-23:
>> How about <Subject set="beginswith">?
>
> Will has some other questions about this, but I was wondering...is there a
> URL normalization step here to deal with port 80/443 being present or not in
> the matching string and case, and so forth?
>
> In other words, is http://FOO.COM/thing the same as http://foo.com:80/thing?
>
> I was rather in favor of the original simple idea of breaking the fields out
> into XML and making it easy to express the normalization/matching.

I'm happy with whatever works, but we don't really have a strong proposal for fields. Here are the inclusive field matchers used by POWDER[1]:

* schemes
* hosts
* ports
* exactpaths
* pathcontains
* pathstartswith
* pathendswith
* querycontains
* iripattern
* regex
* resources

Can we make it work with fewer than that? Seems like schemes+hosts+ports+pathstartswith should do the trick. iripattern[2] looks like it might be a solid replacement for schemes+hosts+ports, but it is a bit complicated to implement.

Regardless of the URI matcher syntax, there needs to be some sort of normalization for the URIs that are being matched against. Unless there's a reason not to, I'd recommend the same IRI canonicalization as POWDER[3]. This amounts to:

* Convert to Unicode
* Convert percent-encoded triples to literals, leaving spaces and reserved chars encoded
* Convert to Unicode Normalization Form C (NFC)
* Use a scheme of 'http' if one is not defined or empty
* Use a path of '/' if one is not defined
* Remove any trailing '.' characters from the host
* Convert a host with non-ASCII characters using the ToASCII operation, with the UseSTD3ASCIIRules flag unset and the AllowUnassigned flag set
* Convert scheme to lowercase
* Convert host to lowercase
* Remove the port if it is the default port

I'm pretty sure that this is just standard IRI->URI mapping with sane scheme- and protocol-based normalizations.
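To make the steps above concrete, here is a rough Python sketch of that canonicalization. It is only an illustration under several assumptions, not the POWDER algorithm itself: the names `canonicalize`, `DEFAULT_PORTS` and `KEEP_ENCODED` are made up for this example, only http/https default ports are known, only single-byte ASCII percent triples are decoded, userinfo is dropped, and Python's built-in `idna` codec (IDNA 2003) stands in for ToASCII without my having checked its flag handling against what POWDER asks for.

```python
import re
import unicodedata
from urllib.parse import urlsplit, urlunsplit

# Assumption: only these two schemes have a known default port here.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Space, '%' and the RFC 3986 reserved characters stay percent-encoded.
KEEP_ENCODED = set(" %:/?#[]@!$&'()*+,;=")


def _decode_triples(text):
    """Decode %XX triples, except spaces, reserved chars and non-ASCII bytes."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        # Simplification: multi-byte UTF-8 sequences are left encoded.
        if ord(char) < 0x80 and char.isprintable() and char not in KEEP_ENCODED:
            return char
        return match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def canonicalize(iri):
    """Return a canonical form of a hierarchical IRI (hypothetical helper)."""
    # Input is already a Python str, so "convert to Unicode" is implicit.
    iri = unicodedata.normalize("NFC", _decode_triples(iri.strip()))

    # Use 'http' when no scheme is given.
    if iri.startswith("//"):
        iri = "http:" + iri
    elif "://" not in iri:
        iri = "http://" + iri

    parts = urlsplit(iri)
    scheme = parts.scheme.lower()

    # .hostname is already lowercased; strip trailing dots.
    host = (parts.hostname or "").rstrip(".")
    if not host.isascii():
        # Stand-in for ToASCII; flag behaviour is an unverified assumption.
        host = host.encode("idna").decode("ascii")
    if ":" in host:
        host = f"[{host}]"  # restore brackets around an IPv6 literal

    # Keep the port only when it is not the scheme's default.
    port = parts.port
    netloc = host
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"

    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

With this sketch, `canonicalize("http://FOO.COM/thing")` and `canonicalize("http://foo.com:80/thing")` both come out as `http://foo.com/thing`, so the two URLs in Scott's question would compare equal after normalization.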
If you really want to investigate other normalization options, see:

* RFC 3986 URI Syntax, section 6 Normalization and Comparison, http://tools.ietf.org/html/rfc3986#section-6
* RFC 3987 IRIs, section 5 Normalization and Comparison, http://tools.ietf.org/html/rfc3987#section-5
* URISpace 1.0, http://www.w3.org/TR/urispace

Joseph Holsten

P.S. Just in case you were wondering, URI matching is half the reason that POWDER is a gigantic standard for the little task it accomplishes. I'm glad we aren't trying to tackle RDF serialization, which is the other half of their pain.

1: http://www.w3.org/TR/powder-grouping/#appA
2: http://www.w3.org/TR/powder-grouping/#wild
3: http://www.w3.org/TR/powder-grouping/#canon