From Joseph: gcs, lcs, both, neither; whitespace

For some reason, Joseph's messages to the list were bouncing, so he asked me to forward this.

Joseph, I don't think your super short version is right. The four GCS characters ( = @ $ + ) can be standalone, or be followed by the two LCS characters ( ! * ), or be followed by a literal.

The two LCS characters can be standalone or followed by a literal.

And - in the original ABNF - a literal can stand alone at the start of a segment.

But those are the only 3 cases.

I am finding myself agreeing with Markus that I'd prefer to get rid of bare literals and have all XDI subsegments be delimited. That still works nicely with my xdi-text proposal of wrapping XDI addresses in [/ ] syntax and XDI JSON strings in [{ }] syntax.

RE whitespace, it is simply not allowed anywhere in an XDI address. Only inside an XDI literal. And it is ignored completely in the JSON serialization as well.

That leaves us with just the question of colons, and whether they should be required as the starting delimiter of IRIs or not.

Thoughts?

Begin forwarded message:

From: Joseph Boyle <boyle.joseph@gmail.com>

Subject: Re: gcs, lcs, both, neither; whitespace

Date: February 5, 2013 4:54:10 PM PST

To: Markus Sabadello <markus.sabadello@gmail.com>

Cc: xdi2@googlegroups.com, OASIS - XDI TC <xdi@lists.oasis-open.org>
And if gcs and lcs are not completely independent, would we want to consider viewing a subsegment like this:

cs = "=" / "@" / "+" / "$" / "*" / "!" / "=*" / "@*" / "+*" / "$*" / "=!" / "@!" / "+!" / "$!"

subseg = cs [ xref / literal ]
On Feb 5, 2013, at 4:33 PM, Joseph Boyle <boyle.joseph@gmail.com> wrote:
A terse grammar would be one view we can give to help understand, but doesn't have to be what goes into a parser generator; grammars we use just have to agree on whether any given string is legal XDI or not.

Currently there are these possibilities for the prefix of a subsegment:

gcs: = @ + $

gcs lcs: =* @* +* $* =! @! +! $!

lcs: * !
neither: (only allowed if this is the only subsegment of the segment)

Is the prefix really best viewed as the cross product of two independent terms, with one case of 8 then arbitrarily excluded?
Or do the 8 possible nontrivial gcs lcs concatenations each have an individual meaning? Are they all meaningful and in use or is there at least one that should never be used?
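As a rough illustration of the four prefix cases listed above (not taken from any XDI parser), a subsegment's prefix could be classified along these lines. The function name `classify_prefix` is hypothetical, and the sketch treats gcs and lcs as two independent optional characters, which is exactly the assumption being questioned here.

```python
from typing import Tuple

GCS = set("=@+$")  # global context symbols from the list above
LCS = set("*!")    # local context symbols from the list above


def classify_prefix(subseg: str) -> Tuple[str, str]:
    """Return (prefix kind, remainder) for one subsegment string."""
    i = 0
    has_gcs = i < len(subseg) and subseg[i] in GCS
    if has_gcs:
        i += 1
    has_lcs = i < len(subseg) and subseg[i] in LCS
    if has_lcs:
        i += 1
    if has_gcs and has_lcs:
        kind = "gcs lcs"
    elif has_gcs:
        kind = "gcs"
    elif has_lcs:
        kind = "lcs"
    else:
        kind = "neither"
    return kind, subseg[i:]
```

For instance, `classify_prefix("=!1234")` separates the two-character prefix from the remainder `1234`, while a bare literal such as `abc` falls into the "neither" case.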

I see what you mean about allowing concatenated literals. The obvious interpretation is reading a string of xdi-pchar as a single literal. I think most language parsers are classified as "greedy" and will do this. I don't know if the parser generators we're using would have a problem with it.

I think this is partly because we currently don't explicitly consider whitespace in the grammar, and should, as programming language grammars do. I'm guessing the current de facto treatment in XDI2 is allowing and discarding spaces and tabs between the higher-level productions but not expecting spaces between lower-level productions; "= m a r k u s" should probably not be equivalent to "=markus". Generally in a language where a sequence of two identifiers without other intervening operator is allowed, they will be separated by whitespace.
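A minimal sketch of the strict reading, in which whitespace inside an address is rejected rather than silently discarded, so that "= m a r k u s" can never be taken for "=markus". The helper names are made up for illustration, and nothing here is a statement about what XDI2 actually does.

```python
def contains_whitespace(address: str) -> bool:
    """True if any character of the address is whitespace (space, tab, newline, ...)."""
    return any(ch.isspace() for ch in address)


def same_address(a: str, b: str) -> bool:
    """Compare two addresses without normalizing whitespace away."""
    if contains_whitespace(a) or contains_whitespace(b):
        raise ValueError("whitespace is not allowed in an XDI address")
    return a == b
```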
On Feb 5, 2013, at 2:41 PM, Markus Sabadello <markus.sabadello@gmail.com> wrote:
I think I agree we don't need a leading delimiter for IRIs.

I'm now totally against any "bare literals", except in an xref.

I also agree the xdi-text and xdi-json could be useful, but I don't see how this is relevant to the other ABNF questions at hand.

I think in IRIs, parens should be percent-encoded to avoid confusion with XDI xrefs.
Markus
On Tue, Feb 5, 2013 at 12:30 AM, Drummond Reed <drummond@connect.me> wrote:
Joseph, this is really cool!

It also highlights the important decision we have to make about bare literals. Seeing it put this way makes it even clearer how streamlined the ABNF can be if we allow bare literals.

Over the weekend I also had an idea about how to deal with the bare literal problem in the first segment of an XDI address when it exists in the wild (which may be an edge case, but still one we need to deal with).

The idea is for the xdi-text rule that I propose in https://wiki.oasis-open.org/xdi/XdiAbnf/Discussion (which is basically to enclose XDI addresses or XDI JSON documents that appear in running text inside square brackets to make them easy to recognize and parse) to use an additional forward slash to prefix the first segment of an XDI address. So instead of an XDI text block that contains an XDI statement consisting of all bare literals looking like:

[abc/def/xyz]

...it would look like this:

[/abc/def/xyz]

The reason I particularly like this is that now an XDI text block in any running text or markup document would be recognizable using just two rules:
1. An embedded XDI address would always start with [/
2. An embedded XDI JSON document would always start with [{
Examples of the first rule:
[/=drummond]
[/=drummond/+friend/(http://xdi.org/user/markus)]
Example of the second rule:
[{"=drummond/+friend":["(:http://xdi.org/user/markus)"]}]
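To make the two recognition rules concrete, here is a small sketch of a scanner that pulls both kinds of block out of running text. It rests on assumptions that go beyond the proposal: an embedded address is taken to end at the first "]", and an embedded JSON document is delimited by letting Python's standard `json` decoder find where the object ends. The function name `find_xdi_blocks` is hypothetical.

```python
import json
from typing import List, Tuple


def find_xdi_blocks(text: str) -> List[Tuple[str, str]]:
    """Return (kind, content) pairs for each embedded XDI block found in text."""
    blocks = []
    decoder = json.JSONDecoder()
    i = 0
    while i < len(text):
        if text.startswith("[/", i):
            # Assumption: the address itself contains no "]".
            end = text.find("]", i)
            if end != -1:
                blocks.append(("address", text[i + 2:end]))
                i = end + 1
                continue
        elif text.startswith("[{", i):
            try:
                # Parse one JSON value starting at the "{"; end is the index after it.
                _, end = decoder.raw_decode(text, i + 1)
            except json.JSONDecodeError:
                end = -1
            if end != -1 and text.startswith("]", end):
                blocks.append(("json", text[i + 1:end]))
                i = end + 1
                continue
        i += 1
    return blocks
```

Only the two-character openers "[/" and "[{" trigger any work, so ordinary bracketed text is skipped.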
If this approach can solve the problem of bare literals being allowed at the start of a first segment, then the question is: should we allow bare literals at the beginning of any segment in order to support this very streamlined ABNF parsing?

The second question is: should we stay with the current approach of just delimiting an IRI inside a cross-reference by looking for the colon following the scheme name (which is required by IRI syntax), or should we require a leading delimiter? My gut is the same as Joseph's here, which is that it is okay to parse for the colon following the IRI scheme name. Even though this is a little bit slower than just looking for a leading colon, it is simpler because it only requires "wrapping" the URI in parentheses.
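A sketch of what "parse for the colon following the scheme name" could look like, assuming the generic scheme syntax from RFC 3986 (a letter followed by letters, digits, "+", "-" or "."). The name `xref_holds_iri` is hypothetical, and only the scheme is checked, not the rest of the IRI.

```python
import re

# scheme ":" per RFC 3986: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) ":"
SCHEME_COLON = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def xref_holds_iri(inner: str) -> bool:
    """Guess whether the text between "(" and ")" is an IRI rather than an XDI address."""
    return SCHEME_COLON.match(inner) is not None
```

An address beginning with a context symbol never matches because a scheme must start with a letter. A literal that contains a colon, such as `abc:def`, would match, which is one reason the question of colons in literals is tied to this one.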

How do others feel about these two questions?

=Drummond
On Mon, Feb 4, 2013 at 11:01 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

Now not up to date, but for reference, I made these the other day with http://railroad.my28msec.com/rr/ui :

<address.png>

<subseg.png>

<xref.png>

<literal.png>

The EBNF input was:
address ::= subseg+ ('/' subseg+ ('/' subseg+)?)?
subseg ::= [=@+$] [*!] (xref | literal)
xref ::= '(' (IRI | address) ')'

literal ::= (iunreserved | pct-encoded | [&;,':])+

On Feb 3, 2013, at 6:58 PM, Joseph Boyle <planetwork@josephboyle.net> wrote:

Drummond,

You are correct, excluding a specific trivial case can actually force more complexity in rules. The old ABNF had some examples of this. This is one reason why allowing a bare literal as a segment seems more natural to me.

The xref rule with added initial colon might need more grouping brackets: "(" [ [ ":" IRI ] / address ] ")"

I actually think allowing simply (http:// … ) with its own noninitial colon as an IRI xref would only add a little (finite) complexity to parsing, as opposed to some of the exponentially growing parse trees we may have been hitting in the past, and would look good for XDI's first-class support of IRIs - I posted a comment on this to https://wiki.oasis-open.org/xdi/XdiAbnf/Discussion earlier today.

Also noted the tel: and sms: schemes can have matched parentheses in their bodies, so if we allow these we may have to allow matched parentheses in IRIs, and do parenthetical depth counting as we parse IRIs, unless we require clients to escape and unescape all the internal parens. If we're scanning for parens, checking for the internal colon after the scheme is not much additional work.
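The depth counting mentioned above might be sketched as follows. This is an illustration only; the function name is made up, and percent-encoded parentheses (%28, %29) are deliberately not counted since they are not literal parens.

```python
def find_matching_paren(s: str, start: int) -> int:
    """Return the index of the ")" that closes the "(" at s[start], or -1 if unbalanced."""
    if start >= len(s) or s[start] != "(":
        raise ValueError("start must point at an opening parenthesis")
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1
```

The scan is a single left-to-right pass, so its cost is linear in the length of the xref regardless of how deeply the parentheses nest.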

Joseph

On Feb 3, 2013, at 2:33 PM, Drummond Reed <drummond@connect.me> wrote:

Joseph,

First, thanks very much for this analysis of the ABNF. I hadn't appreciated it in detail until I studied it after Friday's telecon. Condensing it down to four lines is a FANTASTIC way of seeing the essence of the ABNF.

Based on our discussion on Friday's call, and if you follow the recommendations I posted to https://wiki.oasis-open.org/xdi/XdiAbnf/Discussion (namely, not allowing colons in literals, and using colons to prefix IRIs within cross-references), here's a revised version of your four-line ABNF if bare literals are allowed to begin segments:

OPTION #1: IF BARE LITERALS ARE ALLOWED

address = 1*subseg [ "/" 1*subseg [ "/" 1*subseg ] ] ;
subseg = [ "=" / "@" / "+" / "$" ] [ "*" / "!" ] [ xref / literal ] ;
xref = "(" [ ":" IRI / address ] ")";

literal = 1*[ iunreserved / pct-encoded ] ;

If bare literals are NOT allowed, as in the proposal we discussed on Friday, then I could only condense the ABNF into six rules:

OPTION #2: IF BARE LITERALS ARE NOT ALLOWED

address = 1*subseg [ "/" 1*subseg [ "/" 1*subseg ] ]

subseg = global / local / xref
global = ( "=" / "@" / "+" / "$" ) [ "*" / "!" ] [ xref / literal ]

local = ( "*" / "!" ) [ xref / literal ]
xref = "(" [ ":" IRI / address / literal ] ")"
literal = 1*[ iunreserved / pct-encoded ]

Two questions:
1. Am I missing something - do you see a way to compact it further?
2. Will there be any real difference in efficiency of parsing between these two (given that Option #2 is actually narrower than Option #1 because it excludes bare literals)?
Thanks,

=Drummond

On Fri, Feb 1, 2013 at 9:34 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

Markus, thanks for the recognition, glad to be able to help out.

Drummond, do we need to exclude bare literals as segments at the syntax level? It seems to me they may be semantically trivial, but are syntactically consistent.

Just experimenting with finding a minimal set of verification rules (for clarity, omitting naming all the productions we want as parsing results) if bare literals are allowed, the grammar can be as short as:

address = 1*subseg [ "/" 1*subseg [ "/" 1*subseg ] ] ;
subseg = [ "=" / "@" / "+" / "$" ] [ "*" / "!" ] [ xref / literal ] ;

xref = "(" [ IRI / address ] ")";
literal = 1*[ iunreserved / pct-encoded / "&" / ";" / "," / "'" / ":" ] ;
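For anyone who wants to experiment, here is a hand-written recognizer sketched from the four rules above. It is a verification-only toy and makes several assumptions: the square brackets in `1*[ ... ]` and inside the xref rule are read as grouping, iunreserved is narrowed to the ASCII unreserved characters, an IRI is accepted on the strength of its scheme and colon alone, and an empty subsegment is not counted. All function names are hypothetical.

```python
import re

# literal = 1*( iunreserved / pct-encoded / "&" / ";" / "," / "'" / ":" )
# Assumption: iunreserved is narrowed to ASCII ALPHA / DIGIT / "-" / "." / "_" / "~".
LITERAL = re.compile(r"(?:[A-Za-z0-9\-._~&;,':]|%[0-9A-Fa-f]{2})+")
# Assumption: an IRI is recognized only by scheme ":"; its body is not validated.
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def match_xref(s: str, i: int) -> int:
    """Return the index just past the xref that opens at s[i], or -1 if malformed."""
    depth = 0
    for k in range(i, len(s)):
        if s[k] == "(":
            depth += 1
        elif s[k] == ")":
            depth -= 1
            if depth == 0:
                inner = s[i + 1:k]
                if SCHEME.match(inner) or is_address(inner):
                    return k + 1
                return -1
    return -1  # no matching ")"


def match_subseg(s: str, i: int) -> int:
    """Index just past one subsegment at s[i]; i if nothing matched; -1 on a bad xref."""
    j = i
    if j < len(s) and s[j] in "=@+$":  # optional gcs
        j += 1
    if j < len(s) and s[j] in "*!":    # optional lcs
        j += 1
    if j < len(s) and s[j] == "(":
        return match_xref(s, j)
    m = LITERAL.match(s, j)
    return m.end() if m else j


def is_address(s: str) -> bool:
    """address = 1*subseg [ "/" 1*subseg [ "/" 1*subseg ] ]"""
    i = 0
    for _ in range(3):                 # at most three segments
        start = i
        while True:
            j = match_subseg(s, i)
            if j == -1:
                return False
            if j == i:                 # no further subsegment here
                break
            i = j
        if i == start:                 # empty segment
            return False
        if i == len(s):
            return True
        if s[i] != "/":
            return False
        i += 1
    return False                       # a fourth segment or a trailing "/"
```

Because bare literals are allowed, `is_address("abc/def/xyz")` is accepted by this sketch, which is the behavior under discussion.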

On Jan 31, 2013, at 11:30 PM, Drummond Reed <drummond@connect.me> wrote:

Markus, thanks, this is great work. I have reviewed this and am in agreement with the changes.

The support for a literal as a standalone value at the start of an XDI segment has always been somewhat theoretical, i.e., we originally did it that way to not rule it out (because the preceding slash could be a delimiter). But that does not work for the first segment of an XDI address.

So I agree that it's cleaner to just require all XDI segments to start with delimited subsegments.

I'll add this to the agenda for tomorrow's telecon.

=Drummond

On Thu, Jan 31, 2013 at 5:51 PM, Markus Sabadello <markus.sabadello@xdi.org> wrote:

Hello XDI TC,

Based on implementation experience and some discussions, I added another slightly changed version of the XDI ABNF to the discussion page of the relevant proposal:
https://wiki.oasis-open.org/xdi/XdiAbnf/Discussion

Here's the summary from the page:
1. Some of the changes here are motivated by the insight that the purpose of an ABNF is not only to validate a string against a set of rules, but also to semantically understand the various components of that string.

2. The "xdi-inner-graph" rule is introduced, in order to have an explicit rule for this fundamental XDI construct. This change doesn't affect what is valid XDI and what is not.
3. The "xdi-context" rule is introduced, for the same reason.

4. The "xdi-segment" rule is changed to no longer permit a literal at the beginning. A segment that does not start with a context symbol, and is not a cross-reference, does not appear to be useful, and it might be ambiguous with regard to other rules.

5. The "xref-literal" rule is introduced, in order to still allow literals in cross-references.

I tested this ABNF in the XDI2 library, and it appears to work fine.

In fact, I have recently added to XDI2 support for a new parser library (APG), in addition to the one I had been using before (aParse).

After evaluating them both, my conclusion is that they are both able to handle the XDI ABNF, that they produce the same results, and that APG is about twice as fast as aParse.
So APG will now be standard in XDI2, but aParse is optionally also still supported.

I have spent quite some time thinking about Joseph Boyle's ideas about optimizing the parsing process in smart ways, for example by simply "skipping" from an opening "(" to a closing ")" in order to avoid having to descend deep into the IRI rules. This sounds quite good to me, I just haven't found a way to actually implement that yet in a way that still ensures robustness and correctness of the parsing process. I think it was also Joseph who early on suggested that XRI parsing might be one of the most resource-intensive tasks of an XDI server, and I think that is very right. So while switching to a faster parsing library is a great step, we'll keep looking for further optimizations.

You can use the following tool to experiment with the most recent ABNF proposal I mentioned above:
http://xdi2.projectdanube.org/XRIParser

Markus

--
You received this message because you are subscribed to the Google Groups "XDI2" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xdi2+unsubscribe@googlegroups.com.
To post to this group, send email to xdi2@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
