RE: [legaldocml] On ids and numbering

Thanks to help from Monica, I better understand the hcontainer name="value" issue. Although it does allow for consistency, it both breaks the semantic nature of the id values and reinforces the general AKN preference for core elements over extensibility. I might still allow jurisdictions whose own names are used in the hcontainer name attribute to follow an alternative convention of prefixing that name with a tilde.

However, the abbreviating of the element names is still very troublesome and no one has explained why it is absolutely necessary. And even to the extent that abbreviating helps with shorter ids, it does not fully solve the issue of the sixteen character limit that eXist apparently has.

-----Original Message-----
From: daniel@citizencontact.com
Sent: Tuesday, December 10, 2013 10:58am
To: "Fabio Vitali" <fabio@cs.unibo.it>
Cc: "legaldocml@lists.oasis-open.org" <legaldocml@lists.oasis-open.org>, "akomantoso-xml@googlegroups.com" <akomantoso-xml@googlegroups.com>
Subject: RE: [legaldocml] On ids and numbering

Link to my response with full text below:

https://docs.google.com/document/d/1rGvVnvg4PaWD_ql2QjK1nqfq7WtPUJlCnOV0dUoDU9E/pub

First, I am very excited about the potential for using the id within legal documents. This will make the electronic citation possibilities stronger and easier to integrate with other document types. For example, opinion papers, journal articles and other documents that refer to multiple fields can be better structured and more useful.

However, there are some issues that seem apparent to me with the current recommendation. I would start out with what I believe are the requirements for a well constructed protocol for assigning id attribute values.

- Most importantly, the rules must never allow collisions; that is, two elements must never be able to receive equal id values. Currently, the recommended rules do not seem to rule out such collisions.
- I think it is good practice to use either a human-understandable system or a clearly computer-friendly one. Some systems use both side by side, for example a URL that includes both the title and a serial number, often divided by a slash, which is a reasonable compromise. But I think blending the two is a mistake, and abbreviations that do not systematically correspond with the other portions are a problem. I go through the problems and solutions below.
- There must be an automated method for generating the ids, at least for most jurisdictions. I have researched other systems that effectively use XSL to auto-generate ids. The current recommendation, I believe, makes this impossible for many AKN developers.
- A system should not have to be designed around a specific tool's capabilities. Y2K was an example of a bad design decision based on then-current capabilities. If this is an overwhelming current need, then I would use an alternative system for id values. I found a few that would meet the need and still meet the other requirements I listed.

I realize that id attribute values in use are often globally unique within a greater set of documents, or semantic in quality, and/or endeavor to match an XPath-like value. I see value in any of these methods of determining the id values. For legal documents, going with the XPath-like value has great utility and logic. The recommendation also includes a semantic-like system in that it mirrors the semantically named elements.

My issues are with the specific methods of attaining these goals.

Using abbreviations for the elements creates problems.
- It diminishes the semantic nature
- It may diminish the ability to automatically generate id values that are unique
- It will not meet the 16 character length limit for indexing by a specific tool being used
- There is not currently a method for embedding or attaching a list of the abbreviations which is necessary if abbreviations end up in the final recommendation
The current recommendation, by not differentiating the element name id equivalent with potential user determined attribute values for the hcontainer, will lead to collisions.
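
To illustrate the length point, here is a minimal sketch that measures a few ids taken from the examples in the proposal quoted further down against a 16-character limit. The constant and function names are hypothetical, and the limit is only the figure reported in this thread; I have not verified it against the tool.

```python
# Hypothetical check: which proposed ids exceed the reported index limit?
MAX_INDEXED_LENGTH = 16  # assumption: the limit as reported in this thread


def exceeds_limit(id_value: str, limit: int = MAX_INDEXED_LENGTH) -> bool:
    """Return True when the id is longer than the limit."""
    return len(id_value) > limit


# Ids copied from the examples in the proposal.
sample_ids = [
    "art_12",
    "art_12__p_2",
    "doc_1__art_12",
    "art_12__p_2__ref_3",
    "transitional__art_12",
]
too_long = [i for i in sample_ids if exceeds_limit(i)]
```

Even with three-letter abbreviations, the last two ids in the list are longer than 16 characters once a prefix is attached.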

My suggested fixes for dealing with the abbreviations:

- Either stop abbreviating the element equivalent id values, or
- Rename the elements to match the abbreviations, or
- Include a table or method of embedding the element name to abbreviation mapping that can be accessed by automated processes, or
- Change completely to a system that is less semantic and is XPath-like, with child/level abbreviations. I have seen examples of this method where each nested level is about four characters or less, meeting the indexing desire.

My suggestion for dealing with the element name and hcontainer attribute value potential collisions:

Never allow hcontainer name attribute values to be abbreviated for use in the id attribute; otherwise this may allow yet more collisions (pippo and pippopippo might get the same five-character abbreviation).
- Or include a rigorous abbreviation system for user-defined hcontainer name attribute values.
There must be a method, a special character or something similar, within an id attribute to differentiate an element name from an hcontainer name attribute value. This could be a tilde or another URI/IRI-friendly character (one that does not need to be escaped) not already specified for the id values.
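
A minimal sketch of the two collision cases, assuming a naive abbreviation rule that keeps the first five characters of a name. The abbreviation table only borrows values that appear in the proposal's examples, and every function name here is hypothetical. Whether a tilde is acceptable for the id attribute's datatype is not checked.

```python
# Assumed element abbreviations, borrowed from examples in the proposal.
ELEMENT_ABBR = {"article": "art", "chapter": "chp", "title": "ttl"}


def truncate(name: str, length: int = 5) -> str:
    """Naive abbreviation (an assumption): keep the first `length` characters."""
    return name[:length]


def element_token(element: str) -> str:
    """Id token for a core element, looked up in the abbreviation table."""
    return ELEMENT_ABBR[element]


def hcontainer_token(name: str) -> str:
    """Suggested convention: a tilde marker plus the full, unabbreviated name."""
    return "~" + name


# Truncation makes two different user-defined names collide.
names_collide = truncate("pippo") == truncate("pippopippo")

# Without a marker, an hcontainer named "art" equals the <article> abbreviation.
unmarked_collide = "art" == element_token("article")

# With the marker and no abbreviation, both collisions disappear.
marked_distinct = (
    hcontainer_token("art") != element_token("article")
    and hcontainer_token("pippo") != hcontainer_token("pippopippo")
)
```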

Writing an XSL auto-generator or an end-user parser for the id values will be more difficult, or impossible, without dealing with the two issues I listed.

Final word:

I would find acceptable any id value system that could be autogenerated using XSL for most jurisdictions. That might allow for end user tools as well. I do not believe that the current recommendation will allow for this. Perhaps each jurisdiction will need their own XSL generator, but that would be a poor outcome.

I would prefer a system that avoided the Y2K abbreviation even with safeguards. Either implement a rigorous XPATH simple level/nest system or fully implement the semantic equivalent name.

-----Original Message-----
From: "Fabio Vitali" <fabio@cs.unibo.it>
Sent: Tuesday, December 3, 2013 5:13am
To: "legaldocml@lists.oasis-open.org" <legaldocml@lists.oasis-open.org>
Cc: "akomantoso-xml@googlegroups.com" <akomantoso-xml@googlegroups.com>
Subject: [legaldocml] On ids and numbering

Dear all,

let me make a proposal on numbering and ids based on the discussions we had in the past teleconfs. Most of the actual abbreviations I used are invented on the spot, so please don't look at them critically.

The generic syntax for an id is the following:

[prefix "__"] abbr ["_" num]

* prefix is a (possibly empty) string providing uniqueness to the remaining part of the id, and based on the context in which the element appears.
* abbr is an abbreviation describing the element, and it is drawn from the list of abbreviations that Veronique has first created and Grant improved.
* num is a (possibly empty) representation of the numbering of the element within its context. If the element is necessarily unique within its context, no numbering is used.
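
As a minimal sketch of this syntax, the following builds and parses ids of the form [prefix "__"] abbr ["_" num]. It assumes that neither abbr nor num contains an underscore, which the syntax above does not state explicitly, and all names in it are hypothetical.

```python
from typing import NamedTuple, Optional


class ParsedId(NamedTuple):
    prefix: Optional[str]  # id of the context element, if any
    abbr: str              # abbreviation of the element name
    num: Optional[str]     # number within the context, if any


def build_id(abbr: str, num: Optional[str] = None, prefix: Optional[str] = None) -> str:
    """Assemble an id from its parts."""
    local = abbr if num is None else f"{abbr}_{num}"
    return local if prefix is None else f"{prefix}__{local}"


def parse_id(id_value: str) -> ParsedId:
    """Split an id at the last "__", then split the local part at the first "_"."""
    prefix, sep, local = id_value.rpartition("__")
    abbr, _, num = local.partition("_")
    return ParsedId(prefix if sep else None, abbr, num or None)
```

For example, build_id("art", "12", prefix="doc_1") gives "doc_1__art_12", and parse_id("art_12__cnt") gives prefix "art_12", abbr "cnt" and no number.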

This is how to use this syntax:

* An explicitly numbered element is an element that is numbered by the author of the expression, so that it is not the task of the author of the markup to establish such number, but only to recognize it in the text. Such number is most frequently placed in a <num> element inside the element's structure. An implicitly numbered element, on the other hand, is an element that was not numbered by the author of the expression, and therefore must be numbered in some ways by the author of the manifestation, and in particular has no <num> element relating to itself anywhere.

* The context of the numbering of elements <X> is the containing element <Y> that suggests, implies or forces a restart of the numbering of all internal <X>s. This can be either explicit (when the <X>s are explicitly numbered) or implicit (otherwise). Different contexts imply that elements with the same name may end up having the same abbr and the same number, and must therefore be disambiguated through the use of a prefix. The best option for such prefix is the id of the context element <Y>. For instance, in many traditions chapters restart numbering within every title, so "chp_2" for Chapter 2 could be ambiguous. Therefore, in these cases the id for Chapter 2 of Title I could be "ttl_I__chp_2" (assuming that "ttl_I" unambiguously identifies Title I).

* All document classes (act, bill, doc, etc.) are ALWAYS contexts. This means that, except in particular cases, all numbers restart whenever a new document class is started (e.g., in a composite document each document component has its own local numbering). Similarly, <quotedStructure> and <extractStructure> are always contexts, EVEN IF they do not force a **restart** of the numbering, but just a different numbering context within themselves. Finally, plain inline elements are NEVER contexts. For instance, the id for article 12 in the first document of a composite document will be "doc_1__art_12", while in the second document it will be "doc_2__art_12".

* Elements that are necessarily unique within a given context will require NO numbering. For instance, there is exactly ONE <body> in acts and bills, and therefore its id can be simply "body" (or "doc_1__body" in case of a composite document, of course). Analogously, there is at most ONE <content> element inside articles or sections, and therefore the id of the <content> element of article 12 will be simply "art_12__cnt".

* What constitutes a context is a tradition-dependent issue for explicitly numbered elements. That is to say, when the element is explicitly numbered we need to establish whether the numbering restarts multiple times in the same document. If this is the case, then the identification of the correct prefix requires the identification of the element causing the restart, and ITS id is used as prefix. For instance, while in many traditions articles are always globally numbered, in Latin America a special structure of transitional articles is added at the end of the document, whose numbering restarts. In this case, article 12 of the main part of the document will have id "main__art_12" while article 12 of the transitional part will be "transitional__art_12".

* For non-explicitly numbered elements, on the other hand, it is a manifestation-level decision to determine an element smaller than the document class (if any) that would constitute the context for the element. Obvious choices would be:
- none (i.e., all non-numbered elements are numbered from the beginning of the relevant document class).
- the closest hcontainer (i.e., all elements within a hierarchy would be numbered after the id of the containing hierarchical elements).
- the closest containing element (ignoring containing inlines): block, container, or hcontainer.

Suppose for instance that we are considering the numbering of the third <ref> element within the second <p> element of article 12. In turn, by counting from the smallest enclosing document class, article 12 is explicitly numbered, and the local tradition has no context except for the document, thus its id is always "art_12". The second <p> of the article is actually the 14th <p> of the whole document, and its third ref is actually the 9th of the whole document and the fifth of the article (because the first <p> has two more refs). Thus we have the following options:

- Case "none": 'art_12' for the <article>, 'p_14' for the <p> and 'ref_9' for the <ref>.
- Case "hcontainer": 'art_12' for the <article>, 'art_12__p_2' for the <p> and 'art_12__ref_5' for the <ref>.
- Case "closest containing element": 'art_12' for the <article>, 'art_12__p_2' for the <p> and 'art_12__p_2__ref_3' for the <ref>.
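
The three options can be compared with a minimal sketch that walks a small document tree and assigns ids under each policy. The Node class, the policy names and the HCONTAINERS and INLINES sets are all hypothetical, explicitly numbered elements are assumed to be unique within the whole document, and article 11 is filler added only so that the document-wide counts match the example above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

HCONTAINERS = {"art"}  # assumed: abbreviations of hierarchical containers
INLINES = {"ref"}      # assumed: abbreviations of inline elements


@dataclass
class Node:
    abbr: str
    num: Optional[str] = None  # explicit number, if the text carries one
    children: List["Node"] = field(default_factory=list)


def assign_ids(document: List[Node], policy: str) -> List[str]:
    """Return ids in document order; policy is 'none', 'hcontainer' or 'closest'."""
    counters: Counter = Counter()
    ids: List[str] = []

    def visit(node: Node, context: Optional[str]) -> None:
        if node.num is not None:
            # Explicitly numbered: assumed unique within the whole document.
            node_id = f"{node.abbr}_{node.num}"
        else:
            # Implicitly numbered: count per (context, abbreviation).
            counters[(context, node.abbr)] += 1
            local = f"{node.abbr}_{counters[(context, node.abbr)]}"
            node_id = f"{context}__{local}" if context else local
        ids.append(node_id)
        if policy == "none":
            child_context = None
        elif policy == "hcontainer":
            child_context = node_id if node.abbr in HCONTAINERS else context
        else:  # 'closest': nearest containing element that is not an inline
            child_context = context if node.abbr in INLINES else node_id
        for child in node.children:
            visit(child, child_context)

    for top in document:
        visit(top, None)
    return ids


def paragraph(refs: int) -> Node:
    """A <p> holding the given number of <ref> elements."""
    return Node("p", children=[Node("ref") for _ in range(refs)])


# Filler: 12 <p> and 4 <ref> occur before article 12.
art_11 = Node("art", "11", [paragraph(1) for _ in range(4)] + [paragraph(0) for _ in range(8)])
# Article 12: the first <p> has two refs, the second has three.
art_12 = Node("art", "12", [paragraph(2), paragraph(3)])
document = [art_11, art_12]
```

With this sample tree, the last id returned by assign_ids(document, policy) is "ref_9", "art_12__ref_5" or "art_12__p_2__ref_3" for the policies "none", "hcontainer" and "closest" respectively.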

Of course additional complexities may arise from mixing up policies depending on the elements (e.g. making akoma-specific elements become hcontainer-driven and html-specific elements become container-driven), but I would argue against this policy. My vote is cast for the case "hcontainer" but I will not cut my wrists in case another option is voted. I would strongly defend all the other proposals made in this message.

* One last discussion item regards abundant or incomplete references. An abundant reference is a reference, in particular the fragment part of an IRI, that contains more information than needed to match it to the id of an element. An incomplete reference, on the other hand, contains less information than necessary and therefore may point to more than one possible destination. BTW, we must never deal with abundant or incomplete ids in the id attributes of elements, since ids are created by the author of a manifestation, and therefore we should expect him/her to know what is needed to establish the minimum complete set of information to create an unambiguous id. We should only deal with abundant or incomplete references, since the author of a reference may not know everything about the document being mentioned in the text of the reference, and therefore he/she might create an incorrect reference that has too much or too little information.

In case of an abundant reference, the resolver should identify the relevant minimal id (if it exists) by removing prefixes until a perfect match is found; in case of an incomplete reference, on the other hand, the resolver must establish an interactive session with the user similar to the process of resolving work-level IRIs, and determine the missing information necessary to identify the id of a unique element.
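
The prefix-removal step for abundant references could look like the following minimal sketch. It assumes that prefixes are separated by "__" and are dropped one at a time from the left; the function name is hypothetical, and the interactive handling of incomplete references is out of scope.

```python
from typing import Iterable, Optional


def resolve_abundant(fragment: str, known_ids: Iterable[str]) -> Optional[str]:
    """Drop leading prefixes until the fragment equals a known id.

    Returns the matching id, or None when no suffix of the fragment matches.
    """
    ids = set(known_ids)
    candidate = fragment
    while True:
        if candidate in ids:
            return candidate
        # Remove the leftmost prefix, i.e. everything up to the first "__".
        _, sep, rest = candidate.partition("__")
        if not sep:
            return None  # nothing left to strip
        candidate = rest
```

For example, with known ids "art_12" and "art_12__p_2", the abundant fragment "ttl_I__chp_2__art_12" resolves to "art_12" after two prefixes are removed.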

let me know what you think.

Ciao

Fabio

--

Fabio Vitali
Dept. of Computer Science
Univ. of Bologna ITALY
phone: +39 051 2094872
e-mail: fabio@cs.unibo.it
http://vitali.web.cs.unibo.it/

Tiger got to hunt, bird got to fly,
Man got to sit and wonder "Why, why, why?"
Tiger got to sleep, bird got to land,
Man got to tell himself he understand.
Kurt Vonnegut (1922-2007), "Cat's cradle"
