legaldocml message

Subject: Re: [legaldocml] On <eop> and <eol>

From: Daniel Bennett <daniel@citizencontact.com>
To: Fabio Vitali <fabio@cs.unibo.it>, legaldocml@lists.oasis-open.org, akomantoso-xml@googlegroups.com
Date: Mon, 10 Feb 2014 11:44:10 -0500

There is a conundrum about the use of XML for documents in these types of situations. Much of the XML being used is for structural purposes, much like HTML. HTML5 added some new structural elements to give semantic meaning to the structural (block) elements. Of course, HTML still has <div class="x"> so that web publishers can create their own. Something that was not added to HTML was a way to number lines or pages. Of course, using <br /> and/or CSS allows for displaying text in a specific way.

However, for legal documents there is a tradition and explicit use of line and page numbers that is critical. And XML is a nested language, where the semantic structural elements cannot overlap with (that is, be un-nested with) the physical representations of lines and pages. I have heard of efforts to come up with non-XML languages to deal with this. And there have been many efforts to sneak line and page numbers into XML.

I would like to step back and talk about the issue in terms of representing and relating the semantic and the physical. Right now the discussion is about the computer code level of the law versus the physical, often paper, and often official version of the law. I have heard some argue to just leave the physical version in the dustbin and move forward, but that is not possible for the law.

There is another systematic way to deal with this issue that will also answer some tough issues that have not even come up, like video and audio transcripts. I will create a slide show for an upcoming meeting (probably after the XML Prague conference, which is coming up), and I will describe it briefly in this email. Rather than breaking the current non-XML-friendly fix that Fabio proposes, I think we can leverage the <eop/> and <eol/> empty tags to do double duty.

Imagine a piece of paper with text. You can use line numbers to quickly refer to a specific line. This is natural. However, the line is an artifact of the physical nature of paper. With HTML, CSS media or just changing the browser size can often change the soft breaks of a document's text. PDFs provide the verisimilitude that HTML and most semantic-based XMLs cannot. What can be done is to map the physical to the semantic. Point A to point B is a line. And those can be put in a table with representations of where that line maps to in the XML document. XPath allows a precise representation of each part of the document. However, nesting tags for both semantic and physical points breaks XML. And empty tags do not provide a true XML answer. Really there should be a mapping document to relate the two.

For example, I can create PDF fragment URLs that show where the line physically is. I can also use XPath to create the two points (or more for columned text) that link to the physically represented URL. But rather than create the third document, the data can be stored in the HTML or XML document. Since <eol/> is the <br/> of the LegalDocML standard, and it is a quick fix for referring to a line's end, the <eol/> tag could also take additional attributes to hold the data for the beginning and end points of the physical line. This can work whether or not the physical copy is paper or PDF.
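
A minimal sketch of what such attributes might look like, and how a tool could pull them back out into a line table. The attribute names "start", "end" and "href", the XPath-style values, and the example.org URL are all assumptions made for illustration; none of them is part of the schema under discussion.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "start", "end" and "href" are illustrative attribute
# names, not part of any agreed schema. The URL is a placeholder.
SAMPLE = """
<p>Be it enacted by the Senate and House<eol number="1"
     start="/p[1]/text()[1]" end="/p[1]/eol[1]"
     href="http://example.org/act.pdf#page=3"/>
of Representatives in Congress assembled<eol number="2"
     start="/p[1]/eol[1]" end="/p[1]/eol[2]"
     href="http://example.org/act.pdf#page=3"/></p>
"""


def line_table(xml_text):
    """Collect one row per <eol/> marker: line number, endpoints, and URL."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "number": eol.get("number"),
            "start": eol.get("start"),
            "end": eol.get("end"),
            "href": eol.get("href"),
        }
        for eol in root.iter("eol")
    ]


for row in line_table(SAMPLE):
    print(row["number"], row["start"], "->", row["end"], row["href"])
```

The rows produced this way are the same data a separate mapping document would hold, only stored inline on the markers.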

An advantage of the PDF is that there could also be an attribute that points to the URL fragment for the line in the document. Whether the PDF is just photographs of handwritten documents or generated from XML, this could be a great resource.

This same system would allow legal transcripts of audio or video to have similar beginning and end points embedded into the XML (timecode is similar to lines and pages, as it is an artifact of the physicality of the medium). Media fragment URLs are a standard that the W3C has been working on.
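
As a rough illustration of the timecode case, the sketch below builds a URL with a temporal fragment of the form #t=start,end (in seconds), the shape used by the W3C media fragments work. The function name and the example.org URL are assumptions; how such a URL would be attached to a transcript element is left open.

```python
def _seconds(value):
    """Format a number of seconds without trailing zeros (30.0 -> '30')."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def media_fragment(url, start_s, end_s):
    """Return url with a temporal fragment '#t=start,end' in seconds.

    Hypothetical helper for illustration only.
    """
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    return f"{url}#t={_seconds(start_s)},{_seconds(end_s)}"


# A passage of a hearing recording running from 12.5s to 30s.
print(media_fragment("http://example.org/hearing.mp4", 12.5, 30))
```

The start and end values play the same role for a recording that the beginning and end points of a line play for a page.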

Again, I will create figures that might help to understand this methodology and data structure. But it deals much more cleanly with the representation of documents, yet can work within the current solution proposed by Fabio.


Daniel Bennett
daniel@citizencontact.com





On 2/10/2014 11:01 AM, Fabio Vitali wrote:

Dear all,

there has been in the past week an on-and-off discussion on the semantics and syntax of elements <eop> and <eol>.

Currently the semantics of <eop> and <eol> are to identify the end of page (and respectively end of line) as markers, i.e., by their mere presence at a given position. Therefore, they are placed nearest the end of the page and line according to the reference copy in printed form, and there is the assumption that the corresponding beginning of page and line are immediately after the previous <eop> and <eol> elements.

Additionally, the syntax of <eop> and <eol> allows two attributes to be added. The first attribute is breakat, which specifies the number of characters within the next word at which the page (or line) actually breaks. This allows in-word breaks to happen even if the word is not actually broken in the XML. The second attribute is number, which specifies a page number or line number for the element, especially if we did not start at zero (which may happen if the document belongs to a container, e.g. a gazette or something).
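
A minimal sketch of how a reader of the XML might recover the printed lines from these markers, under one reading of the description above: the <eol> sits before the word that is split, and breakat counts how many characters of that word still belong to the line being closed. The sample sentence and the function name are assumptions, and only <eol> elements directly inside the paragraph are handled.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: line 12 of the printed copy ends inside the word
# "regulations", four characters in ("regu" / "lations").
SAMPLE = (
    '<p>An act to amend the <eol number="12" breakat="4"/>'
    'regulations on trade<eol number="13"/></p>'
)


def physical_lines(xml_text):
    """Split a paragraph's text into printed lines using <eol> markers."""
    root = ET.fromstring(xml_text)
    text = root.text or ""
    breaks = []
    for child in root:
        if child.tag == "eol":
            # The break falls breakat characters past the marker position.
            breaks.append(len(text) + int(child.get("breakat", "0")))
        text += child.tail or ""
    lines, start = [], 0
    for position in breaks:
        lines.append(text[start:position])
        start = position
    if start < len(text):
        lines.append(text[start:])  # text after the last marker
    return lines


print(physical_lines(SAMPLE))
```

With the sample above, the word stays whole in the XML while the first recovered line ends in "regu" and the second begins with "lations".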

Daniel Bennett has objected that a mechanism to identify the start of the page/line, as well as its end, is appropriate, and that for it one could use some form of indirect reference, such as the use of an XPointer or another pointer-like syntax to be determined.

The addition of yet another attribute is not a great complexity, and can be done quite easily. I am only perplexed about its need. The justification for such an attribute is for situations in which the beginning of the page/line is NOT immediately following the end of the previous page/line, but may happen somewhere else. I have not yet met such a situation. If there is a situation in which this happens, then I am more than happy to add the new attribute, but I would appreciate an example.

So if you do have such an example, please share it with the rest of the group, and we will make sure to have a new attribute in these elements.

Thanks and ciao

Fabio



--

Fabio Vitali                            Tiger got to hunt, bird got to fly,
Dept. of Computer Science        Man got to sit and wonder "Why, why, why?"
Univ. of Bologna  ITALY               Tiger got to sleep, bird got to land,
phone:  +39 051 2094872              Man got to tell himself he understand.
e-mail: fabio@cs.unibo.it         Kurt Vonnegut (1922-2007), "Cat's cradle"
http://vitali.web.cs.unibo.it/





