OASIS Mailing List Archives



legaldocml message


Subject: Re: [legaldocml] On <eop> and <eol>

There is a conundrum about using XML for documents in situations like this. Much of the XML in use serves structural purposes, much as HTML does. HTML5 added new structural elements that give semantic meaning to the structural (block) elements. Of course, HTML still has <div class="x"> for web publishers who want to create their own. What was not added to HTML was a way to number lines or pages. Of course, <br /> and/or CSS allow text to be displayed in a specific way.

However, legal documents have a tradition of explicit line and page numbers, and that use is critical. XML is a nested language, in which the semantic structural elements cannot be interleaved with elements that represent physical lines and pages. I have heard of efforts to come up with non-XML languages to deal with this. And there have been many efforts to sneak line and page numbers into XML.

I would like to step back and talk about the issue in terms of representing and relating the semantic and the physical. Right now the discussion is about the computer code level of the law versus the physical, often paper, and often official version of the law. I have heard some argue to just leave the physical version in the dustbin and move forward, but that is not possible for the law.

There is another systematic way to deal with this issue, one that will also answer some tough issues that have not even come up yet, like video and audio transcripts. I will create a slide show for an upcoming meeting (probably after the XML Prague conference, which is coming up), and I will describe it briefly in this email. Rather than breaking the current non-XML-friendly fix that Fabio proposes, I think we can leverage the <eop/> and <eol/> empty tags to do double duty.

Imagine a piece of paper with text. You can use line numbers to quickly refer to a specific line. This is natural. However, the line is an artifact of the physical nature of paper. With HTML, CSS media rules or just a change in the browser size can often change the soft breaks of a document's text. PDFs provide the verisimilitude that HTML and most semantically based XML formats cannot. What can be done is to map the physical to the semantic. Point A to point B is a line. Those points can be put in a table that records where each line maps to in the XML document. XPath allows a precise representation of each part of the document. However, nesting tags for both semantic and physical points breaks XML, and empty tags do not provide a true XML answer. Really there should be a mapping document to relate the two.
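To make the mapping-table idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Point` and `LineMapping` names, the sample `<act>` document, and the use of a character offset next to an ElementTree-style path. It only handles a line whose two points fall inside the same element.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Point:
    path: str    # ElementTree-style path to an element (hypothetical convention)
    offset: int  # character offset into that element's text


@dataclass(frozen=True)
class LineMapping:
    page: int
    line: int
    start: Point
    end: Point


# Hypothetical semantic document with no physical markup at all.
DOC = ET.fromstring(
    "<act><section><p>The Minister shall publish the "
    "notice in the official gazette.</p></section></act>"
)

# The separate "mapping document": one row per physical line.
MAPPINGS = [
    LineMapping(1, 1, Point("./section[1]/p[1]", 0), Point("./section[1]/p[1]", 31)),
    LineMapping(1, 2, Point("./section[1]/p[1]", 31), Point("./section[1]/p[1]", 62)),
]


def resolve(doc, mapping):
    """Return the text covered by a mapping whose two points share one element."""
    if mapping.start.path != mapping.end.path:
        raise NotImplementedError("cross-element ranges are left out of this sketch")
    element = doc.find(mapping.start.path)
    text = element.text or ""
    return text[mapping.start.offset:mapping.end.offset]


for row in MAPPINGS:
    print(row.page, row.line, repr(resolve(DOC, row)))
```

The point of the design is that the XML stays purely semantic, while the table carries every physical fact about the printed copy.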

For example, I can create PDF fragment URLs that show where the line physically is. I can also use XPath to create the two points (or more for columned text) that link to the physically represented URL. But rather than create a third document, the data can be stored in the HTML or XML document itself. Since <eol/> is the <br/> of the LegalDocML standard, and it is a quick fix for referring to a line's end, the <eol/> tag could also take additional attributes to hold the data for the beginning and end points of the physical line. This can work whether the physical copy is paper, PDF, or something else.

An advantage of the PDF is that there could also be an attribute that points to the URL fragment for the line in the document. Whether the PDF is just photographs of handwritten documents or is generated from XML, this could be a great resource.
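As a rough sketch of what such double-duty markers might look like, the Python below reads the extra data back out of <eol/> elements. The attribute names `start`, `end` and `href`, the "path#offset" pointer syntax, and the example.org URL are all invented for this illustration and are not part of any schema. The `#page=1` fragment is the ordinary page-level PDF open parameter; addressing a single line inside the PDF would need something finer that this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <eol/> still marks the end of a line, and also
# carries pointers to the start and end of that physical line plus a PDF URL.
SAMPLE = """<p>The Minister shall publish the<eol number="1"
    start="/act/section[1]/p[1]#0" end="/act/section[1]/p[1]#30"
    href="https://example.org/act.pdf#page=1"/> notice in the official gazette.<eol number="2"
    start="/act/section[1]/p[1]#31" end="/act/section[1]/p[1]#62"
    href="https://example.org/act.pdf#page=1"/></p>"""


def line_table(paragraph):
    """Collect the physical-line data stored on every <eol/> marker."""
    rows = []
    for eol in paragraph.iter("eol"):
        rows.append({
            "number": eol.get("number"),
            "start": eol.get("start"),
            "end": eol.get("end"),
            "href": eol.get("href"),
        })
    return rows


for row in line_table(ET.fromstring(SAMPLE)):
    print(row)
```

The rows that come out are the same kind of table a separate mapping document would hold, which is why no third document is needed.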

This same system would allow legal transcripts of audio or video to have similar beginning and end points embedded into the XML. (Timecode is similar to lines and pages, as it is an artifact of the physicality of the medium.) Media fragment URLs are a standard that the W3C has been working on.

Again, I will create figures that might help to understand this methodology and data structure. But it deals much more cleanly with the representation of documents, yet can work within the current solution proposed by Fabio.

Daniel Bennett

On 2/10/2014 11:01 AM, Fabio Vitali wrote:
Dear all,

there has been in the past week an on-and-off discussion on the semantics and syntax of elements <eop> and <eol>.

Currently the semantics of <eop> and <eol> are to identify the end of page (and respectively end of line) as markers, i.e., by their mere presence at a given position. Therefore, they are placed nearest the end of the page and line according to the reference copy in printed form, and there is the assumption that the corresponding beginning of page and line are immediately after the previous <eop> and <eol> elements.
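The marker semantics just described can be sketched in a few lines of Python. The sample paragraph is made up, it omits the namespace a real document would carry, and the sketch does not look for <eol/> markers nested inside inline elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the printed copy breaks after "shall" and after "the".
SAMPLE = ("<p>The Minister shall<eol/> publish the notice in the<eol/>"
          " official gazette.</p>")


def physical_lines(paragraph):
    """Return the text of each physical line, using <eol/> purely as an end marker."""
    lines = []
    current = paragraph.text or ""
    for child in paragraph:
        if child.tag == "eol":
            lines.append(current.strip())
            # The next line is assumed to begin right after the marker.
            current = child.tail or ""
        else:
            # Any other inline element: its text stays in the current line.
            current += "".join(child.itertext()) + (child.tail or "")
    if current.strip():
        lines.append(current.strip())
    return lines


print(physical_lines(ET.fromstring(SAMPLE)))
```

Note that the start of every line is never stored anywhere; it is inferred from the position of the previous marker, which is exactly the assumption under discussion.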

Additionally, the syntax of <eop> and <eol> allows two attributes to be added. The first attribute is breakat, which specifies the number of characters within the next word at which the page (or line) actually breaks. This allows in-word breaks to happen even if the word is not actually broken in the XML. The second attribute is number, which allows a page number or line number to be specified for the element, especially if we did not start at zero (which may happen if the document belongs to a container, e.g. a gazette or something).
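A minimal Python sketch of how a consumer might apply the two attributes is given below. It assumes that the word follows the marker with no whitespace in between, and that number labels the line that is ending; neither assumption is stated in this thread, and the sample sentence is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: printed line 12 ends three characters into "notification".
SAMPLE = ('<p>The Minister shall publish the '
          '<eol number="12" breakat="3"/>notification in the gazette.</p>')


def split_with_breakat(paragraph):
    """Return (number, text) pairs; number is None when no marker closes the line."""
    lines = []
    current = paragraph.text or ""
    for child in paragraph:
        if child.tag != "eol":
            current += "".join(child.itertext()) + (child.tail or "")
            continue
        tail = child.tail or ""
        offset = int(child.get("breakat", "0"))
        # The first `offset` characters of the following word still belong
        # to the line that is ending, even though the word is whole in the XML.
        current += tail[:offset]
        lines.append((child.get("number"), current))
        current = tail[offset:]
    if current:
        lines.append((None, current))
    return lines


for number, text in split_with_breakat(ET.fromstring(SAMPLE)):
    print(number, repr(text))
```

The word "notification" stays intact in the XML for searching and indexing, while the physical break inside it can still be recovered.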

Daniel Bennett has objected that a mechanism to identify the start of the page/line, as well as its end, is appropriate, and that for it one could use some form of indirect reference, such as the use of an XPointer or another pointer-like syntax to be determined.

The addition of yet another attribute is not a great complexity, and can be done quite easily. I am only perplexed about its need. The justification for such an attribute is for situations in which the beginning of the page/line does NOT immediately follow the end of the previous page/line, but may happen somewhere else. I have not yet met such a situation. If there is a situation in which this happens, then I am more than happy to add the new attribute, but I would appreciate an example.

So if you do have such an example, please share it with the rest of the group, and we will make sure to have a new attribute in these elements.

Thanks and ciao



Fabio Vitali                            Tiger got to hunt, bird got to fly,
Dept. of Computer Science        Man got to sit and wonder "Why, why, why?'
Univ. of Bologna  ITALY               Tiger got to sleep, bird got to land,
phone:  +39 051 2094872              Man got to tell himself he understand.
e-mail: fabio@cs.unibo.it         Kurt Vonnegut (1922-2007), "Cat's cradle"
