
legaldocml message


Subject: Re: [legaldocml] On <eop> and <eol>

here is a link to my slides explaining this email.







-----Original Message-----
From: "Daniel Bennett" <daniel@citizencontact.com>
Sent: Monday, February 10, 2014 11:44am
To: "Fabio Vitali" <fabio@cs.unibo.it>, legaldocml@lists.oasis-open.org, akomantoso-xml@googlegroups.com
Subject: Re: [legaldocml] On <eop> and <eol>

There is a conundrum about the use of XML for documents in these types
of situations. Much of the XML being used is for structural purposes,
much like HTML. HTML5 added some new structural elements to give
semantic meaning to the structural (block) elements. Of course, HTML
still has <div class="x"> so that web publishers can create their
own. Something that was not added to HTML was a way to number lines or
pages. Of course, <br /> and/or CSS allow text to be displayed
in a specific way.

However, legal documents have a tradition of explicit line and page
numbers, and their use is critical. And XML is a nested language,
where the semantic structural elements cannot overlap with
physical representations of lines and pages. I have heard of efforts to
come up with non-XML languages to deal with this. And there have been
many efforts to sneak line and page numbers into XML.
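
The nesting problem can be seen in a minimal sketch. The element names
<section> and <line> below are hypothetical and chosen only for
illustration: a physical line that starts in one semantic block and ends
in the next cannot be expressed as a container element, while an empty
marker such as <eol/> parses without trouble.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <section> is semantic, <line> is physical.
# The physical line starts in the first section and ends in the second.
OVERLAPPING = (
    "<doc><section><line>first part</section>"
    "<section>second part</line></section></doc>"
)

# The same content with an empty end-of-line marker instead of a container.
MILESTONE = (
    "<doc><section>first part</section>"
    "<section>second part<eol/></section></doc>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(OVERLAPPING))  # overlapping containers are rejected
print(is_well_formed(MILESTONE))    # the empty marker is accepted
```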

I would like to step back and talk about the issue in terms of
representing and relating the semantic and the physical. Right now the
discussion is about the computer code level of the law versus the
physical, often paper, and often official version of the law. I have
heard some argue to just leave the physical version in the dustbin and
move forward, but that is not possible for the law.

There is another systematic way to deal with this issue that also will
answer some tough issues that have not even come up, like video and
audio transcripts. And I will create a slide show for an upcoming
meeting (probably after the XML Prague conference which is coming up). I
will describe it briefly in this email. And rather than breaking the
current non-XML friendly fix that Fabio proposes, I think we can
leverage <eop/> and <eol/> empty tags to do double duty.

Imagine a piece of paper with text. You can use line numbers to quickly
refer to a specific line. This is natural. However, the line is an
artifact of the physical nature of paper. With HTML, CSS media rules or
just resizing the browser can change the soft breaks of a
document's text. PDFs provide the verisimilitude that HTML and most
semantics-based XML vocabularies cannot. What can be done is to map the
physical to the semantic. Point A to point B is a line. And those points
can be put in a table with representations of where that line maps to in
the XML document. XPath allows a precise representation of each part of
the document. However, nesting tags for both semantic and physical points
breaks XML. And empty tags do not provide a true XML answer. Really,
there should be a mapping document to relate the two.
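
One way such a mapping table might look is sketched below. Everything in
it is an assumption made for illustration: the sample document, the
LineMapping record, and the use of a path plus a character offset for
each point. The paths use only the limited XPath subset that Python's
ElementTree supports, and the lookup assumes a line spans at most two
adjacent elements.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical semantic document; element names are illustrative only.
DOC = """<act>
  <section id="s1"><p>No person shall be deprived of life,</p></section>
  <section id="s2"><p>liberty, or property without due process.</p></section>
</act>"""


@dataclass
class LineMapping:
    page: int
    line: int
    start_path: str    # element holding the first character of the line
    start_offset: int  # character offset within that element's text
    end_path: str      # element holding the last character of the line
    end_offset: int    # exclusive character offset within that element


# Physical line 1 starts in section s1 and ends inside section s2.
MAPPINGS = [
    LineMapping(1, 1, "./section[@id='s1']/p", 0, "./section[@id='s2']/p", 8),
    LineMapping(1, 2, "./section[@id='s2']/p", 9, "./section[@id='s2']/p", 41),
]


def line_text(root, m):
    """Recover the text of one physical line from the semantic document."""
    start = root.find(m.start_path)
    end = root.find(m.end_path)
    if start is end:
        return start.text[m.start_offset:m.end_offset]
    # Simplification: the two elements are assumed to be adjacent.
    return start.text[m.start_offset:] + " " + end.text[:m.end_offset]


ROOT = ET.fromstring(DOC)
for m in MAPPINGS:
    print(m.page, m.line, line_text(ROOT, m))
```

The semantic document stays untouched here; the physical layout lives
entirely in the table, which could be serialized as a separate file.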

For example, I can create PDF fragment URLs that show where the line
physically is. I can also use XPath to create the two points
(or more for columned text) that link to the physically represented
URL. But rather than creating a third document, the data can be stored
in the HTML or XML document itself. Since <eol/> is the <br/> of the
LegalDocML standard, and it is a quick fix for referring to a line's end,
the <eol/> tag could also take additional attributes to hold the data for
the beginning and end points of the physical line. This can work whether
or not the physical copy is paper or PDF.
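
As a rough sketch of what storing the points on <eol/> could look like:
the number attribute comes from the current proposal, while startref,
startoffset, endref and endoffset are hypothetical names invented here,
not part of any schema. The function shows that the mapping table can
still be derived from the inline attributes when it is needed.

```python
import xml.etree.ElementTree as ET

# "number" comes from the current proposal; startref, startoffset, endref
# and endoffset are hypothetical attribute names invented for this sketch.
FRAGMENT = (
    '<p id="p1">No person shall be deprived of life,'
    '<eol number="1" startref="p1" startoffset="0" endref="p1" endoffset="36"/>'
    ' liberty, or property.</p>'
)


def extract_mappings(root):
    """Collect the start and end points carried by every <eol/> element."""
    rows = []
    for eol in root.iter("eol"):
        rows.append({
            "number": int(eol.get("number")),
            "start": (eol.get("startref"), int(eol.get("startoffset"))),
            "end": (eol.get("endref"), int(eol.get("endoffset"))),
        })
    return rows


ROOT = ET.fromstring(FRAGMENT)
ROWS = extract_mappings(ROOT)
print(ROWS)
```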

An advantage of the PDF is that there could also be an attribute that
points to the URL fragment for the line in the document. Whether the PDF
is just photographs of handwritten documents or generated from XML, this
could be a great resource.
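
A sketch of building such a URL follows. It assumes the page and viewrect
fragment parameters registered for application/pdf; viewer support for
them varies, and the file URL and coordinates are made up. The helper
name pdf_line_url is hypothetical.

```python
def pdf_line_url(pdf_url, page, rect=None):
    """Build a PDF fragment URL for a page, optionally with a view rectangle.

    rect is (left, top, width, height) in the PDF's coordinate space.
    How a viewer honors these parameters is outside this sketch.
    """
    fragment = f"page={page}"
    if rect is not None:
        fragment += "&viewrect=" + ",".join(str(v) for v in rect)
    return f"{pdf_url}#{fragment}"


# Hypothetical document URL and line position.
print(pdf_line_url("https://example.org/act.pdf", 3))
print(pdf_line_url("https://example.org/act.pdf", 3, (72, 120, 468, 14)))
```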

This same system would allow legal transcripts of audio or video to
have similar beginning and end points embedded in the XML. (Timecode
is similar to lines and pages, as it is an artifact of the physicality
of the medium.) Media Fragments URI is a standard that the W3C has been
working on.
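
For transcripts, the begin and end points could be expressed as a
temporal media fragment of the form #t=start,end, with times in seconds.
The sketch below only builds the URL; the helper names and the recording
URL are hypothetical.

```python
def npt(seconds):
    """Format seconds compactly, e.g. 75 -> '75' and 81.5 -> '81.5'."""
    return f"{seconds:.3f}".rstrip("0").rstrip(".")


def transcript_fragment(media_url, start_s, end_s):
    """Build a temporal media fragment URL covering start_s to end_s."""
    if end_s <= start_s:
        raise ValueError("end must be after start")
    return f"{media_url}#t={npt(start_s)},{npt(end_s)}"


# Hypothetical recording: one transcript passage from 75 s to 81.5 s.
print(transcript_fragment("https://example.org/hearing.mp4", 75, 81.5))
```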

Again, I will create figures that might help to understand this
methodology and data structure. But it deals much more cleanly with the
representation of documents, yet can work within the current solution
proposed by Fabio.

Daniel Bennett

On 2/10/2014 11:01 AM, Fabio Vitali wrote:
> Dear all,
> there has been in the past week an on-and-off discussion on the semantics and syntax of elements <eop> and <eol>.
> Currently the semantics of <eop> and <eol> are to identify the end of page (and respectively end of line) as markers, i.e., by their mere presence at a given position. Therefore, they are placed nearest the end of the page and line according to the reference copy in printed form, and there is the assumption that the corresponding beginning of page and line are immediately after the previous <eop> and <eol> elements.
> Additionally, the syntax of <eop> and <eol> allows two attributes to be added. The first is breakat, which specifies the number of characters within the next word at which the page (or line) actually breaks. This allows in-word breaks to happen even if the word is not actually broken in the XML. The second is number, which specifies a page number or line number for the element, especially if we did not start at zero (which may happen if the document belongs to a container, e.g. a gazette or something).
> Daniel Bennett has objected that a mechanism to identify the start of the page/line, as well as its end, is appropriate, and that for it one could use some form of indirect reference, such as the use of an XPointer or another pointer-like syntax to be determined.
> The addition of yet another attribute is not a great complexity, and can be done quite easily. I am only perplexed about its need. The justification for such an attribute is for situations in which the beginning of the page/line is NOT immediately following the end of the previous page/line, but may happen somewhere else. I have not yet met such a situation. If there is a situation in which this happens, then I am more than happy to add the new attribute, but I would appreciate an example.
> So if you do have such an example, please share it with the rest of the group, and we will make sure to have a new attribute in these elements.
> Thanks and ciao
> Fabio
> --
> Fabio Vitali
> Dept. of Computer Science
> Univ. of Bologna ITALY
> phone: +39 051 2094872
> e-mail: fabio@cs.unibo.it
> http://vitali.web.cs.unibo.it/
>
> Tiger got to hunt, bird got to fly,
> Man got to sit and wonder "Why, why, why?"
> Tiger got to sleep, bird got to land,
> Man got to tell himself he understand.
> Kurt Vonnegut (1922-2007), "Cat's cradle"
> ---------------------------------------------------------------------
> To unsubscribe from this mail list, you must leave the OASIS TC that
> generates this mail. Follow this link to all your TCs in OASIS at:
> https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
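
The breakat and number attributes described in the quoted message can be
illustrated with a small sketch that rebuilds the physical lines from
the markers. It rests on assumptions: the <eol/> elements are direct
children of <p>, number labels the line that ends at the marker, and
the sample sentence is made up.

```python
import xml.etree.ElementTree as ET

# The word "process" is unbroken in the XML, but the printed line
# breaks after its first three characters (breakat="3").
SAMPLE = (
    '<p>without due <eol number="7" breakat="3"/>process of law '
    '<eol number="8"/>nor shall</p>'
)


def physical_lines(p):
    """Rebuild (number, text) pairs for the physical lines inside <p>."""
    lines = []
    current = p.text or ""
    for eol in p.iter("eol"):
        tail = eol.tail or ""
        k = int(eol.get("breakat", "0"))
        # The first k characters of the next word belong to the ending line.
        current += tail[:k]
        lines.append((eol.get("number"), current.rstrip()))
        current = tail[k:]
    if current:
        # Trailing text whose end-of-line marker has not appeared yet.
        lines.append((None, current.rstrip()))
    return lines


LINES = physical_lines(ET.fromstring(SAMPLE))
for number, text in LINES:
    print(number, repr(text))
```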
