Fwd: Call for Participation in the ParlaFormat Workshop

Begin forwarded message:

From: Fabio Vitali <fabio.vitali@UNIBO.IT>
Subject: Re: Call for Participation in the ParlaFormat Workshop
Date: 19 February 2019 19:54:14 CET
To: <TEI-L@LISTSERV.BROWN.EDU>, "Tomaž Erjavec" <tomaz.erjavec@IJS.SI>
Cc: Fabio Vitali <fabio.vitali@UNIBO.IT>

Dear Tomaž,

I do not usually follow the TEI lists, even though I'm familiar with the vocabulary and somewhat with its community. I was pointed to this thread by list participants who thought I might be interested in commenting. And indeed I am.

My name is Fabio Vitali and I am one of the main technical contributors to Akoma Ntoso, the OASIS standard for the XML representation of legal and legislative documents including acts, bills, parliamentary documents such as debate reports and hansards and judiciary documents such as judgments. This mail is both personal and on behalf of the OASIS LegalDocML Technical Committee, which I co-chair.

The main gist of my message is to tell you that the Akoma Ntoso community has read about your effort in digitizing the debate reports of the Slovenian parliament and has found it commendable, interesting and well thought out. Nonetheless, we strongly suggest that you reconsider your decision to base your digitization effort on TEI and refocus your choices around the more specific features available within Akoma Ntoso.

I wrote a fairly long and detailed justification for this opinion in the rest of the message, but to keep it short and to the point: I believe there is very little sense in having a separate community for parliamentary debates arise and create separate and incompatible document sets, based on different vocabularies and sensibilities, that will never have a chance to interact with and be compared to the ones created by the Akoma Ntoso community. It would be much more fruitful, and cheaper, to create bidirectional scripts and converters that allow TEI-based scholars to access basic Akoma Ntoso documents and use their tools on them, a task that any TEI expert will find refreshingly simple and straightforward if he/she ever manages to look into Akoma Ntoso a little deeper.

---

For those who want to go further than the TL;DR above, here is my full answer.

You mention three reasons not to adopt Akoma Ntoso. I will first object to them:

a) CLARIN centres offer a lot of different types of corpora, and it would be nice
(we can always dream) if they all could be encoded to a common schema,
rather than a different one for each type of text,

Agreed. And for the specific needs of the scholarly tasks for which CLARIN is rightly famous, a TEI-based encoding could be considered appropriate and adequate. Yet legal and legislative documents have a number of fairly different uses and requirements, which need careful consideration and support, and for which a naive and superficial TEI representation may not suffice. This is not to say that TEI cannot represent these requirements fully (I know it can!), but that you would have to study them carefully and across countries, traditions and habits, and come up with tweaks and adaptations of the basic TEI vocabulary in order to express them, thereby generating a smaller community of practice in which a locally known sub-dialect of TEI is the required competency for participation. Scholars wishing to enter such a community of practice will be able to employ their generic TEI competencies only for the first, minimal aspects of the encoding, and will then have to learn and understand a large number of ad-hoc choices and adaptations of the basic TEI language, an effort that is probably on a par with learning Akoma Ntoso. Given the sheer complexity of the semantics of the documents, I am sure that a complicated sub-dialect is probably unavoidable.

b) such corpora are
typically linguistically annotated (PoS tagging, lemmatisation, NER,
maybe syntax) and, as far as I know, AN does not make provisions for
such annotation, whereas TEI of course does,

Named-entity recognition has been part of the Akoma Ntoso vocabulary from day one, as it is extremely important to be able to identify precisely the individuals, organizations and concepts being mentioned in legal and legislative documents. In fact, AKN adds to that the temporal and jurisdictional contextualization of documents, both through temporal annotations within the representation of speech (element <recordedTime>) and through the use of FRBR as the conceptual model to group together families of different and independent text flows that differ in time, language or content, which has been one of the main strengths of Akoma Ntoso from the beginning. Thus the same document can be presented in Akoma Ntoso under different versions, language variants, omissions and anonymizations, etc. and be transparently used according to needs, desired language, access rights and time of visit.

Grammatical characterization of the sentences (PoS, lemmatisation, syntax) was, on the other hand, not among the requirements of Akoma Ntoso, but AKN is sufficiently extensible to allow for it. Consider your example in section 3.2 of http://lrec-conf.org/workshops/lrec2018/W2/pdf/4_W2.pdf :

<s>
<w lemma="2." ana="msd:Mdo">2.</w><c> </c>
<w lemma="verifikacija" ana="msd:Ncfsn">Verifikacija</w>
<c> </c>
<w lemma="mandat" ana="msd:Ncmsg">mandata</w>
<c> </c>
<w lemma="v" ana="msd:Sl">v</w><c> </c>
<w lemma="zbor" ana="msd:Ncmsl">zboru</w>
<pc ana="msd:Z">.</pc>
</s>

This can be rendered in Akoma Ntoso simply as:

<inline name="s">
<inline name="w" lemma="2." ana="msd:Mdo">2.</inline>
<inline name="c"> </inline>
<inline name="w" lemma="verifikacija" ana="msd:Ncfsn">Verifikacija</inline>
<inline name="c"> </inline>
<inline name="w" lemma="mandat" ana="msd:Ncmsg">mandata</inline>
<inline name="c"> </inline>
<inline name="w" lemma="v" ana="msd:Sl">v</inline>
<inline name="c"> </inline>
<inline name="w" lemma="zbor" ana="msd:Ncmsl">zboru</inline>
<inline name="pc" ana="msd:Z">.</inline>
</inline>

providing you with the exact semantics and the same precision needed for your linguistic applications, without losing track of the legal and juridical characterization that Akoma Ntoso was designed for, and that you'd have to find a special adaptation for when using plain TEI.
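
To give an idea of how small such a converter can be, here is a minimal sketch in Python of the TEI-to-AKN token mapping shown above. It is only an illustration under stated assumptions: namespaces are ignored, the function name tei_to_akn_inline and the shortened sample are mine, and the output is not checked against the Akoma Ntoso schema.

```python
import xml.etree.ElementTree as ET

# TEI token-level elements covered by this sketch (assumption: no namespaces).
TEI_TOKEN_TAGS = {"s", "w", "c", "pc"}


def tei_to_akn_inline(tei_elem):
    """Recursively turn a TEI <s>/<w>/<c>/<pc> element into <inline name="...">.

    The TEI element name moves into @name; all attributes (lemma, ana, ...)
    and all text are copied unchanged, as in the hand-written example above.
    """
    if tei_elem.tag not in TEI_TOKEN_TAGS:
        raise ValueError(f"unsupported TEI element: {tei_elem.tag}")
    akn = ET.Element("inline", {"name": tei_elem.tag})
    for key, value in tei_elem.attrib.items():
        akn.set(key, value)
    akn.text = tei_elem.text
    akn.tail = tei_elem.tail
    for child in tei_elem:
        akn.append(tei_to_akn_inline(child))
    return akn


# Shortened version of the sentence quoted above.
TEI_SAMPLE = (
    '<s><w lemma="mandat" ana="msd:Ncmsg">mandata</w>'
    '<c> </c><pc ana="msd:Z">.</pc></s>'
)

akn_sentence = tei_to_akn_inline(ET.fromstring(TEI_SAMPLE))
print(ET.tostring(akn_sentence, encoding="unicode"))
```

The reverse direction would be the same loop with the roles of the element name and @name swapped.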

It is also worth mentioning that in Akoma Ntoso the "<w lemma="2." ana="msd:Mdo">2.</w><c> </c>" you placed at the beginning would look fairly strange and suspicious, and would be considered neither appropriate nor part of the following sentence in any form. Most probably (if I understand the meaning correctly), you would be required to characterize it structurally as the heading number of a section titled "Verifikacija mandata v zboru" ("verification of the assembly's mandate"). In Akoma Ntoso we would therefore mark it as follows:

<proceduralMotions eId="XXX">
<debateSection name="VerificationOfMandate" refersTo="#verificationOfMandate" eId="YYY">
  <num>2.</num>
  <heading>Verifikacija mandata v zboru.</heading>
  <speech> ... </speech>
</debateSection>
</proceduralMotions>

Thus a much more sensible and appropriate markup, including semantic and structural as well as grammatical characterization, would be something like:

<proceduralMotions eId="XXX">
<debateSection name="VerificationOfMandate" refersTo="#verificationOfMandate" eId="YYY">
  <num>2.</num>
  <heading>
    <inline name="s">
      <inline name="w" lemma="verifikacija" refersTo="#msd:Ncfsn">Verifikacija</inline>
      <inline name="c"> </inline>
      <inline name="w" lemma="mandat" refersTo="#msd:Ncmsg">mandata</inline>
      <inline name="c"> </inline>
      <inline name="w" lemma="v" refersTo="#msd:Sl">v</inline>
      <inline name="c"> </inline>
      <inline name="w" lemma="zbor" refersTo="#msd:Ncmsl">zboru</inline>
      <inline name="pc" refersTo="#msd:Z">.</inline>
    </inline>
  </heading>
  <speech> ... </speech>
</debateSection>
</proceduralMotions>

As you see, in a few lines we have examples of the semantico-structural organization of the document (<proceduralMotions>, <debateSection>), of the purely structural one (<num>, <heading>, <speech>), and of the NER model (the @refersTo attributes pointing to Top Level Class instances in the <references> section, which in turn point to external entities in your favorite ontologies). This shows how all levels can coexist and create a reasonably consistent markup.
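
The two-step indirection of @refersTo (element, then Top Level Class entry, then external entity) can be followed mechanically. The Python sketch below is only an illustration: the fragment is stripped of namespaces and of most mandatory metadata, and the TLCConcept entry with its href value is a hypothetical example, not taken from any real document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment (no namespace, minimal metadata).
# The TLCConcept entry and its href are illustrative values only.
AKN_SAMPLE = """
<debate>
  <meta>
    <references source="#editor">
      <TLCConcept eId="verificationOfMandate"
                  href="/akn/ontology/concept/verificationOfMandate"
                  showAs="Verification of mandate"/>
    </references>
  </meta>
  <debateBody>
    <debateSection name="VerificationOfMandate"
                   refersTo="#verificationOfMandate" eId="YYY">
      <num>2.</num>
      <heading>Verifikacija mandata v zboru.</heading>
    </debateSection>
  </debateBody>
</debate>
"""


def resolve_refers_to(root):
    """Return (element name, @refersTo, href of the TLC entry) for each reference.

    The href is None when no entry with a matching eId exists in <references>.
    """
    references = root.find("./meta/references")
    tlc_by_id = {}
    if references is not None:
        tlc_by_id = {entry.get("eId"): entry for entry in references}
    resolved = []
    for elem in root.iter():
        ref = elem.get("refersTo")
        if ref is None:
            continue
        target = tlc_by_id.get(ref.lstrip("#"))
        href = target.get("href") if target is not None else None
        resolved.append((elem.tag, ref, href))
    return resolved


root = ET.fromstring(AKN_SAMPLE)
for tag, ref, href in resolve_refers_to(root):
    print(tag, ref, "->", href)
```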

An additional note: the speech model of plain TEI, being designed for theatrical plays, is appropriate but incomplete when it comes to managing the full complexity of identifying speakers in a debate in a legal assembly. Akoma Ntoso, in addition to a speaker and an addressee, allows you to specify the role under which the speaker speaks.

Two examples: the role of Chair is passed from individual to individual during a long debate session, thus the annotations marked "Chair" need to be attributed over time to different individuals (with completely different manners, sensibility, vocabulary, objectives, etc.). Similarly, during a Question/Answer session to members of the Executive Power, the Question will be posed to the Ministry, but the Answer might be given by some representative of the Ministry, so it is important to record both the identity of the individual providing the Answer, as well as the mandate and authority he/she is speaking under.

Both the individual and his/her role must be recorded. Simply recording either will leave an incomplete and ultimately misleading markup.
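
As a rough illustration of why both must be kept, the Python sketch below builds a few speeches carrying the individual in @by and the role in @as, then lists who held each role. The helper names and all identifiers (personA, chair, and so on) are hypothetical, and the markup is reduced to the bare minimum.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def make_speech(person_id, role_id, text):
    """Build a minimal <speech>: @by names the individual, @as the role."""
    speech = ET.Element("speech", {"by": f"#{person_id}", "as": f"#{role_id}"})
    ET.SubElement(speech, "p").text = text
    return speech


def speakers_per_role(speeches):
    """Group the individuals (in order of first appearance) under each role."""
    roles = defaultdict(list)
    for speech in speeches:
        person = speech.get("by")
        if person not in roles[speech.get("as")]:
            roles[speech.get("as")].append(person)
    return dict(roles)


# Hypothetical sitting: the Chair changes hands, and a representative
# answers on behalf of the Ministry.
speeches = [
    make_speech("personA", "chair", "The sitting is open."),
    make_speech("personB", "chair", "The sitting is resumed."),
    make_speech("personC", "ministryRepresentative", "I answer for the Ministry."),
]

print(speakers_per_role(speeches))
```

Recording only the role would merge personA and personB into a single "Chair"; recording only the individual would lose the authority under which personC answers.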

And do not forget votes, the most important information to come out of debate reports, and their correct attribution (if and when possible) to the voters who cast them, in order to generate the fundamental analyses of these reports: the behaviour and voting record of each MP, and their alignment or disagreement with their party, with their constituents, with their jurisdiction's interests. Akoma Ntoso provides specific and detailed mechanisms to record and evaluate this kind of information, and missing them would leave the largest and most important piece of information out of the XML encoding.

and, maybe, c) the
parliamentary transcripts for many countries still have to be obtained
by scraping them from the web, say in HTML or PDF, meaning that only
very basic structure can be automatically inserted into the document; I
understand that AN does make provisions to encode only core elements,
still, even that might be too much to expect from such conversions -
however, I could be wrong here.

These are a couple of sentences I very much disagree with. If the whole point of this exercise is to generate a TEI document completely equivalent to the original one in HTML or PDF, I would very much prefer staying with the HTML or PDF: printing and editing them would be much easier and the potential audience much wider. Turning documents into XML makes sense if they become the recipients of a much richer, more sophisticated and more appropriate set of semantic, structural and descriptive data. Scraping thus needs to be only a first step, and further processes need to be put in place to enrich the data, either automatically or with the help of domain experts. And if we agree on this, then we must also agree that the domain experts and the automatic processes must be able to find within the vocabulary everything they may want to express, and therefore a rich domain-specific vocabulary is preferable to a general-purpose one. This is exactly the reason we switch to TEI for, say, medieval manuscripts instead of sticking to HTML.

In fact, I am convinced that the same reasons you cite in your documents for preferring TEI over Akoma Ntoso for legal and legislative documents could be used almost word by word for preferring HTML over TEI for scholarly documents. This is ground for some deep reflections!

--

One additional, but important aspect. You seem interested in converting individual documents into some form of XML in order to do some automatic processing on their content. While this is a noteworthy task and objective, there is more to this that needs to be covered for a really useful and appropriate encoding of important documents such as parliamentary debates. Legal and legislative documents differ from traditional scholarly documents in a few relevant aspects.

First: they are live documents, i.e. documents that ARE BEING CREATED RIGHT NOW and that undergo complex and important transformations during their useful life. Bills evolve rapidly, acts are modified, debate reports have parts rephrased or removed, judgments have parts anonymized, etc. You have to deal with their continuous evolution and change in a conscious and controlled way.

Second: they are documents WITH POWER, i.e. documents whose mere existence affects directly the life and choices of thousands of people under their jurisdiction. Parliaments create laws, not rules. Citizens follow laws, not governments. Judges emit judgments, not decisions. Strangely enough, it is not governments that have power, but the documents they create have all the power. Thus recording what documents say, and how they evolved, and how they relate to the powers that gave them existence, must be done with precision and care, not because of some abstract idea of scholarly sophistication, but because, literally, lives can be affected and influenced by how we do the work.

Finally, these documents do not exist in isolation, but are heavily interconnected in a deep weave of implicit and explicit references. Debate reports record questions that were explicitly tabled in previous sessions and answered formally by members of the government based on references to previous discussions, tabled documents, legal references, etc. Debate reports discuss bills that evolve through the impact of other bills, of amendments formally tabled at the Chair, and of oral discussions during the live debates. These bills will then become acts that will enter the legal system of the country and impact the actions of the executive power and the judgments of the judiciary power. Interestingly, MOST OF THE LAWS discussed in these reports do not generate FRESH acts, but end up as MODIFICATIONS to existing enacted legislation. Consider a debate about a formally tabled amendment for a change in the phrasing of a bill containing modifications to an enacted law: this is a discussion about a proposed modification to a proposed modification of a real document. Thus any debate report from the parliament mentions a huge number of other documents and provides for their existence, their modifications, their convergence and merging with other documents, their archival and deletion. Recording the exact references to the correct versions and variants of these documents is a fundamental activity of any XML representation of reports, and deciding how to express references is no trivial task (and NO, looking up the URLs of the corresponding bills as published on the parliament web site is DEFINITELY NOT ENOUGH!!!)

This justifies why Akoma Ntoso is NOT just an XML vocabulary, but a combination of three different tools to be used to represent the encoding of whole local, national and international Legal Systems, composed of legal and legislative documents owned by all three powers of a state (executive, legislative and judiciary), and to provide a reliable and trustworthy ecosystem of interconnected, live, powerful documents:

1) An XML vocabulary to encode documents. These documents include drafts (bills under discussion by an assembly), acts (enacted laws having power over a jurisdiction), working documents of the drafting process (debate reports, hansards, orders of the day, amendments, amendment lists, questions and answers, etc.) and working documents of the enacted documents (official publications, official gazettes, modification acts, errata corrige, as well as judgments by the courts and, soon, court documents of all kinds).

2) A naming convention to provide TIME-, LANGUAGE-, CONTENT- and PROVENANCE-specific URIs for such documents, such that references can be made to time-specific or time-independent, language-specific or language-independent, content-specific or content-independent, provenance-specific or provenance-independent versions of legal documents, so as to fully represent those nuances in references that lawyers consider totally obvious and natural but that plain web URLs of government web sites simply cannot even begin to support. This is where FRBR is crucial and heavily used.

3) The Top Level Classes, a mechanism to associate pieces of information (either in the text content or in the metadata blocks) to classes, individuals and properties of your favorite ontologies, in a model that resembles and subsumes Named-Entity-Recognition but brings it much further on into a complete and bidirectional relationship between text-oriented documents and fact-oriented collections of semantical data.
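
To make the FRBR layering behind point 2) more concrete, here is a deliberately simplified Python sketch of how time- and language-independent (Work), time- and language-specific (Expression) and format-specific (Manifestation) identifiers can be derived from one another. The class, its fields, the sample values and the exact URI patterns are my own simplification for illustration; the normative syntax is the one defined in the Akoma Ntoso naming convention and should be taken from there.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentRef:
    """Simplified, hypothetical model of a FRBR-layered document reference."""
    country: str
    doc_type: str
    date: str
    language: Optional[str] = None      # needed from the Expression level on
    version_date: Optional[str] = None  # omitted means "no specific version"
    fmt: Optional[str] = None           # needed at the Manifestation level

    def work(self) -> str:
        # Independent of language, version and format.
        return f"/akn/{self.country}/{self.doc_type}/{self.date}"

    def expression(self) -> str:
        # A specific language variant and (optionally) a specific version.
        if self.language is None:
            raise ValueError("an Expression reference needs a language")
        return f"{self.work()}/{self.language}@{self.version_date or ''}"

    def manifestation(self) -> str:
        # A specific data format of a specific Expression.
        if self.fmt is None:
            raise ValueError("a Manifestation reference needs a format")
        return f"{self.expression()}.{self.fmt}"


# Hypothetical debate report; all values are illustrative only.
ref = DocumentRef("si", "debate", "2014-08-01",
                  language="slv", version_date="2014-08-05", fmt="xml")
print(ref.work())
print(ref.expression())
print(ref.manifestation())
```

A citation that must follow every future amendment would use the Work-level form, while a citation to the text as it stood on a given day would use the Expression-level form.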

All three elements must be understood and used in order to provide a decent and faithful representation of the legal, historical and social aspects of legal and legislative documents. Correctly generating not only the XML but also the interconnections between documents is necessary to track the connections between enacted laws and their preparatory documents, such as bills, debate reports and orders of the day. This is a much needed mechanism for reconstructing the whole legislative process, and it is fundamental to increasing transparency, accountability and citizens' participation in democracy.

Simply representing their structure (or grammatical constructs) is just the beginning of the work.

The Akoma Ntoso set of standards is now more than twelve years old and derives directly from a number of previous national (Italy, California, etc.) and international (e.g. UN, EU, etc.) initiatives within local parliaments and courts of justice dating back to the late nineties. During the development of these technologies, the specific characteristics of more than twenty different local juridical and legislative traditions from four continents have been thoroughly examined and tested on several different aspects, most of which ended up influencing several subtle aspects of the language. Akoma Ntoso is now adopted by numerous parliaments and international normative bodies, first among them the UN and the EU, and their decision to adopt Akoma Ntoso rather than other XML vocabularies has much to do with its richness and completeness.

This effort was not done in isolation: we used a large number of antecedent languages as sources of inspiration for the development of Akoma Ntoso, and thoroughly considered many of their features for our purposes, and guess what, TEI was of course one of the main ones. Many of the features you love in TEI are already available within Akoma Ntoso with little effort. I was even part of the TEI editorial group for a short time, although I do not know what became of the parts that I co-wrote with David Durand about standoff markup. TEI is not a foe of Akoma Ntoso, but a close ally and an important source of inspiration.

But it is important not to deflate everything down to TEI, and thus your sentence

At the very least it would mean
simple import of already encoded AN materials into TEI.

I would rather turn it into the opposite: it is praiseworthy to have legal and legislative documents expressed in TEI, but it is about time to take the next step and convert them into Akoma Ntoso, which is where they naturally belong.

A last thought: if there is a problem for Akoma Ntoso, it is the current lack of expert scholars and computer scientists who can navigate the complexities of semantic and structural capture, XML representation, XSLT drafting, XML Schema customization, and so on. TEI experts already master most of what is needed to become proficient Akoma Ntoso experts, and would just need a minor introductory course on the specifics of the legal and legislative domain. There are jobs, projects and some money within the Akoma Ntoso world that TEI experts could decide to consider.

Every year in Ravenna (Italy) in September we have a summer school on legislative XML with a lot of time devoted to Akoma Ntoso and subtleties of the domain. TEI experts would find themselves right at home there and would be able to make the switch, I am exaggerating but not much, in a few hours. Akoma Ntoso is surprisingly easy to understand for a TEI expert.

Please consider the idea of gaining expertise and credibility in not just one XML vocabulary, but in two closely connected yet visibly different ones. We long to have new experts already competent in document representation and XML-based tools, and TEI experts would perfectly fit the requirements.

Best regards and hope to see you in Ravenna

Fabio Vitali

--

Dear Gioele, Andreas,

thanks for your comments and the references. I guess the most honest
answer as to why we decided to use TEI rather than Akoma Ntoso is that
we know and love TEI :). But somewhat more objectively, a) CLARIN
centres offer a lot of different types of corpora, and it would be nice
(we can always dream) if they all could be encoded to a common schema,
rather than a different one for each type of text, b) such corpora are
typically linguistically annotated (PoS tagging, lemmatisation, NER,
maybe syntax) and, as far as I know, AN does not make provisions for
such annotation, whereas TEI of course does, and, maybe, c) the
parliamentary transcripts for many countries still have to be obtained
by scraping them from the web, say in HTML or PDF, meaning that only
very basic structure can be automatically inserted into the document; I
understand that AN does make provisions to encode only core elements,
still, even that might be too much to expect from such conversions -
however, I could be wrong here. But this is not to say that AN is
irrelevant to our proposal and I completely agree that it would be great
to have cross-walks between the two. At the very least it would mean
simple import of already encoded AN materials into TEI.

Also, the workshop is very much meant as a forum to gather opinions on
the suitability of the TEI proposal and how - and if! - to develop it
further and we are looking forward to participants that have possibly
diverging views on how to go about it. Already we know that a part of
the community is very much in favour of using RDF to encode
parliamentary data, which I see much more problematic than TEI vs. AN
(and so was very happy to read the recent mails on this list by
Christian Chiarcos and others on TEI vs. RDF).

Fabio Vitali
Dept. of Informatics
Univ. of Bologna ITALY
phone: +39 051 2094872
e-mail: fabio@cs.unibo.it
http://vitali.web.cs.unibo.it/

The sage and the fool
go to their graves
alike in this respect:
both believe the sage to be a fool.
Where, then, may wisdom be found?
Qi, "Neither Yes nor No", The codeless code
