OASIS Mailing List Archives

legaldocml message


Subject: Re: [legaldocml] international languages in URIs

Dear Fabio,

Thanks for the feedback.

On the issue of using IRIs instead of URIs: that's pretty clear to me now. I am happy to make this proposal in the next TC meeting :-)

On the second issue of using non-English words in ontology classes, I took a deeper look at it and it's still not clear to me what the correct approach should be. Let me give you the scenario:

There is legislation in Arabic: it is not in English, has no English version, and will always be in Arabic. However, there is an ontology classification scheme that seems to be primarily in Arabic, while the English edition is used "visually". By that I mean that the system used to present the legislation has a user interface in both English and Arabic, and an ontology browser is provided in both languages (I would say the ontologies have been cross-populated: some English terms were translated to Arabic and included in the Arabic ontology, and vice versa). The ontology browser eventually resolves to legislation that is in Arabic.

So my question is: which is the appropriate language to use in the ontology of the encoded Arabic XML documents? The link between the Arabic and English terms is maintained in an external system (kind of like EuroVoc). I presume it should still be English, if the intention is to exploit vocabularies that may be available outside of the Arabic domain?



On 7 August 2013 11:41, Fabio Vitali <fabio@cs.unibo.it> wrote:
Dear Ashok,

On 25 Jul 2013, at 19:52, Ashok Hariharan <ashok@parliaments.info> wrote:

> Is it correct to use non-English language scripts in URIs, e.g. the type name or an ontology classification type name specified in a language like Arabic or Tamil?
> Ashok

Sorry for the time it took to answer: I had to check a few things first... The answer is not as simple as it may appear (that is to say, it will not be a plain yes or no).

Background first: there are two separate standards for the syntax of web addresses, i.e., URI (Uniform Resource Identifiers, http://tools.ietf.org/html/rfc3986 ) and IRI (Internationalized Resource Identifiers, http://tools.ietf.org/html/rfc3987 ). The first limits the character set of web addresses and references to (a subset of) US-ASCII characters, and provides a mechanism for encoding characters NOT belonging to US-ASCII (namely, the percent-encoding syntax). The second, on the other hand, defines a syntax that is equivalent to that of URIs, but over a much larger set of characters, i.e., UCS (ISO/IEC 10646), which includes, among others, Arabic and Tamil characters.
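
To make the difference concrete, here is a minimal Python sketch of how an IRI path containing Arabic characters maps to a plain URI through percent-encoding of its UTF-8 octets. The address under example.org and the helper name iri_to_uri are placeholders invented for illustration; they do not come from Akoma Ntoso or from either RFC. The sketch also ignores internationalized host names, which need separate treatment.

```python
from urllib.parse import quote, unquote

# ASCII characters that have a syntactic role in RFC 3986, plus "%" so that
# sequences which are already percent-encoded are not encoded twice.
_KEEP = ":/?#[]@!$&'()*+,;=%~"


def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI as UTF-8 octets.

    Simplified sketch: the host part is assumed to be plain ASCII.
    """
    return quote(iri, safe=_KEEP, encoding="utf-8")


# Hypothetical ontology reference whose last segment is an Arabic word.
iri = "http://example.org/ontology/concept/قانون"
uri = iri_to_uri(iri)

print(uri)
# http://example.org/ontology/concept/%D9%82%D8%A7%D9%86%D9%88%D9%86

# Decoding the percent-encoded form gives back the original IRI.
print(unquote(uri) == iri)  # True
```

The encoded form is what a strictly URI-only format would have to carry today, which illustrates why percent-encoding is cumbersome for human readers.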

The IRI specification also explicitly states that "A protocol or format element should be explicitly designated to be able to carry IRIs." Now, unfortunately, Akoma Ntoso never made such a statement, and in fact we use terms such as URI and URI reference everywhere, which seems to deny coverage for non-ASCII characters except through percent-encoding (which is cumbersome, unfriendly and frankly very ugly).

So the shortest answer to your question is that NO, it is NOT correct to use non-ASCII characters in ontology classes for Akoma Ntoso documents, simply because we never thought of this situation till now, and IRIs require explicit designation of support.

On the other hand, CEN Metalex ( ftp://ftp.cen.eu/CEN/Sectors/List/ICT/CWAs/CWA15710-2010-Metalex2.pdf ) explicitly mentions IRIs as the syntax to be supported for references, Akoma Ntoso aims to be as compatible as possible with CEN Metalex, and frankly I cannot think of a single reason not to support IRIs in Akoma Ntoso.

For this reason, it seems reasonable to me to propose IRIs rather than URIs as the basic model for references in Akoma Ntoso at one of the future teleconferences. I can make this proposal myself, but I think that you should do it. I will certainly vote in favor: it is a small modification and a huge improvement. I will only insist that the text of the Release Notes and internal documentation keep using the term URI (which seems to me more immediately understandable and identifiable), and that a footnote be added saying "by URIs and URI references we actually refer to IRIs according to RFC 3987".

Once this modification is in place, you will then be authorized to use any character in UCS for your references and ontology classes in Akoma Ntoso documents.


But the next issue, as you seem to imply, is not syntactical, but properly conceptual: is it correct to use non-English words for ontology classes?

Akoma Ntoso systematically uses English for its terms (except for the Akoma Ntoso name itself), but the Naming Convention for the TLC never mentions any human language or character set:

"The URI for non-document entities consists of the following pieces:
*  The base URL of a naming authority with URI-resolving capabilities
*  A detail fragment organizing in a hierarchical fashion the additional data:
   -   The string “/ontology”
   -   The official name of the appropriate TLC
   -   Any number (including none) of slash-separated subclasses of the TLC, as long as they all refer to correct properties of the corresponding instance
   -   The ID of the instance, guaranteed to be unique within the TLC.

All components are separated by forward slashes (“/”) so as to exploit relative URIs in references. "
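
As a rough illustration of the pieces listed in the quoted convention, the following Python sketch assembles such an address from its components. The function name non_document_uri, the base URL http://example.org and the TLC name "concept" are all assumptions chosen for the example, not values taken from the Naming Convention.

```python
from typing import Sequence


def non_document_uri(base_url: str, tlc: str, instance_id: str,
                     subclasses: Sequence[str] = ()) -> str:
    """Join the pieces of the quoted convention with forward slashes.

    Order: base URL, the string "ontology", the TLC name, any subclasses,
    and finally the instance ID.
    """
    segments = [tlc, *subclasses, instance_id]
    for segment in segments:
        # A segment must be non-empty and must not contain the separator.
        if not segment or "/" in segment:
            raise ValueError(f"invalid segment: {segment!r}")
    return "/".join([base_url.rstrip("/"), "ontology", *segments])


# Hypothetical values; nothing in the construction depends on the language
# or script used for the subclass names or the instance ID.
print(non_document_uri("http://example.org", "concept", "enactment"))
print(non_document_uri("http://example.org", "concept", "قانون",
                       subclasses=["legislation"]))
```

Because every component is a separate path segment, the part starting at "/ontology" can also be used on its own as a relative reference.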

Thus there is no constraint on using non-English words in ontology classes, including Tamil or Arabic words.

A word of caution is in order, though, in my mind: the whole point of the ontology classes is to allow the association of textual structures of a legal document to the description of a shareable and comparable concept. Thus using a Tamil word (or an Italian word, for that matter) for expressing a concept that exists in English as well seems to me a way to prevent or hinder sharing and comparing of concepts.

For this reason, I would strongly object to using non-English words (Tamil, Arabic, Italian, etc.) for concepts that can be precisely described through English words: it makes no sense to use a Tamil word for "Member of Parliament", "enactment", "first reading", etc., since the English words work perfectly well. Yet, you can and probably should use Tamil and Arabic words whenever no obvious or exact English translation exists.

I hope I was clear and convincing.






Fabio Vitali                            Tiger got to hunt, bird got to fly,
Dept. of Computer Science        Man got to sit and wonder "Why, why, why?'
Univ. of Bologna  ITALY               Tiger got to sleep, bird got to land,
phone:  +39 051 2094872              Man got to tell himself he understand.
e-mail: fabio@cs.unibo.it         Kurt Vonnegut (1922-2007), "Cat's cradle"
