RE: [lexidma] follow up on two-level senses

Hi Miloš and John,

Thank you for your notes. I also think this discussion is progressing nicely, but I am starting to lose track across the different threads, so if it continues, would you please update the format of this exchange of views?

Thanks

Ilan

From: lexidma@lists.oasis-open.org <lexidma@lists.oasis-open.org> On Behalf Of John P. McCrae
Sent: Tuesday, July 7, 2020 3:20 PM
To: lexidma@lists.oasis-open.org
Subject: Re: [lexidma] follow up on two-level senses

Resend because of wrong mail address.

On Tue, 7 Jul 2020 at 13:17, John P. McCrae <john@mccr.ae> wrote:

Hi Milos,

Okay, I think we are slowly moving in a good direction.

On Mon, 6 Jul 2020 at 12:31, Miloš Jakubíček <milos.jakubicek@sketchengine.eu> wrote:

Hi John,

thanks for your reply, please see my comments:

On Fri, 3 Jul 2020 at 11:25, John P. McCrae <john@mccr.ae> wrote:

(2) there is very limited agreement on this similarity

Actually, I don't think it is merely similarity that is being encoded; in fact, these sense groupings are mostly based on ideas of systematic polysemy (as introduced by authors such as Pustejovsky [1] and Buitelaar [2]) and of complementary and contrastive senses (as described by Weinreich [3]). These are real linguistic phenomena and still motivate modern electronic lexicographic efforts [4].

Yes, I don't see that contradicting anything I've said. Perhaps a more appropriate term to be used from the computer science perspective is an equivalence relation here, but that doesn't matter and to ease understanding I stick to "similarity", however it is defined and whether it is seen as a binary relation or a real-valued metric.

Actually, my point was not that this is a kind of similarity, but that there is a linguistic test that can be used to distinguish subsenses from 'true' (i.e., contrastive) senses.

(3) there are many possible ways in which this similarity can be defined and seen; allowing for this means being closer to how language/word senses work

(4) the fact that it was encoded in a hierarchical way that only allows one-dimensional structure merely comes from the limits of a printed dictionary

I am not sure I agree with this... partly for the reasons stated above, but moreover, users do not want to use an electronic dictionary as some free-form graph structure. This is something that I have learnt from WordNet, that presenting the data as a flat text structure (e.g., https://en-word.net/) is more effective than through a graph diagram. As such, I think in both presentation and production of dictionary content, hierarchical groupings are still very useful.

I totally agree with the first part: users prefer flat structures because they are far easier to comprehend -- but that's a very valid argument against using any hierarchies, not against using just one hierarchy where many can be valid too.

I meant flat in the sense that it can be represented in a page-like structure; this certainly includes hierarchies. Or put another way: the fact that we are not limited by the printed page in electronic dictionaries does not mean that some of those restrictions, which we can now discard, are not helpful in many applications, as the computer screen is still just a flat surface for displaying information.

(5) this alternative solution therefore enables all this, and much more, if needed, without introducing additional complexity.

I think that the labels generally could use a similar notation that David mentioned for PoS tagging, with prefix denoting type of label, e.g. "sensegroup:1" or "sensegroup:etymology1" and similar but that is to be discussed.

From a technical point of view, there are also disadvantages to this. You are still encoding hierarchical senses, but now you are doing it in a way that is harder to work with in XPath and many other technologies, which in turn makes it harder for data creators to verify consistency.

I would suggest that this be implemented as an optional sense grouping tag, e.g.,

<senseGrp>
  <sense id="..."><defn></defn></sense>
  <sense id="..."><defn></defn></sense>
</senseGrp>
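
To make the two encodings under discussion concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It reads the same sense group once from a nested <senseGrp> element (as in the snippet above) and once from flat senses carrying a prefixed label such as "sensegroup:1". The `label` attribute name, the <entry> wrapper, the ids and the definitions are assumptions made up for illustration; none of them comes from a LEXIDMA draft.

```python
import xml.etree.ElementTree as ET

# Encoding A: an explicit, optional grouping element around related senses.
NESTED = """
<entry>
  <senseGrp>
    <sense id="bank-1"><defn>financial institution</defn></sense>
    <sense id="bank-2"><defn>building housing such an institution</defn></sense>
  </senseGrp>
  <sense id="bank-3"><defn>sloping land beside a river</defn></sense>
</entry>
"""

# Encoding B: flat senses, grouped only by a prefixed label value.
# The `label` attribute is hypothetical.
LABELLED = """
<entry>
  <sense id="bank-1" label="sensegroup:1"><defn>financial institution</defn></sense>
  <sense id="bank-2" label="sensegroup:1"><defn>building housing such an institution</defn></sense>
  <sense id="bank-3"><defn>sloping land beside a river</defn></sense>
</entry>
"""


def groups_nested(xml_text):
    """Return sense ids per group when groups are explicit elements."""
    root = ET.fromstring(xml_text)
    # One path step per level: each <senseGrp> child, then its <sense> children.
    return [
        [sense.get("id") for sense in grp.findall("sense")]
        for grp in root.findall("senseGrp")
    ]


def groups_labelled(xml_text, prefix="sensegroup:"):
    """Return sense ids per group when groups are encoded as label values."""
    root = ET.fromstring(xml_text)
    groups = {}
    for sense in root.findall("sense"):
        label = sense.get("label")
        # Group membership has to be reconstructed by comparing label strings.
        if label is not None and label.startswith(prefix):
            groups.setdefault(label, []).append(sense.get("id"))
    return list(groups.values())


print(groups_nested(NESTED))
print(groups_labelled(LABELLED))
```

Both functions recover the same grouping. With the grouping element, membership is read directly off the tree; with labels, it is rebuilt by matching string values, which is also the step a consistency check would have to repeat.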

It would be great if we could avoid any kind of XML thinking in our discussions: we propose a data model. XML will be just one of several serializations for it and once we establish the model we will then discuss the XML serialization, or even XML serializations (i.e. more than one).

Sure, but your modelling seems to suggest an XML attribute or an RDF datatype property.

Having said that, I think it is good to avoid any unnecessary nesting (senseGrp here) whenever it is possible. Typically this will make XPath queries shorter and easier to write and read. But again: I really don't think that choosing one particular query language of one particular serialization should influence the data model design.

Okay, I guess my point is that there should probably be an element for a sense group, given that it is a real linguistic thing and has been a part of many dictionaries for a long time.

Also, I would note that this discussion is only really about grouping senses. Grouping entries is more questionable but is often motivated by linguistic phenomena like derivation, grouping etymologically distinct forms of the same word (e.g., 'bank' can be first grouped into subentries based on its Germanic/Italian/French etymologies) or morphologically distinct forms (e.g., the unique dative singular found in the seventh sense here). We should at least consider these requirements on the representation and have a plan to represent them in the model.

Yeah -- from this perspective I think the proposed approach can be easily generalized for these purposes too.

Yes, perhaps. But whatever the modelling we need examples in the specification to cover these cases.

Regards,

John

Best regards

Milos
