

Subject: Re: [xliff] ITS: Preserve space and Language Information


Thanks, Yves, inline..

On Thu, Oct 23, 2014 at 3:13 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

Hi David, all,

 

While in some cases (like multiple spaces between sentences) using <ignorable> with xml:space could be a solution, that can’t solve all use cases, and, as pointed out, that will cause trouble when re-segmenting.

 

The other solution (using inline codes to store spans of white-spaces) looks like asking for trouble: the main reason for such a complicated option would be that xml:space can’t be set on <mrk>. It would also not solve the xml:lang case. In general we do not want to encourage using more inline codes.

 

I think the simplest and most comprehensive solution is to have its:space and its:lang defined and behaving just like xml:space and xml:lang, but with the sm-specific scope. That doesn’t preclude anyone from using the other options if they really want to go down that road.

I am not strongly opposed to defining its:space and its:lang if that indeed proves to be the best and simplest solution. I am, however, far from convinced that it is.
It would be ironic, as they would be introduced for ITS, where non-well-formed spans are currently not an option, to cater for non-well-formed span transformations between <mrk> and <sm/>/<em/>.
While this solution looks like the most systematic one, I doubt that it is the simplest.
The more ITS categories use potentially non-well-formed spans with <sm/>/<em/>, the more likely it is that the ITS data won't make it through the roundtrip, because the equivalence reduction to <mrk> variants will be less likely to succeed.
If you are restricted to using the needed xml namespace attributes on static structural elements down to <unit>, AND dynamically on <source> and <target>, you have one layer of ITS markup with guaranteed well-formedness, so that is two fewer to worry about when making the <sm/>/<em/> to <mrk> transforms.
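
To make the roundtrip concern concrete, here is a minimal sketch of annotations that cannot all be reduced to <mrk>. The ids, type values, and text are made up for illustration; only the <sm/>/<em/> mechanics come from Core.

```xml
<unit id="u1">
  <segment id="s1">
    <!-- m1 opens here but closes in the next segment, so it cannot be
         rewritten as one <mrk> element inside a single <source> -->
    <source>See the <sm id="m1" type="generic"/>first sentence. </source>
  </segment>
  <segment id="s2">
    <!-- m2 opens before m1 closes and ends after it: the two spans
         overlap, so they cannot both become properly nested <mrk> elements -->
    <source><sm id="m2" type="generic"/>Second<em startRef="m1"/> sentence<em startRef="m2"/> follows.</source>
  </segment>
</unit>
```

If m1 and m2 each carried their own whitespace or language value, a processor would have to work out the effective value for every stretch of text between the markers rather than read it off an enclosing element.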

In the case of terminology, we did say that all terminology is encoded inline, even though it may apparently exist on structural elements in various source formats. We said that the use case where the whole element is terminology is not significant enough to warrant different handling.

The situation here is the opposite, but analogous. IMHO and AFAIK, whitespace handling and language information are inherently structural characteristics when encoding natural language text, and we actually do NOT inhibit the expressivity of XLIFF by not introducing truly inline variants that could be transformed into <sm/>/<em/> pairs.

If you indeed have to introduce a different language or a different sort of whitespace handling at the sub-unit level, I don't think that separating such a portion as its own segment or ignorable is unwarranted. If you want a guaranteed roundtrip for such a construction, you can protect it with the canResegment flag set to "no", which again seems warranted for such a special case.

While I see that the introduction of its:space and its:lang looks systematic, and I am fairly confident it is doable, I do think that such a solution is overkill that brings more complexity than is warranted by any real-life case where you'd need this type of metadata truly inline.

When you need an example or password field or array, or whatever with different whitespace handling, it hardly seems unwarranted to extract it as a different unit or at least a different segment.

Similarly if you are using examples, poems, quotations in a different language, these seem inherently structurally different to the normal text flow in the main source language.
Even if you are using one-word examples tightly mixed within the source language, it seems plausible to set them as separate segments that can be, e.g., handled by different services/translators. Again, I do not see a significant use case for introducing a full-blown <mrk> <-> <sm/>/<em/> machinery for this metadata, which actually is inherently structural.

I should like to challenge people on both mailing lists (Felix?, Fredrik?) to come up with valid and frequent use cases where structural extraction seems inadequate.

 

It simply means that if you want to handle Preserve Space or Language Information at the inline level, you have to support that part of the ITS module (which is really not complicated when you already have to handle xml:space and xml:lang for the Core).

I do not understand this reasoning. Based on Core, you only need to support xml namespace attributes through simple inheritance and do not need to worry about analogous semantics on non-well-formed spans. So the introduction of those new attributes on annotation markers actually does bring a whole new complexity.
Do you remember how complicated it is to determine translatability across non-well-formed spans and across segments? I think there is value in avoiding this complexity for xml:space and xml:lang.
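
For comparison, the simple inheritance that Core already requires can be sketched as below. The ids and text are illustrative. The point is that every scope coincides with an element, so the effective value at any position is found by walking up the tree.

```xml
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
       srcLang="en" trgLang="fr">
  <file id="f1">
    <!-- no xml:space on this unit: the default handling applies -->
    <unit id="u1">
      <segment>
        <source>Ordinary text; runs of spaces may be normalized.</source>
      </segment>
    </unit>
    <!-- xml:space set on the unit is inherited by everything inside it -->
    <unit id="u2" xml:space="preserve">
      <segment>
        <source>Column A      Column B</source>
      </segment>
    </unit>
  </file>
</xliff>
```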

That means one cannot guarantee those features will be preserved by Core-only processors. But it’s already the case in 2.0.

Do you mean that the xml namespace is also allowed on structural extension points? I think it was a bad decision and I was trying to sway it. Anyway, now we are not talking about extensibility at higher structural levels but about introducing a new inline complexity through a fully protected module, a wholly different issue. You are trying to introduce a non-XML-like behavior for two xml namespace attributes (their counterparts in the module namespace, of course, but anyway) that I'd argue don't really need it, as we would have a hard time thinking of valid use cases where the use of Preserve Space or Language Information actually is not structural.

 

Cheers,

-yves

 

 

From: Dr. David Filip [mailto:David.Filip@ul.ie]
Sent: Thursday, October 23, 2014 7:04 AM
To: Yves Savourel
Cc: XLIFF Main List; public-i18n-its-ig
Subject: Re: [xliff] ITS: Preserve space and Language Information

 

Thanks, Yves,

 

I was thinking about two possible solutions.

One of them would be as you propose to introduce its attributes that could work with empty markers as span delimiters.

 

Another way would be to use the fact that the two relevant XML namespace attributes are still available on <source> and <target>

Not sure if this is an omission, probably not as we have PR for resegmentation accounting for that.

 

This would be somewhat restrictive but would have the advantage that the related markup would always be well-formed.

 

I tried to write up such a restrictive solution for Preserve Space in the current Working Draft.

It also notes that you can use originalData to preserve whitespace..

 

I copy paste it here:

 

Preserve Space

Indicates how to handle whitespace in a given content portion. See [ITS] Preserve Space for details.

Structural Elements

 Whitespace handling at the structural level is indicated with xml:space in XLIFF Core and extensions: 

Extraction of preserved whitespace at the structural level

Original:

 

<listing xml:space='preserve'>Line 1

Line 2</listing>

        

Extraction:

 

<unit id='1' xml:space='preserve'>

 <segment>

  <source>Line 1

Line 2</source>

 </segment>

</unit>

        

 

Inline Elements

 It is not possible to use [XML namespace] attributes on XLIFF inline elements. It is advised that mixed Preserve Space behavior is NOT used inline in source formats. In case of extraction of source format inline elements with mixed Preserve Space behavior, it is advised to extract all discernible portions with uniform whitespace handling into different <unit> elements that can have their whitespace handling set independently. 

Whitespace handling can also be set independently for text segments and ignorable text portions within an Extracted unit, and for the source and target language within the same <segment> or <ignorable> element, using the optional xml:space attribute on the <source> and <target> elements. However, mixed whitespace handling behavior is not likely to survive Segmentation Modification, so this method is not advised unless the <segment> elements are protected by the canResegment flag value set to or inherited as no. 
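
A minimal sketch of this per-segment method, with made-up ids and content. It is an illustration under those assumptions, not an example taken from the working draft.

```xml
<unit id="1">
  <segment id="s1">
    <source>The table header reads: </source>
    <target>L'en-tête du tableau est : </target>
  </segment>
  <!-- canResegment="no" keeps the segment that carries its own
       whitespace handling from being split or joined -->
  <segment id="s2" canResegment="no">
    <source xml:space="preserve">Name      Size      Date</source>
    <target xml:space="preserve">Nom       Taille    Date</target>
  </segment>
</unit>
```

Here both <source> and <target> of the second segment preserve whitespace, but each could be set on its own, since the attribute is given per element.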

Preserved whitespace can also be extracted as original data stored outside of the translatable content at the unit level and referenced from placeholder codes. It is important to note that the value of the xml:space attribute is restricted to preserve on the <data> element.

Extraction of preserved whitespaces as referenced original data

Original:

 

 <p>

   <span xml:space='preserve'>Item 1      Item 2      Item n+1 

   </span> are all used to build Item n+2.

 </p>

     

Extraction:

 

<unit id='1'>

  <originalData>

    <data id="d1">&lt;span xml:space='preserve'></data>

    <data id="d2">&lt;/span></data>

    <data id="d3">      </data>

    <data id="d4"> 

    </data>

  </originalData>

  <segment>

    <source><pc id="1" dataRefStart="d1" dataRefEnd="d2">Item 1<ph id="2" dataRef="d3"/>Item 2<ph id="3" dataRef="d3"/>Item n+1<ph id="4" dataRef="d4"/></pc> are all used to build Item n+2.</source>

  </segment>

</unit>

        

 

Not sure really which solution is better, but I'd say we should explore both..

 

Cheers

dF


Dr. David Filip

=======================

OASIS XLIFF TC Secretary, Editor, and Liaison Officer 

LRC | CNGL | CSIS

University of Limerick, Ireland

telephone: +353-6120-2781

cellphone: +353-86-0222-158

facsimile: +353-6120-2734

 

On Thu, Oct 23, 2014 at 1:41 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

Hi all,

It seems to me that we don't have a good solution for the inline cases of the Preserve Space and Language Information data
categories.

In the original draft mapping we used xml:space and xml:lang on <mrk>.
But, as David pointed out, this can't work because these attributes are not allowed on <mrk>/<sm>.
I believe we did this because of <sm>: both xml:lang and xml:space scopes would apply to an empty element.

But we cannot have no inline solution for those two data categories.
So it seems they would fall into the class of the data categories only partially supported directly by the core, and we need
ITS-module attributes to handle them inline. Something like this: <mrk id='1' type="its:any" its:space="preserve" its:lang="iu">.
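
A rough sketch of how this could look inside a unit, both as <mrk> and as the equivalent <sm/>/<em/> pair. The its:space and its:lang attributes do not exist yet, the its prefix is assumed to be bound to whatever namespace the ITS module defines, and the ids and text are placeholders.

```xml
<unit id="u1">
  <segment>
    <!-- Well-formed span: one <mrk> carries the hypothetical attributes -->
    <source>The name <mrk id="m1" type="its:any" its:space="preserve" its:lang="iu">ᐃᓄᒃᑎᑐᑦ</mrk> is written in syllabics.</source>
  </segment>
  <segment>
    <!-- The same kind of annotation written with the empty start/end markers -->
    <source>The name <sm id="m2" type="its:any" its:space="preserve" its:lang="iu"/>ᐃᓄᒃᑎᑐᑦ<em startRef="m2"/> is written in syllabics.</source>
  </segment>
</unit>
```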

Cheers,
-yves




