xliff-inline message

Subject: RE: [xliff] RE: [xliff-inline] Representing information for the starting/ending/standalone parts

From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff-inline@lists.oasis-open.org>
Date: Sun, 9 Oct 2011 07:08:25 -0600

Hi Bryan, all,

Maybe it would be good to summarize again what we have to represent.

We must be able to represent 3 types of inline codes:

- codes that stand alone
- codes that mark the start of a start-end construct
- codes that mark the end of a start-end construct

Note that I'm careful to NOT use HTML/XML terminology like "tag" or "element" because XLIFF must work for any notation.

The "simplest" way to represent those three types of codes is, as Christian and Rodolfo pointed out, to use a single element with some attributes.

Let's call it <C>:

<C id='1'/>

<C id='2' kind='start'/>text<C rid='2'/>

Note that there is no need for a kind='placeholder' or kind='end' because:
 - if rid is present it has to be a kind='end'
 - if neither rid nor kind is present it has to be a code for a placeholder.
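
The inference rule above can be sketched in Python as follows. This is only an illustration: the function name classify_code is made up, and <C>, rid and kind are the placeholder names used in this email, not a final syntax.

```python
import xml.etree.ElementTree as ET

def classify_code(elem):
    """Return 'end', 'start' or 'placeholder' for a hypothetical <C> element."""
    if elem.get("rid") is not None:      # rid present: must be an end code
        return "end"
    if elem.get("kind") == "start":      # explicit start marker
        return "start"
    return "placeholder"                 # neither rid nor kind: standalone

sample = "<seg><C id='1'/><C id='2' kind='start'/>text<C rid='2'/></seg>"
kinds = [classify_code(c) for c in ET.fromstring(sample).iter("C")]
print(kinds)  # ['placeholder', 'start', 'end']
```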


Using such an element, we can handle the original data in the ways we discussed:

1) Do not store the original data at all:

Like the example above.


2) Store the original data inside the content of <C>:

<C id='1'>{\object...}</C>

<C id='2' kind='start'>{\b </C>text<C rid='2'>}</C>


3) Store the original data outside the content, using a reference in <C>:

<C id='1' nid='d1'/>

<C id='2' kind='start' nid='d2'/>text<C rid='2' nid='d3'/>
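
To make the three options concrete, here is a rough Python sketch of how a tool might retrieve the original data in each case. The function original_data and the dictionary used as the outside store are assumptions made for the example; where the referenced data actually lives is not defined here.

```python
import xml.etree.ElementTree as ET

def original_data(elem, store):
    """Return the original data of a hypothetical <C> code, or None if not stored."""
    nid = elem.get("nid")
    if nid is not None:      # option 3: data kept outside, looked up by nid
        return store.get(nid)
    return elem.text         # option 2: data in the content; None for option 1

store = {"d2": "{\\b ", "d3": "}"}  # assumed outside store, keyed by nid

not_stored = ET.fromstring("<seg><C id='1'/></seg>")
inline = ET.fromstring("<seg><C id='2' kind='start'>{\\b </C>text<C rid='2'>}</C></seg>")
referenced = ET.fromstring("<seg><C id='2' kind='start' nid='d2'/>text<C rid='2' nid='d3'/></seg>")

none_data = [original_data(c, store) for c in not_stored.iter("C")]
inline_data = [original_data(c, store) for c in inline.iter("C")]
ref_data = [original_data(c, store) for c in referenced.iter("C")]
```

Options 2 and 3 give the same result to the caller; only the place where the data is kept differs.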


That's it. We don't need anything more.
Technically no <pc>...</pc>, or <sc>/<ec> are needed at all.


So first: why <sc>/<ec>?

- Because using the same element for three different purposes may be confusing. Having one separate element for each function seems to bring more clarity.

- It may also make validation a bit easier: rid exists only in <ec>, and there is no need for an extra kind='start' (or similar) attribute.

- One can know the type of code by looking just at the element name (one operation) instead of having to look also at the presence/absence of attributes. With a single element it takes three operations to know the type of code: get the element, look for kind, look for rid.
That advantage may look unimportant, but in some contexts it simplifies things a lot. Try for example to find only all the standalone codes in a document using a text editor and a regex search...
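
As an illustration of the regex point, here is a small Python comparison. The sample strings are made up, and <ph> stands in for the standalone code in the separate-element variant.

```python
import re

# Same content, once with a single hypothetical <C> element, once with
# separate element names (<ph> standing in for the standalone code).
single = "<C id='1'/> <C id='2' kind='start'/>text<C rid='2'/> <C id='3'/>"
separate = "<ph id='1'/> <sc id='2'/>text<ec rid='2'/> <ph id='3'/>"

# Single element: the pattern must also rule out the rid and kind attributes.
standalone_single = re.findall(r"<C\b(?![^>]*\b(?:rid|kind)=)[^>]*>", single)

# Separate elements: the element name alone is enough.
standalone_separate = re.findall(r"<ph\b[^>]*>", separate)

print(standalone_single)    # ["<C id='1'/>", "<C id='3'/>"]
print(standalone_separate)  # ["<ph id='1'/>", "<ph id='3'/>"]
```

The first pattern needs a negative lookahead, which not every text editor supports; the second one works almost anywhere.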

So in my opinion adding <sc>/<ec> does not complicate things much, it does not break the paradigm, and does bring some better usability. In other words, the price is minor compared to the few benefits.


Next, what about <pc>?

In our current arsenal <pc> exists only to answer requirement #17. But its addition does break the paradigm and comes with some complications:

- It cannot be used when you want to store the original data inside the content (I'll expand on that below).

- It forces us to have two attributes instead of one for any information that exists for both the start and the end part of the code (equivStart, equivEnd, subFlowStart, subFlowEnd, etc.)

- The only good side I can find for it is that it allows a more XML-friendly way to denote a start/end construct in the same segment (requirement #17).


Now, Bryan, on using <pc> as standalone code.

> Exactly! I think we've gone the long way around to 
> make my point. <pc> can be made to work across 
> spanned  segments with proper attributes (example 
> will follow). But using <pc> across segments is 
> as (in my opinion) ill-suited as using <sc>/<ec>
> in a single segment.
> ...
> <seg>These skis are good in <pc fuct='start' 
> id='s1' /> Crud.</seg>
> <seg>Powder and packed powder<pc fuct='end' 
> idref='s1' /> would be better served by another ski</seg>

If you replace <sc> by <pc funct='start'/> and <ec> by <pc funct='end'/>, we no longer have a way to store the original data for span-like codes, because we surely cannot have the content of <pc> be sometimes text and other times the original data. That would be utterly confusing:

<seg><pc id='1' funct='start'>{\b</pc> text</seg>
<seg><pc rid='1' funct='end'>}</pc> text</seg>
<seg><pc id='2'>text</pc></seg>

We could certainly morph <sc>/<ec> into <ph> as illustrated at the top of the email (<C> is <ph> then). But none of that makes <pc> any better: it still brings complexity.


When we dig deeper, we can see there is really one thing that causes all this drama: the fact that we decided not to store the original data in attributes. I still think this is the proper choice because of what I understand of how attribute values behave.
But we should make sure we have a good rationale to justify that choice. Do we know for sure that it is a problem to store large data and/or data with line breaks in attributes?
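
One quick way to check the line-break part of that question is to parse the same data both ways. The sketch below uses Python's standard XML parser; the attribute name data is hypothetical. As far as I understand XML 1.0 attribute-value normalization, a conforming parser turns literal line breaks and tabs inside an attribute value into spaces, while element content keeps them.

```python
import xml.etree.ElementTree as ET

# Original data with a literal line break and a tab, stored two ways.
as_attribute = ET.fromstring('<C id="1" data="{\\b\n\ttext}"/>')
as_content = ET.fromstring('<C id="1">{\\b\n\ttext}</C>')

attr_value = as_attribute.get("data")   # line break and tab come back as spaces
content_value = as_content.text         # line break and tab are preserved

# A line break written as a character reference does survive in an attribute.
escaped = ET.fromstring('<C id="1" data="{\\b&#10;text}"/>').get("data")

print(repr(attr_value))
print(repr(content_value))
print(repr(escaped))
```

So line breaks can be kept in attributes only if every writer escapes them as character references. This says nothing about the size question, which would need to be tested separately.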


Cheers,
-ys

References:
- RE: [xliff-inline] Representing information for the starting/ending/standalone parts
  - From: "Lieske, Christian" <christian.lieske@sap.com>
- RE: [xliff-inline] Representing information for the starting/ending/standalone parts
  - From: "Lieske, Christian" <christian.lieske@sap.com>
- RE: [xliff-inline] Representing information for the starting/ending/standalone parts
  - From: "Schnabel, Bryan S" <bryan.s.schnabel@tektronix.com>