Subject: Processing extension elements
From: "Yves Savourel" <ysavourel@translate.com>
To: <xliff@lists.oasis-open.org>
Date: Tue, 17 Aug 2004 13:32:07 -0600
It seems to me that the fundamental question for extensions 
in <source>/<target> is how "generic" tools will be able to 
deal with them, while preserving them.

Here are the possible ways of processing extensions I can 
think of (without worrying about how this would be expressed 
in the XLIFF schema):

#1- The unknown elements and their content are stripped out.

#2- The unknown elements are stripped out, their content
    left part of the <source>/<target>.

#3- The unknown elements are preserved and treated as <g> 
    (or <x/> if they are empty elements).

#4- The unknown elements are preserved and treated as <ph> 
    (their content is seen as code).

#5- The unknown elements carry some XLIFF-understood 
    indication of how they should be treated.


- "generic tool" means a tool that does the minimal 
processing allowed by the specification. It does not know 
any specific extension.

- "as seen by a generic tool" means how the unknown tags 
would be interpreted in memory (regardless of how they are 
actually represented) by tools that would not know what 
to do with them.

- There are actually two cases of processing: during merge 
and not during merge. During a merge process the unknown 
elements should be ignored by the generic tool (just like 
an <mrk> element). One has to decide what to do with the 
content: discard it or treat it as part of the text.

Now let's see examples, pros, and cons for each case:


============================================================
#1- The unknown elements and their content are stripped out.
------------------------------------------------------------

The most drastic solution.

Original entry:

<source xml:lang='en'>This is <htm:b>big</htm:b></source>

As seen by a generic tool:

<source xml:lang='en'>This is </source>

Saved by a generic tool:

<source xml:lang='en'>This is </source>

Probably not what we want, as extensions that enclose the 
original content would become a death trap for translatable 
text.
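
To make the loss concrete, here is a minimal Python sketch of #1 
using the standard xml.etree.ElementTree module. The namespace URI, 
the function names, and the rule "any namespaced element is an 
unknown extension" are assumptions made for illustration only; 
nothing here comes from the XLIFF specification.

```python
import xml.etree.ElementTree as ET

# Assumption: the htm namespace URI is invented for this example.
ENTRY = ("<source xmlns:htm='urn:example:htm' xml:lang='en'>"
         "This is <htm:b>big</htm:b></source>")

def is_unknown(elem):
    # Simplification: any namespaced element is an unknown extension.
    return elem.tag.startswith("{")

def strip_with_content(parent):
    """Option #1: remove unknown elements together with their content."""
    previous = None  # last child kept so far
    for child in list(parent):
        if is_unknown(child):
            # Text following the removed element still belongs to the parent.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)
        else:
            strip_with_content(child)
            previous = child

source = ET.fromstring(ENTRY)
strip_with_content(source)
print(repr("".join(source.itertext())))  # 'This is ' ("big" is lost)
```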




============================================================
#2- The unknown elements are stripped out, their content 
    left part of the <source>/<target>.
------------------------------------------------------------

Original entry:

<source xml:lang='en'>This is <htm:b>big</htm:b></source>

As seen by a generic tool:

<source xml:lang='en'>This is big</source>

Saved by a generic tool:

<source xml:lang='en'>This is big</source>

A very simple way to deal with unknown tags. But it would 
add unwanted content if the content of the extension 
elements is really metadata, as shown below.

Original entry:

<source xml:lang='en'>This is 
 <x:def><x:term>big</x:term><x:pron>'big</x:pron></x:def>
</source>

As seen by a generic tool:

<source xml:lang='en'>This is big'big</source>

Saved by a generic tool:

<source xml:lang='en'>This is big'big</source>
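
A rough Python sketch of #2, again with xml.etree.ElementTree. It 
drops the unknown tags and splices their content into the parent. 
The namespace URI, the function names, and the "namespaced means 
unknown" rule are illustrative assumptions, and the entry is written 
on one line so the whitespace matches the result shown above.

```python
import xml.etree.ElementTree as ET

# Assumption: the x namespace URI is invented for this example.
ENTRY = ("<source xmlns:x='urn:example:x' xml:lang='en'>This is "
         "<x:def><x:term>big</x:term><x:pron>'big</x:pron></x:def></source>")

def is_unknown(elem):
    # Simplification: any namespaced element is an unknown extension.
    return elem.tag.startswith("{")

def unwrap_unknown(parent):
    """Option #2: drop unknown tags, keep their content in the parent."""
    for child in list(parent):
        unwrap_unknown(child)  # clean nested content first
    i = 0
    while i < len(parent):
        child = parent[i]
        if not is_unknown(child):
            i += 1
            continue
        kept = list(child)  # already cleaned, so these are known elements
        if kept:
            # The removed element's tail follows its last surviving child.
            kept[-1].tail = (kept[-1].tail or "") + (child.tail or "")
            lead = child.text or ""
        else:
            lead = (child.text or "") + (child.tail or "")
        # Leading text joins whatever text precedes the removed element.
        if i == 0:
            parent.text = (parent.text or "") + lead
        else:
            parent[i - 1].tail = (parent[i - 1].tail or "") + lead
        parent.remove(child)
        for offset, element in enumerate(kept):
            parent.insert(i + offset, element)
        i += len(kept)

source = ET.fromstring(ENTRY)
unwrap_unknown(source)
print(source.text)  # This is big'big
```

The pronunciation metadata ends up in the translatable text, which 
is exactly the drawback described above.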




============================================================
#3- The unknown elements are preserved and treated as <g> 
    (or <x/> if they are empty elements).
------------------------------------------------------------

Original entry:

<source xml:lang='en'>This is <htm:b>big</htm:b></source>

As seen by a generic tool:

<source xml:lang='en'>This is <g id='0'>big</g></source>

Saved by a generic tool:

<source xml:lang='en'>This is <htm:b>big</htm:b></source>

This solution would also add unwanted content if the 
content of the extension elements is really metadata, as 
shown below.

Original entry:

<source xml:lang='en'>This is 
 <x:def><x:term>big</x:term><x:pron>'big</x:pron></x:def>
</source>

As seen by a generic tool:

<source xml:lang='en'>This is <g id='0'><g id='1'>big</g>
<g id='2'>'big</g></g></source>

Saved by a generic tool:

<source xml:lang='en'>This is 
 <x:def><x:term>big</x:term><x:pron>'big</x:pron></x:def>
</source>
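
One possible way to implement #3 is to rename the unknown elements 
in memory and remember the original names so they can be put back 
when saving. The Python sketch below does that with 
xml.etree.ElementTree. The namespace URI, the function names, and 
the id numbering are assumptions for illustration; a real tool would 
also have to avoid clashing with ids of genuine <g>/<x/> elements.

```python
import xml.etree.ElementTree as ET

# Assumption: the htm namespace URI is invented for this example.
ET.register_namespace("htm", "urn:example:htm")
ENTRY = ("<source xmlns:htm='urn:example:htm' xml:lang='en'>"
         "This is <htm:b>big</htm:b></source>")

def is_unknown(elem):
    # Simplification: any namespaced element is an unknown extension.
    return elem.tag.startswith("{")

def to_generic_view(root):
    """Option #3: view unknown elements as <g>, or <x/> when empty."""
    originals = {}  # id -> (original tag, original attributes)
    counter = 0
    for elem in root.iter():  # document order, so ids follow the markup
        if not is_unknown(elem):
            continue
        key = str(counter)
        originals[key] = (elem.tag, dict(elem.attrib))
        is_empty = len(elem) == 0 and not elem.text
        elem.tag = "x" if is_empty else "g"
        elem.attrib.clear()
        elem.set("id", key)
        counter += 1
    return originals

def restore(root, originals):
    """Put the original extension elements back before saving."""
    for elem in root.iter():
        key = elem.get("id")
        if elem.tag in ("g", "x") and key in originals:
            tag, attrib = originals[key]
            elem.tag = tag
            elem.attrib.clear()
            elem.attrib.update(attrib)

source = ET.fromstring(ENTRY)
saved = to_generic_view(source)
print(ET.tostring(source, encoding="unicode"))  # in-memory view with <g>
restore(source, saved)
print(ET.tostring(source, encoding="unicode"))  # extension element restored
```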




============================================================
#4- The unknown elements are preserved and treated as <ph> 
    (their content is seen as code).
------------------------------------------------------------

This is John's scenario (I think). It works fine if the 
content of all extension elements is metadata.

Original entry:

<source xml:lang='en'>This is big<x:note>blah blah</x:note>
</source>

As seen by a generic tool:

<source xml:lang='en'>This is big<ph id='0'>blah blah</ph>
</source>

Saved by a generic tool:

<source xml:lang='en'>This is big<x:note>blah blah</x:note>
</source>

But it does not work for text content inside extension 
elements, as it would be seen as "code".

Original entry:

<source xml:lang='en'>This is <htm:b>big</htm:b></source>

As seen by a generic tool:

<source xml:lang='en'>This is <ph id='0'>big</ph></source>
(Code not text --------------------------^ )

Saved by a generic tool:

<source xml:lang='en'>This is <htm:b>big</htm:b></source>
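
The difference between the two examples can be sketched in Python 
by separating what a generic tool would offer for translation from 
what it would hide as code. This is only an illustration with 
xml.etree.ElementTree: the namespace URIs, the function names, and 
the "namespaced means unknown" rule are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: both namespace URIs are invented for this example.
NOTE = ("<source xmlns:x='urn:example:x' xml:lang='en'>"
        "This is big<x:note>blah blah</x:note></source>")
BOLD = ("<source xmlns:htm='urn:example:htm' xml:lang='en'>"
        "This is <htm:b>big</htm:b></source>")

def is_unknown(elem):
    # Simplification: any namespaced element is an unknown extension.
    return elem.tag.startswith("{")

def _walk(elem, text, codes):
    text.append(elem.text or "")
    for child in elem:
        if is_unknown(child):
            # Option #4: the whole element is one opaque placeholder.
            codes.append("".join(child.itertext()))
        else:
            _walk(child, text, codes)
        text.append(child.tail or "")

def split_text_and_code(xml_string):
    """Return (translatable text, list of hidden placeholder contents)."""
    text, codes = [], []
    _walk(ET.fromstring(xml_string), text, codes)
    return "".join(text), codes

print(split_text_and_code(NOTE))  # ('This is big', ['blah blah'])
print(split_text_and_code(BOLD))  # ('This is ', ['big'])
```

In the second case "big" is hidden from the translator even though 
it is real text.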




============================================================
#5- The unknown elements carry some XLIFF-understood 
    indication of how they should be treated.
------------------------------------------------------------

There are two ways to indicate this: by an XLIFF-defined 
attribute carried by the extension elements, or by enclosing 
the extensions in a special new XLIFF element such as 
<extend>.

Original entry:

<source xml:lang='en'>This is 
 <x:def xlf:totrans='yes'><x:term>big</x:term><x:pron 
 xlf:totrans='no'>'big</x:pron></x:def></source>

As seen by a generic tool:

<source xml:lang='en'>This is <g id='0'><g id='1'>big</g>
<ph id='2'>'big</ph></g></source>

Saved by a generic tool:

<source xml:lang='en'>This is 
 <x:def xlf:totrans='yes'><x:term>big</x:term><x:pron 
 xlf:totrans='no'>'big</x:pron></x:def></source>

This is more flexible since it lets you specify how each 
element should be processed.

However, it may not always be doable, as the extension 
elements may belong to a namespace that does not itself allow 
extension, so you would not be able to use xlf:totrans (or 
whatever flag is decided on). For that case the solution would 
be to use an <extend> XLIFF element, as Matt (I think) 
suggested. But as you can imagine this would start to make the 
<source>/<target> content rather crowded.
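
As a rough illustration of the attribute variant, the Python sketch 
below decides per element whether it would be seen as <g> or <ph>. 
The xlf namespace URI, the function names, and the rule that an 
element without the flag inherits it from its parent (defaulting to 
'yes') are all assumptions on my part; the inheritance rule is only 
inferred from the example above, where <x:term> has no flag and is 
seen as <g>.

```python
import xml.etree.ElementTree as ET

# Assumption: both namespace URIs are invented for this example.
XLF_NS = "urn:example:xlf"
TOTRANS = "{" + XLF_NS + "}totrans"
ENTRY = ("<source xmlns:x='urn:example:x' xmlns:xlf='urn:example:xlf' "
         "xml:lang='en'>This is <x:def xlf:totrans='yes'>"
         "<x:term>big</x:term>"
         "<x:pron xlf:totrans='no'>'big</x:pron></x:def></source>")

def is_unknown(elem):
    # Simplification: any namespaced element is an unknown extension.
    return elem.tag.startswith("{")

def classify(elem, inherited="yes", out=None):
    """Option #5: list (local name, 'g' or 'ph') in document order."""
    if out is None:
        out = []
    for child in elem:
        if not is_unknown(child):
            classify(child, inherited, out)
            continue
        # Assumed rule: a missing flag is inherited from the parent.
        flag = child.get(TOTRANS, inherited)
        name = child.tag.split("}", 1)[1]
        if flag == "yes":
            out.append((name, "g"))    # content is translatable text
            classify(child, flag, out)
        else:
            out.append((name, "ph"))   # content is opaque code
    return out

source = ET.fromstring(ENTRY)
print(classify(source))  # [('def', 'g'), ('term', 'g'), ('pron', 'ph')]
```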




============================================================
Personal opinion
------------------------------------------------------------

It seems that allowing extension elements that can have 
either translatable or "code" content in <source>/<target> 
would add a significant cost in processing and complexity, 
and I'm not sure allowing code content (i.e. metadata) 
would be wise anyway.

I see no big problem with the <htm:b> type of extension, as 
such elements are simply a more customized way of using <mrk>, 
and "generic" tools could probably deal with them without too 
much change in their implementation. So, I tend to like 
solution #3 better (at least for now).

-yves