xliff-comment message

Subject: Re: [xliff-comment] XLIFF vs. PO vs. Trolltech
From: Asgeir Frimannsson <asgeirf@redhat.com>
To: xliff-comment@lists.oasis-open.org
Date: Sat, 17 May 2008 14:39:32 +1000

Hi Oswald, TC members,

(this reply does not by any means represent the view of the XLIFF TC)

On Saturday 17 May 2008 03:05:11 am Oswald Buddenhagen wrote:
> Trolltech is looking into implementing/improving XLIFF support in Qt's
> Linguist tool chain. Interoperability with PO files is an item, too.
> This is what I've come up with. Please sanity-check it, so we don't set
> a faulty de-facto standard in case we go for it. ;)

In terms of PO interoperability and the representation of TS in PO, it would 
probably be wise to discuss this on the GNU gettext mailing list 
(bug-gnu-gettext@gnu.org); see http://savannah.gnu.org/projects/gettext/ . 
Also, the Translate Toolkit (translate.sf.net) has some existing ts<->po 
converters, but I'm not sure what their status is.

> - The PO representation guide says that everything should be put into one
>   <file> element and PO references should be represented as <context
>   context-type="sourcefile">. This is in accordance with the XLIFF spec
>   (see "sourcefile" value doc). However, that means that if I create an
>   .xlf file directly from sources I get a different representation than if
>   I create a .po file and convert it to .xlf later. I find this
>   inconsistency not justified, so I think I would opt for the "native"
>   representation with multiple <file> elements. Only if the PO message has
>   additional references to other files, sourcefile contexts would be used.

The main issue with representing this as multiple <file> elements is that in 
XLIFF, there is no concept of meta-data above the <file> level. We used a 
single <file> element for representing a PO, as a PO is a single file. If 
e.g. gettext implemented support natively for XLIFF, the data model would be 
very different, as the source would be a set of source-files with extracted 
translatable text, rather than a single resource file. 

(this might be a bit Qt/Trolltech specific from here:)

From what I understand from your mail you are trying to accomplish something 
like

# generates a single .xlf for the project with multiple <file> elements
lupdate -xlf myproject.pro

# generates a single .po for the project
lupdate -po myproject.pro

# generates a single .ts for the project
lupdate -ts myproject.pro

So you are saying that if you take the PO generated above and create an XLIFF 
from it using the representation guide, it will be different from the XLIFF 
created by lupdate directly? If so, I don't see anything wrong with that, as 
they are technically representing two rather different data-models. 

As a side-note: In some of my work, I've found it more beneficial to represent 
PO files as a hierarchy of <group> elements based on the PO references rather 
than the flat structure we have defined in the PO representation guide. This 
structure gives a much better contextual hierarchy for both translators and 
processing tools. This approach takes more processing though, as you have 
inter-trans-unit references, and the PO would have to be fully read before 
starting to write the XLIFF file. However, you might find this 
representation closer to what you're trying to accomplish, although I'm not 
sure how it matches with the ts <context> element.

PO:
#: src/MyDialog.cpp:23 src/MyOtherDialog.cpp:12
msgid "Hello World"
msgstr ""

XLIFF representation:
<group restype='x-directory' resname='src'>
  <group restype='x-file' resname='MyDialog.cpp'>
    <trans-unit id='1'>
      <source>Hello World</source>
    </trans-unit>
  </group>
  <group restype='x-file' resname='MyOtherDialog.cpp'>
    <trans-unit id='2' translate='no'>
      <source><ph id='x' xid='1'/></source>
    </trans-unit>
  </group>
</group>
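
As a rough sketch of how such a hierarchy could be built from PO references 
(this is only an illustration, not code from any existing converter; the 
function names build_groups and find_or_create, the (msgid, references) 
input shape and the enclosing <body> element are my own assumptions):

```python
import xml.etree.ElementTree as ET


def find_or_create(parent, restype, resname):
    """Return the matching child <group>, creating it if it is missing."""
    for child in parent.findall("group"):
        if child.get("restype") == restype and child.get("resname") == resname:
            return child
    return ET.SubElement(parent, "group", restype=restype, resname=resname)


def build_groups(entries):
    """Build a <body> of nested <group> elements from PO entries.

    entries is a list of (msgid, references), where each reference looks
    like 'src/MyDialog.cpp:23'. The whole list is processed in memory,
    which mirrors the need to read the PO fully before writing.
    """
    body = ET.Element("body")
    next_id = 1
    for msgid, refs in entries:
        first_id = None  # id of the unit that carries the real source text
        for ref in refs:
            path = ref.rsplit(":", 1)[0]  # drop the line number
            *dirs, filename = path.split("/")
            parent = body
            for d in dirs:
                parent = find_or_create(parent, "x-directory", d)
            file_group = find_or_create(parent, "x-file", filename)
            unit = ET.SubElement(file_group, "trans-unit", id=str(next_id))
            source = ET.SubElement(unit, "source")
            if first_id is None:
                source.text = msgid
                first_id = str(next_id)
            else:
                # later references only point back at the first unit
                unit.set("translate", "no")
                ET.SubElement(source, "ph", id="x", xid=first_id)
            next_id += 1
    return body


body = build_groups(
    [("Hello World", ["src/MyDialog.cpp:23", "src/MyOtherDialog.cpp:12"])]
)
print(ET.tostring(body, encoding="unicode"))
```

For the single PO entry above this should produce the same nesting as the 
hand-written XLIFF fragment, wrapped in the assumed <body> element.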

> - Gettext's new msgctxt keyword was brought up before. Incidentally, the
>   <comment> element in Qt's own .ts files maps pretty well to it. There
>   is no standardized mapping for .xlf yet, though. I would pick up a
>   previously suggested approach and do it like that:
>
>       <trans-unit>
>         <source>foobar</source>
>         <target>irgendwas</target>
>         <context-group purpose="match information">
>           <context context-type="x-gettext-msgctxt"
>                    match-mandatory="yes">some context info</context>
>         </context-group>
>       </trans-unit>
>
>   For plural forms, the context would be attached to the plural group.
>   The exact value for purpose= is not clear to me - the values suggested
>   seem to refer to TM only. I think I would simply skip the purpose ...

Translation editors can e.g. display the context to the translator only 
if 'purpose' includes 'information', and hide it otherwise. Similarly, a TM 
processor can choose to perform additional 'context matching' based on 
the 'match' purpose value. This would e.g. be useful if you had two identical 
translation units with different contexts: the TM processor could then 
automatically find better matches based on these contexts.
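
To make the two uses of 'purpose' a bit more concrete, here is a minimal 
sketch. It assumes that 'purpose' holds a whitespace-separated list (as in 
the quoted example) and that the fragment carries no XML namespace; the 
helper names contexts_for and tm_key are made up for illustration:

```python
import xml.etree.ElementTree as ET


def contexts_for(unit, purpose):
    """Return the text of every <context> whose group lists this purpose."""
    found = []
    for group in unit.findall("context-group"):
        purposes = group.get("purpose", "").split()
        if purpose in purposes:
            found.extend(ctx.text or "" for ctx in group.findall("context"))
    return found


def tm_key(unit):
    """Lookup key for a TM: source text plus all 'match' contexts."""
    return (unit.findtext("source"), tuple(contexts_for(unit, "match")))


UNIT = (
    '<trans-unit id="{id}"><source>Open</source>'
    '<context-group purpose="match information">'
    '<context context-type="x-gettext-msgctxt">{ctx}</context>'
    "</context-group></trans-unit>"
)
menu = ET.fromstring(UNIT.format(id="1", ctx="menu"))
button = ET.fromstring(UNIT.format(id="2", ctx="button"))

# An editor would show these to the translator:
print(contexts_for(menu, "information"))
# A TM processor would treat the two units as distinct entries:
print(tm_key(menu) != tm_key(button))
```

An editor would simply skip any context-group that does not list 
'information', while the TM key keeps identical sources with different 
contexts apart.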

> - .ts files know a <context> element. I consider it stronger than msgctxt:
>   it is not optional; every message is in a context. Therefore I would map
>   it to nested groups:
>
>       <group restype="x-trolltech-ts-context">
>         <context-group purpose="match information">
>           <context context-type="x-trolltech-ts-context"
>                    match-mandatory="yes">the context</context>
>         </context-group>
>         <trans-unit .../>
>       </group>
>
>   FWIW, the mapping to PO would be via a magic extracted comment:
>   #. ts:context <the context>

This sounds sensible to me. 

> - As the repr. guide says, .po files do not encode the (target) language.
>   Therefore I would add an X-Language: header to the initial msgstr. It
>   would be implanted and extracted during conversion. When converting from
>   an .xlf file which does not have a first message that seems to be a .po
>   file header, a message would be generated and marked with
>   X-Virgin-Header:; if this header is found on converting back, the
>   message would be zapped.

Not sure I understand the use-case for this. 

> - Gettext's #| msgid (previous source in fuzzy translation) would be mapped
>   to <alt-trans> elements as suggested on this list before: Each previous
>   source is tacked onto a current source. If more previous sources than
>   current sources exist (plural to singular "downgrade"), the source gets
>   two alt-trans elements, the second one with an empty target marked with
>   restype="x-dummy".
> - Gettext's #| msgctxt would get mapped just like msgctxt, only that the
>   context-type would be x-gettext-previous-msgctxt.
>
> - Contrary to the guide, I would store obsolete messages, marking the
>   <trans-unit> resp. the containing plural <group> with translate="no".
>   I see no harm in doing this and it yields a more faithful conversion.
>   The messages would go into a <file> with the imaginary original name
>   Obsolete_PO_entries.

I'm not sure we really need to go to this extent. I guess it's more a 
design question of whether XLIFF is meant to be a replacement for every 
feature a format supports, or rather an extraction format. E.g. obsolete 
entries in PO are a way of storing translations that were used in previous 
versions of the project but are no longer used (they may pop up again in 
later versions of the project, which is why they are stored). XLIFF was not 
intended to be a storage container for these (I guess TMs replace this 
functionality), and I'm not sure whether trying to mold XLIFF into such a 
storage container would break processing tools (wrong statistics, word 
counts, file counts etc.).

> - The guide does not specify how to map fuzzy plurals. I guess one should
>   require approval of all <trans-unit>s in the <group> for non-fuzziness.

Yes, this is a design-limitation of the current XLIFF specification. This 
approach sounds reasonable to me.
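
A minimal sketch of that rule, assuming the plural forms are sibling 
<trans-unit> elements inside one <group> and that approval is recorded with 
approved="yes" (the function name plural_group_is_fuzzy and the sample 
restype value are only for illustration):

```python
import xml.etree.ElementTree as ET


def plural_group_is_fuzzy(group):
    """A plural group counts as fuzzy unless every trans-unit is approved.

    A missing 'approved' attribute is treated as not approved. A group
    with no trans-units at all is reported as not fuzzy.
    """
    units = group.findall("trans-unit")
    return not all(u.get("approved") == "yes" for u in units)


group = ET.fromstring(
    '<group restype="x-gettext-plurals">'
    '<trans-unit id="1[0]" approved="yes"><source>%d file</source></trans-unit>'
    '<trans-unit id="1[1]"><source>%d files</source></trans-unit>'
    "</group>"
)
print(plural_group_is_fuzzy(group))  # the second form is not approved
```

When converting back to PO, the entry would get the fuzzy flag whenever this 
check is true for its group.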

> Does this sound OK?
> TIA for any input.

Again, these are my own thoughts, and I'm sure it's healthy to have an open 
discussion on these issues; that's what the comments list is for :) You've 
certainly highlighted a few issues we need to consider more in the next 
version of XLIFF.

cheers,
asgeir
Follow-Ups:
- Re: [xliff-comment] XLIFF vs. PO vs. Trolltech
  - From: Oswald Buddenhagen <oswald.buddenhagen@trolltech.de>
References:
- XLIFF vs. PO vs. Trolltech
  - From: Oswald Buddenhagen <oswald.buddenhagen@trolltech.de>