OASIS Mailing List Archives

xliff message


Subject: Re: [xliff] XLIFF 2.0 example files for segmentation


Yves and Rodolfo have already commented, but I see another issue with the example. It is very problematic from a processing standpoint: there is no reliable way to go from the modified version you show (your third XML example) to the expected output (the fourth), because the IDs have changed and remapping them would require linguistic processing based on the content. The only way around it that I see would be to allow nested segments, which is a mess (more below).
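
To make the ID problem concrete, here is a minimal Python sketch. It assumes each file has been reduced to a hypothetical mapping of unit ID to a list of (segment ID, source text) pairs; the helper names `reused_ids` and `added_ids` are illustrative only and are not part of XLIFF or of any tool. It shows that after resegmentation the same ID can point at different content while new IDs have no counterpart, so IDs alone cannot drive the merge back.

```python
# Hypothetical reduction of the block-segmented file:
# unit ID -> list of (segment ID, source text)
original = {
    "string1": [("1", "First sentence.\nSecond sentence.\nThird\nsentence.")],
}

# The same unit after a tool has resegmented it by sentence.
resegmented = {
    "string1": [
        ("1", "First sentence."),
        ("2", "\n"),
        ("3", "Second sentence."),
        ("4", "\n"),
        ("5", "Third\nsentence."),
    ],
}


def reused_ids(before, after):
    """Return (unit ID, segment ID) pairs whose ID survives but whose source text differs."""
    changed = []
    for unit_id, segments in before.items():
        new_text = dict(after.get(unit_id, []))
        for seg_id, text in segments:
            if seg_id in new_text and new_text[seg_id] != text:
                changed.append((unit_id, seg_id))
    return changed


def added_ids(before, after):
    """Return (unit ID, segment ID) pairs that exist only in the resegmented file."""
    added = []
    for unit_id, segments in after.items():
        known = {seg_id for seg_id, _ in before.get(unit_id, [])}
        added.extend((unit_id, seg_id) for seg_id, _ in segments if seg_id not in known)
    return added


if __name__ == "__main__":
    print(reused_ids(original, resegmented))
    print(added_ids(original, resegmented))
```

Unit IDs still line up in this sketch; it is only at the segment level that the correspondence is lost.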

But aside from the problem that the IDs and structures no longer match, would a tool really need to do that at all? I suppose if a tool used XLIFF as its native, internal data format, it might be needed. But I don’t believe most tools will do this. Instead they would take your XML example number one and use it to extract something like the following:

Content piece #1:
      First sentence.\nSecond sentence.\nThird\nsentence.

Content piece #2:
      The user {$1}{$2}{$3} deleted file {$4}.  File {$5} cannot be recovered.

The tool would convert these into its own internal segments using its own algorithm when processing them against the TM. Then the tool would reverse the filtering process and put the content back into the original XLIFF structure. I'd no more expect the tool to tinker with the segmentation in the XLIFF file than I would expect the tool to tinker with the segmentation in an InDesign or Word file. What it does internally is up to it, but it should not be in the business of creating some other XLIFF file with different segmentation. The exception would be if a tool were called upon specifically for the job of segmenting an XLIFF file for processing in some tool that doesn't handle its own segmentation, but I suspect that is not a common task.

So I think the further segmentation you are talking about is something that the tool would handle on its own and for which it would not require support in the XLIFF format.
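
As a rough illustration of that workflow, the following Python sketch reads a block-segmented file, splits the text into sentences internally, looks each piece up, and writes one target back into the untouched segment. Everything in it is an assumption made for the example: the `TM` dictionary stands in for a real translation memory, the regular expression is a naive sentence splitter, the function names are hypothetical, and the content is assumed to be plain text without inline codes.

```python
import re
import xml.etree.ElementTree as ET

# Block-segmented input, simplified from the first example (plain text only).
XLIFF = """<xliff version="2.0">
  <file srclang="en" original="test.properties">
    <unit id="string1">
      <segment id="1">
        <source>First sentence.\nSecond sentence.\nThird\nsentence.</source>
      </segment>
    </unit>
  </file>
</xliff>"""

# Hypothetical stand-in for a translation memory lookup.
TM = {
    "First sentence.": "Primera frase.",
    "Second sentence.": "La segunda frase.",
    "Third\nsentence.": "La tercera\nfrase.",
}


def split_internally(text):
    """Naive internal segmentation: split on whitespace that follows a full
    stop and precedes a capital letter, keeping the whitespace as its own piece."""
    return re.split(r"(?<=\.)(\s+)(?=[A-Z])", text)


def translate_text(text):
    """Translate piece by piece, then rejoin so the result is one block again."""
    return "".join(TM.get(piece, piece) for piece in split_internally(text))


def translate_in_place(xliff_text):
    """Add a <target> to every <segment> without changing units, segments or IDs."""
    root = ET.fromstring(xliff_text)
    for segment in root.iter("segment"):
        source = segment.find("source")
        target = ET.SubElement(segment, "target")
        target.text = translate_text(source.text or "")
    return root


if __name__ == "__main__":
    print(ET.tostring(translate_in_place(XLIFF), encoding="unicode"))
```

The internal pieces never reach the file, so the segment structure and IDs that come back are the ones that went out.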

If we assume that the tool doesn’t take this approach of creating a derivative internal representation for internal processing while leaving the XLIFF file intact, then we'd have to invoke much more complicated processes and structures in the XLIFF itself, like supporting nested segmentation, e.g.:

file
|
+--unit("string1")
|  |
|  +--segment ("tool1-1")
|     |
|     +--segment ("tool2-1")
|     |  |
|     |  +--source: "First sentence"
|     |  |
|     |  +--target: "Primera frase"
|     |
|     +--ignorable ("tool2-2")
|     |  |
|     |  +--source: "\n"
|     |  |
|     |  +--target: "\n"
|     |
|     +--segment ("tool2-3")
|     |  |
|     |  +--source: "Second sentence"
|     |  |
|     |  +--target: "La segunda frase"
|     |
|     +--ignorable ("tool2-4")
|     |  |
|     |  +--source: "\n"
|     |  |
|     |  +--target: "\n"
|     |
|     +--segment ("tool2-5")
|        |
|        +--source: "Third\nsentence"
|        |
|        +--target: "La tercera\nfrase"
|  
+--unit("string2")
   |
   +--segment ("tool1-2")
      |
      +--segment ("tool2-1")
      |  |
      |  +--source: "The user {$1}{$2}{$3} deleted file {$4}."
      |  |
      |  +--target: "El {$1}{$2}{$3} usuario eliminado {$4} archivo."
      |
      +--segment ("tool2-2")
         |
         +--source: "File {$5} cannot be recovered."
         |
         +--target: "{$5} archivo no se puede recuperar."


So your example, to work and do what you want, would require nested segments (or a virtual equivalent like <segment id="1" virtual-id="1"> that would be similarly messy). If we did this, then each tool would be able to recreate its own segmentation from the file by using tool-specific IDs. But that is a level of complexity that I don't see the need for.
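
To give a feel for what tool-specific IDs would demand of every reader, here is a small Python sketch of the nested structure for the first unit of the tree. The `Node` class, the `view` function and the convention of selecting a level by ID prefix are all assumptions made for illustration; nothing like this exists in the XLIFF drafts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str              # "segment" or "ignorable"
    node_id: str           # tool-specific ID such as "tool1-1" or "tool2-3"
    source: str = ""
    target: str = ""
    children: List["Node"] = field(default_factory=list)


def text_of(node, attr):
    """Text of a node: its own text if it is a leaf, else its children joined in order."""
    if node.children:
        return "".join(text_of(child, attr) for child in node.children)
    return getattr(node, attr)


def view(nodes, prefix):
    """Return (ID, source, target) for the outermost nodes whose ID starts with prefix."""
    found = []
    for node in nodes:
        if node.node_id.startswith(prefix):
            found.append((node.node_id, text_of(node, "source"), text_of(node, "target")))
        else:
            found.extend(view(node.children, prefix))
    return found


# The first unit from the tree: one segment from tool 1 wrapping five pieces from tool 2.
string1 = Node("segment", "tool1-1", children=[
    Node("segment", "tool2-1", "First sentence", "Primera frase"),
    Node("ignorable", "tool2-2", "\n", "\n"),
    Node("segment", "tool2-3", "Second sentence", "La segunda frase"),
    Node("ignorable", "tool2-4", "\n", "\n"),
    Node("segment", "tool2-5", "Third\nsentence", "La tercera\nfrase"),
])

if __name__ == "__main__":
    print(view([string1], "tool1-"))
    print(view([string1], "tool2-"))
```

Even in this toy form, every consumer has to know which level is its own and how to recompose the others.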

I have to admit that I'm a bit confused by the example and the responses. <segment> itself may be very useful, but if tools start playing around with <segment>s as in your example, I think it will lead to all sorts of problems. I would expect <segment>s to remain immutable once the file that contains them has been created; otherwise the ability to roundtrip the data runs a real risk of being broken. The only obvious way I see around that is to create a nested structure of some sort, and I see that as a real problem. But in the end, is this a realistic scenario? Admittedly, I don't know all tools, but I don't see it as representative of the tools that I do know.

Maybe someone else will see some way around this, however.

-Arle

On Nov 9, 2011, at 10:52, David Walters wrote:

It is easier for me to understand the situation if I have an example to reference.

Here is a simple Java PropertyResourceBundle file to be used as the original source file.


    string1=First sentence.\nSecond sentence.\nThird\nsentence.
    string2=The user <b>{0}</b> deleted file {1}.  File {1} cannot be recovered.


A product developer might create an extraction program to create this XLIFF 2.0 file.
    Note: I included a "segment" attribute to document how the <source> text was "segmented".

    Without <segment>.
      <?xml version="1.0" encoding="utf-8"?>
      <xliff version="2.0" segment="block">
        <file srclang="en" original="test.properties">
          <unit id="string1">
            <source>First sentence.
      Second sentence.
      Third
      sentence.</source>
          </unit>
          <unit id="string2">
            <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
          </unit>
        </file>
      </xliff>

    With <segment>.
      <?xml version="1.0" encoding="utf-8"?>
      <xliff version="2.0" segment="block">
        <file srclang="en" original="test.properties">
          <unit id="string1">
            <segment id="1">
              <source>First sentence.
        Second sentence.
        Third
        sentence.</source>
            </segment>
          </unit>
          <unit id="string2">
            <segment id="1">
              <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
            </segment>
          </unit>
        </file>
      </xliff>


A translation tool may modify the file to segment the text based on sentences.  The translated file might be the following:

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="2.0" segment="sentence">
      <file srclang="en" tgtlang="es" original="test.properties">
        <unit id="string1">
          <segment id="1">
            <source>First sentence.</source>
            <target>Primera frase.</target>
          </segment>
          <ignorable id="2">
            <source>
    </source>
            <target>
    </target>
          </ignorable>
          <segment id="3">
            <source>Second sentence.</source>
            <target>La segunda frase.</target>
          </segment>
          <ignorable id="4">
            <source>
    </source>
            <target>
    </target>
          </ignorable>
          <segment id="5">
            <source>Third
    sentence.</source>
            <target>La tercera
    frase.</target>
          </segment>
        </unit>
        <unit id="string2">
          <segment id="1">
            <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.</source>
            <target>El <pc id="1"><ph id="2"/></pc> usuario eliminado <ph id="3"/> archivo.</target>
          </segment>
          <ignorable id="2">
            <source>  </source>
            <target> </target>
          </ignorable>
          <segment id="3">
            <source>File <ph id="3"/> cannot be recovered.</source>
            <target><ph id="3"/> archivo no se puede recuperar.</target>
          </segment>
        </unit>
      </file>
    </xliff>

After translation, the product developer would expect to get back the following file, which maps to the version of the XLIFF file that was sent out for translation.


    Without <segment>.
      <?xml version="1.0" encoding="utf-8"?>
      <xliff version="2.0" segment="block">
        <file srclang="en" tgtlang="es" original="test.properties">
          <unit id="string1">
            <source>First sentence.
      Second sentence.
      Third
      sentence.</source>
            <target>Primera frase.
      La segunda frase.
      La tercera
      frase.</target>
          </unit>
          <unit id="string2">
            <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
            <target>El <pc id="1"><ph id="2"/></pc> usuario eliminado <ph id="3"/> archivo. <ph id="3"/> archivo no se puede recuperar.</target>
          </unit>
        </file>
      </xliff>

    With <segment>.
      <?xml version="1.0" encoding="utf-8"?>
      <xliff version="2.0" segment="block">
        <file srclang="en" tgtlang="es" original="test.properties">
          <unit id="string1">
            <segment id="1">
              <source>First sentence.
      Second sentence.
      Third
      sentence.</source>
              <target>Primera frase.
      La segunda frase.
      La tercera
      frase.</target>
            </segment>
          </unit>
          <unit id="string2">
            <segment id="1">
              <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
              <target>El <pc id="1"><ph id="2"/></pc> usuario eliminado <ph id="3"/> archivo. <ph id="3"/> archivo no se puede recuperar.</target>
            </segment>
          </unit>
        </file>
      </xliff>
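
For the simple case where every unit went out with a single segment, the step from the sentence-segmented file back to the block form could be sketched in Python as below. This is only an illustration under stated assumptions: the content is plain text without inline codes, `merge_unit` is a hypothetical helper, and the merged segment is simply given the ID "1". It does not address units that were sent out with more than one segment.

```python
import xml.etree.ElementTree as ET

# Sentence-segmented unit, simplified from the translated example (plain text only).
SEGMENTED_UNIT = (
    '<unit id="string1">'
    '<segment id="1"><source>First sentence.</source><target>Primera frase.</target></segment>'
    '<ignorable id="2"><source>\n</source><target>\n</target></ignorable>'
    '<segment id="3"><source>Second sentence.</source><target>La segunda frase.</target></segment>'
    '<ignorable id="4"><source>\n</source><target>\n</target></ignorable>'
    '<segment id="5"><source>Third\nsentence.</source><target>La tercera\nfrase.</target></segment>'
    '</unit>'
)


def merge_unit(unit):
    """Join all <segment> and <ignorable> children, in document order, into one <segment>."""
    parts = [child for child in unit if child.tag in ("segment", "ignorable")]
    merged = ET.Element("unit", dict(unit.attrib))
    segment = ET.SubElement(merged, "segment", {"id": "1"})
    ET.SubElement(segment, "source").text = "".join(
        part.findtext("source", default="") for part in parts
    )
    ET.SubElement(segment, "target").text = "".join(
        part.findtext("target", default="") for part in parts
    )
    return merged


if __name__ == "__main__":
    unit = ET.fromstring(SEGMENTED_UNIT)
    print(ET.tostring(merge_unit(unit), encoding="unicode"))
```
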

Are these realistic examples?

David

Corporate Globalization Tool Development
EMail:  waltersd@us.ibm.com          
Phone: (507) 253-7278,   T/L:553-7278,   Fax: (507) 253-1721

CHKPII:                    http://w3-03.ibm.com/globalization/page/2011
TM file formats:     http://w3-03.ibm.com/globalization/page/2083
TM markups:         http://w3-03.ibm.com/globalization/page/2071


