
dita-translation message



Subject: DITA SC Agenda Monday 12 June 2006


Agenda for Monday 12 June 2006

Note: Please be certain to review the ITS documents from Yves Savourel before the meeting tomorrow.

11:00 am - 12:00 pm Eastern Standard Time (-5 GMT)

DITA Translation Subcommittee teleconference

USA Toll Free Number: 866-566-4838

USA Toll Number: +1-210-280-1707

PASSCODE: 185771

Roll Call

Approve Minutes from 5 June 2006 (enclosed for those who are not TC members)

http://www.oasis-open.org/apps/org/workgroup/dita-translation/

Returning Business:

1) Discussion item from Yves Savourel

As you may know, the W3C has recently published the Last Call Working Draft for ITS (See [1]) as well as the First Working Draft of a companion document: "Best Practices for XML Internationalization" (See [2]).

[1] <http://www.w3.org/TR/2006/WD-its-20060518/>

[2] <http://www.w3.org/TR/2006/WD-xml-i18n-bp-20060518/>

The second document includes examples of how to use ITS with a few specific document types (See the "ITS Applied to Existing Formats" section). In the next draft we would like to include DITA in that list.

The attached file is a first try at the possible default rules to process DITA with ITS. We would appreciate it very much if some of you had the time to review it and make sure we have not made any mistakes or forgotten anything. For example, I'm not sure if the dir attribute should be there or not. I'm also not sure if we have all subflow elements listed. Maybe we need two rule sets: one for the current version of DITA and one for the upcoming one (although if there is no conflict and a single rule set could be used, that would be better).

The specification document [1] should help you understand each element in these rules. The Last Call review for the specification ends on June 30. The Best Practices document will still go through several drafts.
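For those who have not yet opened Yves's attachment, here is a rough sketch of what a default ITS rule set for DITA might look like. The rule syntax follows the ITS draft; the element selections are illustrative guesses, not the contents of the attached file:

<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
  <!-- illustrative only: code samples are usually not translated -->
  <its:translateRule selector="//codeblock | //codeph" translate="no"/>
  <!-- typical DITA inline (subflow) elements -->
  <its:withinTextRule selector="//ph | //b | //i | //uicontrol | //term"
                      withinText="yes"/>
  <!-- footnotes are independent subflows nested in the running text -->
  <its:withinTextRule selector="//fn" withinText="nested"/>
  <!-- the dir attribute Yves mentions, if it is kept in the rules -->
  <its:dirRule selector="//*[@dir='rtl']" dir="rtl"/>
</its:rules>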

2) Discuss Gershon Joseph's draft of the best practice for legacy TM

Attached to this email for non-TC members.
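As background for the draft: legacy TM is normally moved between tools in TMX, the industry interchange format. A minimal, hypothetical TMX file segmented at the sentence level (all names and values invented) looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <!-- header values here are placeholders -->
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="xml"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="en"><seg>Press the Start button.</seg></tuv>
      <tuv xml:lang="fr"><seg>Appuyez sur le bouton Start.</seg></tuv>
    </tu>
  </body>
</tmx>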

3) Management of conref blocks for translations

Standardized (boilerplate) text is often kept in one or more .dita files used as a source for conrefs across a document set.

All boilerplate content for a language must be stand-alone. Boilerplate text must consist of stand-alone phrases, because a fragment that depends on its surrounding text may not translate correctly into some languages.

A conref creates a dependency on its target: the conref target must be translated before the parent document that references it.

Conreffing to an inline element may result in a phrase that is badly translated with respect to its surrounding content, so we should probably recommend against it. Examples: singular/plural agreement, prepositions, and acronyms, e.g. ABS (antilock braking system); if you conref to the text itself, the translated text may not read correctly.

Action Item: Andrzej will provide examples to the group for discussion.
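To make the discussion concrete, here is a hypothetical sketch of the two patterns (file and id names invented, not from Andrzej's examples):

<!-- boilerplate.dita: each reusable unit is complete and stand-alone -->
<topic id="boilerplate">
  <title>Boilerplate text</title>
  <body>
    <note id="shock" type="caution">Disconnect the power supply before
    opening the cover.</note>
    <p><ph id="abs">antilock braking system (ABS)</ph></p>
  </body>
</topic>

<!-- safe: the referencing topic pulls in the whole stand-alone note -->
<note conref="boilerplate.dita#boilerplate/shock"/>

<!-- risky: conref to an inline phrase inside a running sentence; the
     inserted words may not agree with the translated sentence around them -->
<p>Check the <ph conref="boilerplate.dita#boilerplate/abs"/> before driving.</p>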

4) XLIFF transforms

Discuss plans for Rodolfo's tests of the XLIFF transforms and their possible release as open source. Ask if there is a proposed date.

Andrzej and Rodolfo have successfully converted DITA to XLIFF and back. Rodolfo plans to publish their converter as open source.
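As a reminder of what such a round trip produces, here is a hypothetical XLIFF 1.1 fragment for a single DITA paragraph (file names, languages, and translation invented; not Rodolfo's actual converter output):

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.1" xmlns="urn:oasis:names:tc:xliff:document:1.1">
  <file original="sample.dita" source-language="en-US"
        target-language="es-ES" datatype="xml">
    <body>
      <!-- one DITA <p>; the inline <b> is carried as a paired <g> tag
           so it can be restored when converting back to DITA -->
      <trans-unit id="p1">
        <source>Press the <g id="g1" ctype="bold">Start</g> button.</source>
        <target>Pulse el botón <g id="g1" ctype="bold">Start</g>.</target>
      </trans-unit>
    </body>
  </file>
</xliff>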

New Business:

5) Handling multi-language documents

Charles Pau and others to provide examples to the list for discussion

 
 

JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver CO 80215
303-232-7586
joann.hackos@comtech-serv.com

 

 
--- Begin Message ---
Title: [dita-translation] DITA Translation Subcommittee Meeting Minutes: 5 June 2006


Best Regards,
Gershon

---
Gershon L Joseph
Member, OASIS DITA and DocBook Technical Committees
Director of Technology and Single Sourcing
Tech-Tav Documentation Ltd.
office: +972-8-974-1569
mobile: +972-57-314-1170
http://www.tech-tav.com

DITA Translation Subcommittee Meeting Minutes: 5 June 2006

(Recorded by Gershon Joseph <gershon@tech-tav.com>)

The DITA Translation Subcommittee met on Monday, 5 June 2006 at 08:00am PT
for 60 minutes.

1.  Roll call

    Present: Kevin Farwell, JoAnn Hackos, Gershon Joseph, Charles Pau, Rodolfo 
             Raya, Felix Sasaki, Yves Savourel, David Walters, Andrzej Zydron,
             Kara Warburton

    Regrets: Don Day

2.  Accepted the minutes of the previous meeting.
    http://lists.oasis-open.org/archives/dita-translation/200605/msg00016.html
    Moved by Rodolfo, seconded by Yves, no objections.

3.  Returning Business:

3.1 Discussion item from Yves Savourel

    "As you may know, the W3C has recently published the Last Call Working Draft 
    for ITS (See [1]) as well as the First Working Draft of a companion 
    document: "Best Practices for XML Internationalization" (See [2]).
    [1] http://www.w3.org/TR/2006/WD-its-20060518/
    [2] http://www.w3.org/TR/2006/WD-xml-i18n-bp-20060518/

    The second document includes examples of how to use ITS with a few specific 
    document types (See the "ITS Applied to Existing Formats" section). In the 
    next draft we would like to include DITA in that list.
    
    The attached file is a first try at the possible default rules to process DITA 
    with ITS. We would appreciate it very much if some of you had the time to 
    review it and make sure we have not made any mistakes, or forgotten anything. 
    For example, I'm not sure if the dir attribute should be there or not. 
    I'm also not sure if we have all subflow elements listed. Maybe we need 
    two rule sets: one for the current version of DITA and one for the upcoming 
    one (although if there is no conflict and a single rule set could be used 
    that would be better).

    The specification document [1] should help you understand each element in 
    these rules. The Last Call review for the specification ends on June 30. 
    The Best Practices document will still go through several drafts."

    ACTION for everyone to review the ITS proposals for discussion next week.

3.2 Discussion item from Andrzej Zydron

    "LISA OSCAR's latest standard GMX/V (Global Information Metrics eXchange
    - Volume) has been approved and is going through its final public comment 
    phase. GMX/V tackles the issue of word and character counts and how to 
    exchange localization volume information via an XML vocabulary. 

    GMX/V finally provides a verifiable industry standard for word and 
    character counts. GMX/V mandates XLIFF as the canonical form for word 
    and character counts.

    GMX/V can be viewed at the following location:
    http://www.lisa.org/standards/gmx/GMX-V.html

    Localization tool providers have been consulted and have contributed to 
    this standard. We would appreciate your views/comments on GMX/V."

    Andrzej gave an overview of the standard and background, and requested
    SC members review the standard.

4.  New Business:

    Decide the Best Practices that we need to consider.

    1)  Possibly how to maximize usage of conref (reusable blocks)...

        From Nancy Harrison:
        "Boilerplate text is often kept in one or more .dita files used as a 
        source for conrefs across a document set. How should authors / 
        implementers / processors deal with multiple sets of boilerplate files 
        automatically?  DocBook names every file containing generated text 
        with a language extension (two letter only), including English.  A 
        similar scheme, but probably with locale, not just country, would work 
        for DITA documents as well."

        Andrzej: All boilerplate content for a language must be stand-alone.
            Boilerplate text must be stand-alone phrases to avoid problems translating
            it into some languages, where it does not fit into the surrounding text.

        ACTION: Charles will provide an example of typical boilerplate fragments

        JoAnn: What about a conref to non-boilerplate text? How would this
            affect the translation workflow?
        Andrzej: Dependency on the conref target, which would need to be 
            translated before the parent document that refers to the conref 
            is translated. Again, conreffing to an inline element may result 
            in a badly translated phrase with respect to its surrounding content, 
            so we should probably be against this. Examples: singular/plural,
            prepositions, acronyms, e.g. ABS (antilock braking system); if 
            you conref to the text itself, the translated text may not read 
            correctly.
        ACTION: Andrzej to send examples to the group for discussion.

    2)  Handling multi-language documents
        [we did not discuss this further this week, but some members did send
        examples to the list for discussion on-list and at next week's meeting]

    3)  Not a best practice, but the DITA to XLIFF and back mechanism needs to 
        be completed.

        Andrzej and Rodolfo have successfully converted DITA to XLIFF and back.
        Rodolfo plans to publish their converter as open source.

    4)  Gershon: what's the best practice for translations for users who move 
            from a legacy documentation system to DITA?

        Andrzej: It should still be possible to run against the previous TM.
            Inlines may not match, or may fuzzy match. As long as memories are
            aligned at the sentence level, it should work (at least leverage
            matching).

        Kevin confirmed that using TM as-is will give you 10-20% less matching 
            than if you tweak the XLIFF to better match the DITA (see the 
            sketch after this item).

        Rodolfo: A good TM engine should help you recover 70% of the inline
            tags, which is the main problem.
        
        Kevin: so long as they're matched tags; however conditional text marked 
            up in legacy tools (e.g. FrameMaker) will only be fuzzy matched 
            (at best).

        ACTION: Gershon to write a draft proposal (with Rodolfo) and submit it 
            to the list for input and technical assistance.
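
        An illustrative sketch of the kind of XLIFF tweak discussed above
        (an XSLT 1.0 identity transform; the ctype mapping is invented for
        illustration, not a recommendation from the meeting):

        <?xml version="1.0" encoding="UTF-8"?>
        <xsl:stylesheet version="1.0"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:xlf="urn:oasis:names:tc:xliff:document:1.1">
          <!-- copy everything unchanged by default -->
          <xsl:template match="@*|node()">
            <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
          </xsl:template>
          <!-- example tweak: a FrameMaker character style exported as a
               custom inline type is renamed to plain italic so it lines
               up with DITA's <i> markup -->
          <xsl:template match="xlf:g/@ctype[. = 'x-fm-Emphasis']">
            <xsl:attribute name="ctype">italic</xsl:attribute>
          </xsl:template>
        </xsl:stylesheet>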
       
Meeting adjourned at 09:00am PT.

---
--- End Message ---
--- Begin Message ---
Title: First try at the legacy TM best practice

Hi Rodolfo,

Here's what I've come up with. Please add the missing information. If you
prefer to discuss by phone, I've sent you a request to add you as a Skype
contact. My time zone is 7 hours ahead of New York time, or 10 hours ahead
of San Francisco time. I'm not sure how available I'll be today due to other
conference calls I've scheduled, but I should be available tomorrow until at
least 17:00 my time, later if needed.

Best Regards,
Gershon

---
Gershon L Joseph
Member, OASIS DITA and DocBook Technical Committees
Director of Technology and Single Sourcing
Tech-Tav Documentation Ltd.
office: +972-8-974-1569
mobile: +972-57-314-1170
http://www.tech-tav.com

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd";>
<article>
  <title>Best Practice for Leveraging Legacy Translation Memory when Migrating
  to DITA</title>

  <section>
    <title>Statement of Problem</title>

    <para>Many organizations have previously translated content that was
    authored in non-DITA tools (such as Word and FrameMaker). When migrating
    their legacy content into the new DITA authoring environment, what does
    the organization do about their legacy translation memory? This legacy
    translation memory (TM) was created with large financial investment that
    can't easily be thrown away simply because a new authoring architecture is
    being adopted.</para>

    <para>This article describes best practices that will help organizations
    to use their legacy TM for future translation projects that are authored in
    DITA, in order to minimize the expense of translating DITA-based
    content.</para>
  </section>

  <section>
    <title>Recommended Best Practices</title>

    <para>If you keep the following points in mind, you should be able to
    maximize your existing translation memory when you send your DITA
    documents for translation:</para>

    <itemizedlist>
      <listitem>
        <para>Ensure your translation service provider uses a tool that
        supports TMX [is this correct?]. This will ensure you can migrate your
        TM between TM tools that support the industry standard for TM
        interchange. This is required not only to free you from dependence on
        a single translation service provider, but also to allow you to tweak
        your TM to better match the DITA-based XML source documents you'll be
        sending for translation.</para>
      </listitem>

      <listitem>
        <para>Provided the structure of the DITA-based content has not changed
        radically compared to the legacy documents, the TM software should
        fully match most block elements. As long as the legacy TM aligns with
        the DITA source at the sentence level, the translation software should
        be able to fully leverage matching.</para>
      </listitem>

      <listitem>
        <para>Inline elements may not match at all, or may only fuzzy match.
        If the TM is preprocessed to prepare it for the DITA-based translation
        project, then inline elements should fully match. Note that a good TM
        engine should help you recover 70% of the inline tags, which is the
        main area where matching is prone to fail.</para>
      </listitem>

      <listitem>
        <para>If text entities and/or conrefs are used as containers for
        reusable text (and they should be!), then these items may not fully
        match (only fuzzy match). However, since each of these items needs to
        be translated only once, and should at least fuzzy match, it should
        not result in significant translation expense.</para>
      </listitem>

      <listitem>
        <para>If you tweak the XLIFF exported from the legacy TM (to better
        align it with the new DITA content), you should realize an improvement
        of 10-20% on TM matching. Whether it's worth the effort and expense in
        doing this depends on the size of the DITA documents to be translated.
        The idea is [Rodolfo, please correct me if I'm wrong!] to export the
        TM to XLIFF, process the XLIFF (usually via XSLT) to better align it
        with your DITA content, and then import the XLIFF back into your TM.
        Thus, your TM will now be better aligned with your DITA content, which
        will result in more accurate matching.</para>
      </listitem>

      <listitem>
        <para>One area where things can go wrong is if the legacy content does
        not use matched tags. Since XML uses matched tags, elements may not be
        accurately matched. This is particularly true of conditional text
        marked up in legacy tools (e.g. FrameMaker), where you can expect only
        fuzzy matching at best, or no matching at worst. Depending on how much
        conditional content the legacy source documents contain, it may be
        worth preprocessing the TM to ensure all conditional tags are paired.
        [can this be automated, or would a human have to go through the TM or
        XLIFF to close the tags? What should we suggest they do here to
        resolve the issue?]</para>
      </listitem>
    </itemizedlist>

    <para>[Rodolfo, is there anything I've missed or anything else we should
    add?]</para>
  </section>
</article>

--- End Message ---

