OASIS Mailing List Archives

 



dita-translation message



Subject: DITA Translation Subcommittee Agenda for 17 July 2006


Agenda for Monday 17 July 2006

11:00 am - 12:00 pm Eastern Standard Time
DITA Technical Committee teleconference
USA Toll Free Number: 866-566-4838
USA Toll Number: +1-210-280-1707

PASSCODE: 185771

1) Roll Call

2) Accept Minutes from 10 July 2006 (enclosed for those who are not TC members)

http://www.oasis-open.org/apps/org/workgroup/dita-translation/

Attached for non-members.

3) Review open action items

 

4) Returning business:

4.1 Discuss submission of the xml:lang and the dir recommendations as Committee Drafts to the TC.

A Committee Draft is an OASIS deliverable so the publishing requirements are the same :-) Basically the output should resemble the OASIS templates - if we need to create the cover information separately and then prepend to both the HTML and PDF I don't see a problem with that, although having all the content in a single clump is optimum. (from Mary MacRae)

4.2 Discuss Gershon Joseph's next draft of the best practice for legacy TM

4.3 Discuss the paper on Rodolfo's website to be used as a source for best practice document.

4.4 Review the revision of the draft Best Practices for Indexing.

JoAnn's 2nd draft; not yet complete. Please add comments to the draft.

5) New business:

5.1 Handling multi-language documents

Charles Pau and others to provide examples to the list for discussion

5.2 Andrzej provided the group with examples of the need to change reusable building blocks, for discussion.

Outstanding action items

ACTION -- Rodolfo to prepare an outline for this best practice document for translating DITA and submit to list for discussion at a future meeting.

JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver CO 80215
303-232-7586
joann.hackos@comtech-serv.com

 

 
Best Practice for Leveraging Legacy Translation Memory when Migrating to DITA

Gershon Joseph

Tech-Tav Documentation Ltd.

Rodolfo Raya

Heartsome Holdings Pte. Ltd.

Statement of Problem

Many organizations have previously translated content that was authored in non-DITA tools (such as Word and FrameMaker). When they migrate this legacy content into a new DITA authoring environment, what should they do about their legacy translation memory (TM)? The legacy TM represents a large financial investment that should not be thrown away simply because a new authoring architecture is being adopted.

This article describes best practices that will help organizations to use their legacy TM for future translation projects that are authored in DITA, in order to minimize the expense of translating DITA-based content.

Terminology

Before we get into the details, let's define the terms used in the localization industry so that subsequent sections will be better understood.

CAT

Computer Aided Translation, which helps the translator translate the source content. CAT tools usually leverage Translation Memory to match sentences and inline phrases that were previously translated. In addition, some CAT tools use Machine Translation to translate glossary and other company-specific terms (extracted from a terminology database).

Matching

The level of accuracy with which CAT tools can match content being translated to the TM. The levels of matching are defined as follows:

Fuzzy matching

The source segment being matched is similar, but not identical to, the source language segment in the TM.

Leveraged matching

The source segment being matched is identical to the matched segment, but the context is not known.

Exact matching

The source segment being matched is identical to the matched segment and comes from exactly the same context.

MT

Machine Translation is a technology that translates content directly from source without human intervention. Used in isolation, MT usually generates an unusable translation. However, when integrated into a CAT tool to translate specific terminology, MT is a useful technology.

TM

Translation Memory is a technology that reuses translations previously stored in the database used by the translation tool. TM preserves the translation output for reuse with subsequent translations.

TMX

Translation Memory eXchange is an industry standard format for exchanging TM between CAT tools.

XLIFF

XML Localisation Interchange File Format is a document format used to exchange translatable content between CAT tools.
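
The three matching levels defined above can be illustrated with a minimal Python sketch. It is not how any particular CAT tool scores matches: the function name, the `context` strings, and the 0.75 similarity threshold are assumptions made for this illustration, and the standard-library `difflib` ratio stands in for whatever scoring a real tool uses.

```python
from difflib import SequenceMatcher

# Assumed cut-off for this sketch; real CAT tools use their own scoring.
FUZZY_THRESHOLD = 0.75


def classify_match(segment, context, tm_source, tm_context=None):
    """Classify one TM entry as 'exact', 'leveraged', 'fuzzy', or 'none'.

    segment / context  -- the source text being translated and where it occurs
    tm_source          -- the source-language text stored in the TM
    tm_context         -- the context stored with the TM entry, if any
    """
    if segment == tm_source:
        # Identical text is exact only when the stored context is known
        # and is the same; otherwise it is a leveraged match.
        if tm_context is not None and tm_context == context:
            return "exact"
        return "leveraged"
    # Similar but not identical text is a fuzzy match.
    similarity = SequenceMatcher(None, segment, tm_source).ratio()
    return "fuzzy" if similarity >= FUZZY_THRESHOLD else "none"
```

A legacy TM usually carries no DITA context, so in this sketch identical legacy segments come back as "leveraged" rather than "exact".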

Recommended Best Practices

If you keep the following points in mind, you should be able to maximize your existing translation memory when you send your DITA documents for translation:

  • Ensure your translation service provider uses a tool that supports TMX (Translation Memory eXchange). This will ensure you can migrate your TM between CAT tools that support the industry standard for TM interchange. This is important not only to free you from dependence on a single translation service provider, but also to allow you to fine-tune your segmentation rules to better match the DITA-based XML source documents you send for translation.

  • Provided the structure of the DITA-based content has not changed radically compared to the legacy documents, the CAT software should achieve exact matching on most segments in the TM. As long as the legacy TM aligns with the DITA source at the sentence level, the translation software should be able to achieve leveraged matching for the elements. Good CAT tools break the DITA block elements down into sentence-level segments, which will ensure better matching of the legacy TM. Usually, the DITA content is transformed into XLIFF, which can handle segments at the block or sentence level.

  • Inline elements may not match at all, or may only fuzzy match. If a CAT tool is used to preprocess the TM to prepare it for the DITA-based translation project, then inline elements should yield an exact match. Note that a good TM engine should help you recover 70% of the inline tags, which is the main area where matching is prone to fail.

  • If conrefs are used as containers for reusable text, then these items may not exactly match (only fuzzy match at best). However, since each of these items needs to be translated only once, and should at least fuzzy match, it should not result in significant translation expense. For best practices on using conref elements in DITA documents that need to be translated, please see XREF TO CONREF BEST PRACTICE.

  • When text entities are used as containers for reusable text, it is preferable to use a CAT tool that extracts translatable text from the XML files using an XML parser. The XML parser will insert the content of the text entities into the source text that the translator uses as a reference. This allows the translator to check that the translated segments flow correctly in the target language. If text entities are translated separately from the context where they are used, there may be grammatical inconsistencies in the final text when the translated DITA files are published.

  • You can export the legacy TM to a TMX file. The TMX file is an XML file, which can be manipulated to better align the translation segments with the DITA markup. The modified TMX file can then be converted back into a TM. This new TM will provide more exact matching against your DITA content than the legacy TM will.

    This process of creating a better-aligned TM should improve TM matching by 10-20%. Whether that is worth the effort and expense depends on the size of the DITA documents to be translated and the number of target languages. If the number of target languages is small, it may be more economical to retranslate fuzzy matches in a separate file. However, if the word count is high and there are many target languages, tuning the TM is likely to yield substantial translation savings.

    When tuning your legacy TM, take the following into account:

    • Unmatched tags — Unmatched tags can result from conditional text marked up in legacy tools (such as FrameMaker), or when block elements contain several sentences that share a common format marker (for example, a paragraph containing several sentences marked as bold; the first sentence contains only an opening bold tag, and the last sentence contains only a closing bold tag).

    • Segmentation rules — The segmentation rules used for translating legacy material may not be well suited for XML documents. For example, your legacy Word or FrameMaker-based segmentation rules may include a rule to terminate a segment after a colon, to separate a procedure title from the steps. Since DITA uses markup to indicate where the procedure title ends and the steps begin, this segmentation rule can be discarded.
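
As an illustration of the TMX tuning described above, the following Python sketch removes unmatched formatting tags from an exported TMX file. It assumes the export uses the TMX inline elements `<bpt>` and `<ept>`, paired through their `i` attribute, for opening and closing format markers. The sample data and the function names are hypothetical, and a real clean-up would need to be checked against the output of your own CAT tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: the first unit carries an opening bold tag whose
# closing tag ended up in a different segment; the second is well paired.
SAMPLE_TMX = """
<tmx version="1.4">
  <header creationtool="legacy-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="unknown" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg><bpt i="1">&lt;b&gt;</bpt>Stop the engine.</seg></tuv>
      <tuv xml:lang="es"><seg><bpt i="1">&lt;b&gt;</bpt>Pare el motor.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Press <bpt i="1">&lt;b&gt;</bpt>Start<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
      <tuv xml:lang="es"><seg>Pulse <bpt i="1">&lt;b&gt;</bpt>Inicio<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
    </tu>
  </body>
</tmx>
""".strip()


def drop_unpaired_tags(seg):
    """Remove <bpt>/<ept> children of one <seg> that have no partner in it."""
    opened = {e.get("i") for e in seg.findall("bpt")}
    closed = {e.get("i") for e in seg.findall("ept")}
    paired = opened & closed
    previous = None  # last child that was kept
    for child in list(seg):
        if child.tag in ("bpt", "ept") and child.get("i") not in paired:
            # Keep the text that follows the tag by moving it to the
            # preceding node before removing the tag itself.
            tail = child.tail or ""
            if previous is None:
                seg.text = (seg.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            seg.remove(child)
        else:
            previous = child


def tune_tmx(tmx_text):
    """Return the TMX document with unpaired tags removed from every <seg>."""
    root = ET.fromstring(tmx_text)
    for seg in root.iter("seg"):
        drop_unpaired_tags(seg)
    return ET.tostring(root, encoding="unicode")
```

Calling `tune_tmx(SAMPLE_TMX)` leaves the paired tags of the second unit in place and reduces the segments of the first unit to plain text, which gives them a better chance of matching the same sentence in a DITA block element.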

When your DITA content is ready to be translated for the first time, do the following:

  1. Export the DITA documents to XLIFF.

  2. Import the XLIFF files into your CAT tool.

  3. Run the translation against the TM.

    You should get exact matching on the plain text and fuzzy matching on the tags. It may be possible to automatically recover 70% of the tags. Depending on the algorithm used to measure quality, this means you will achieve about 80% to 95% matching overall.
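
The three steps above can be sketched in Python as a simple pre-translation pass. This is only an illustration under stated assumptions: the XLIFF 1.2 namespace, the sample file, the dictionary used as a TM, and the names `pretranslate`, `SAMPLE_XLIFF`, and `TM` are all hypothetical, and each unit is assumed to have no `<target>` yet. Segments whose plain text is found in the TM are filled in; segments that also contain inline elements are flagged for review, since the tags cannot be recovered from a plain-text legacy TM.

```python
import xml.etree.ElementTree as ET

# Assumption: the DITA-to-XLIFF export produces XLIFF 1.2.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)

SAMPLE_XLIFF = """
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="task.dita" source-language="en" target-language="es" datatype="xml">
    <body>
      <trans-unit id="1"><source>Stop the engine.</source></trans-unit>
      <trans-unit id="2"><source>Press <g id="1">Start</g>.</source></trans-unit>
      <trans-unit id="3"><source>Check the oil level.</source></trans-unit>
    </body>
  </file>
</xliff>
""".strip()

# Hypothetical legacy TM, keyed on the plain text of the source segment.
TM = {
    "Stop the engine.": "Pare el motor.",
    "Press Start.": "Pulse Inicio.",
}


def pretranslate(xliff_text, tm):
    """Fill <target> elements from the TM and count the kinds of result."""
    root = ET.fromstring(xliff_text)
    counts = {"plain_text_match": 0, "tag_review": 0, "unmatched": 0}
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        text = "".join(source.itertext())  # source text without inline tags
        translation = tm.get(text)
        if translation is None:
            counts["unmatched"] += 1
            continue
        target = ET.SubElement(unit, f"{{{XLIFF_NS}}}target")
        target.text = translation
        if len(source) == 0:
            # No inline elements: the TM text can be used as it stands.
            target.set("state", "translated")
            counts["plain_text_match"] += 1
        else:
            # Inline elements present: the translator must restore the tags.
            target.set("state", "needs-review-translation")
            counts["tag_review"] += 1
    return ET.tostring(root, encoding="unicode"), counts
```

A real CAT tool does far more than a dictionary lookup, but the counts returned here show where the translator's effort goes on the first DITA-based project: restoring tags and translating unmatched segments.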

Once the translator has completed the translation, the TM should be exported as a TMX file. This TMX will now correctly tag the DITA block elements as well as correctly segment the sentences, and should therefore be used as the TM for the next DITA-based translation project. For future localization projects, the new TMX should yield exact matching at the segmentation level used for translation (block or sentence).

Notes

  • In general, sentence-level segmentation provides better matching, but segmentation at the block level improves the quality of the translation. For example, you may need three sentences in Spanish to translate two English sentences. The resulting Spanish translation will read better if the paragraph is translated as a block instead of isolated sentences.

  • If the best practices discussed above are used, the first translation of the DITA content can include new content. There is no need to translate the migrated content first and add new content afterwards.
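
To make the segmentation choices discussed in this article concrete, here is a minimal Python sketch that segments a block at the block level or the sentence level, with an optional legacy rule that also ends a segment after a colon. The function name and parameters are hypothetical, and the regular expression is deliberately naive (for example, it would split after abbreviations); real tools use configurable segmentation rules.

```python
import re


def segment(block, level="sentence", break_after_colon=False):
    """Split the text of one block element into translation segments.

    level             -- "block" keeps the whole block as a single segment;
                         "sentence" splits after sentence-ending punctuation
    break_after_colon -- legacy-style rule that also ends a segment at ':'
    """
    text = block.strip()
    if level == "block":
        return [text]
    terminators = ".!?:" if break_after_colon else ".!?"
    # Split on whitespace that directly follows a terminator character.
    pattern = rf"(?<=[{re.escape(terminators)}])\s+"
    return [part for part in re.split(pattern, text) if part]
```

With the text "To start: press the button. Wait for the light.", sentence-level segmentation gives two segments, the legacy colon rule gives three, and block-level segmentation keeps the text as one unit for the translator.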

DITA Translation Subcommittee Meeting Minutes: 10 July 2006

(Recorded by Gershon Joseph <gershon@tech-tav.com>)

The DITA Translation Subcommittee met on Monday, 10 July 2006 at 08:00am PT
for 70 minutes.

1.  Roll call

    Present:
        Robert Anderson
        Don Day
        Kevin Farwell
        JoAnn Hackos
        Gershon Joseph
        Charles Pau
        Rodolfo Raya
        Yves Savourel
        David Walters

    Regrets:
        Felix Sasaki
        Andrzej Zydron

2.  Accept the minutes[1] of the previous meeting.

    Accepted. [Moved by Don, seconded by Yves, no objections.]

3.  Review open action items:

    ACTION -- Don to inform us where to publish our best practices.

        Closed. See summary[3].
    
    ACTION  -- Rodolfo to prepare an outline for this best practice document 
    for translating DITA and submit to list for discussion at a future meeting. 
    At that meeting, SC should agree on the outline.

        In progress.

4.  Returning business:

4.1 Discuss Gershon Joseph's next draft of the best practice for legacy TM
    
    In progress. Gershon hopes to submit draft tomorrow for review.

4.2 XLIFF transforms continued discussion.
    Discuss Rodolfo's draft outline for an XLIFF best practice document
    for translating DITA if it is ready to discuss.
     
    --ACTION-- Everyone to read Rodolfo's XLIFF article[2] and
    provide feedback to Rodolfo this week via the mailing list.

4.3 Discuss JoAnn's draft Best Practices for Indexing. Want to clarify this 
    document. It will be incorporated into the full Best Practice statement 
    that JoAnn and Andrzej are authoring.

    --ACTION-- Robert and JoAnn to work on wording of note to keyword element 
    noting that it is considered inline when inside topics, but as standalone 
    segments when inside <keywords> (in prolog).

    --ACTION-- Rodolfo and Andrzej to investigate how translation tools can 
    differentiate between these keyword elements as inline and standalone 
    segments.

    --ACTION-- JoAnn to take to TC our concerns about both the start and end 
    index range markers holding the index term.
 
-- Meeting adjourned at 09:10am PT --

5.  New Business:

5.1 Handling multi-language documents

    Charles Pau and others to provide examples to the list for discussion

    Don requests to move this item to the agenda of the next meeting.

5.2 Andrzej provided examples of the need to change reusable building block 
    to the group for discussion. 

---
[1] http://lists.oasis-open.org/archives/dita-translation/200606/msg00018.html
[2] http://www.heartsome.org/EN/xliff.html
[3] http://lists.oasis-open.org/archives/dita-translation/200607/msg00008.html

Best Practice for Indexing DITA topics.doc


