Best Practice for Leveraging Legacy Translation Memory when Migrating to DITA
Many organizations have previously translated content that was authored in non-DITA tools (such as Word and FrameMaker). When an organization migrates its legacy content into the new DITA authoring environment, what should it do about its legacy translation memory? This legacy translation memory (TM) represents a large financial investment, and it should not be thrown away simply because a new authoring architecture is being adopted.
This article describes best practices that will help organizations to use their legacy TM for future translation projects that are authored in DITA, in order to minimize the expense of translating DITA-based content.
Before we get into the details, let's define the terms used in the localization industry so that subsequent sections will be better understood.
CAT: Computer Aided Translation, a class of tools that helps the translator translate the source content. CAT tools usually leverage TM to match sentences and inline phrases that were previously translated. In addition, some CAT tools use MT to translate glossary and other company-specific terms (extracted from a terminology database).
MT: Machine Translation, a technology that translates content directly from source without human intervention. Used in isolation, MT usually generates an unusable translation. However, when integrated into a CAT tool to translate specific terminology, MT is a useful technology.
TM: Translation Memory, a technology that reuses translations previously stored in the database used by the translation tool. TM preserves the translation output for reuse with subsequent translations.
TMX: Translation Memory eXchange, an industry standard format for exchanging TM between CAT tools.
XLIFF: XML Localisation Interchange File Format, a document format used for exchanging translatable content between CAT tools.
Recommended Best Practices
If you keep the following points in mind, you should be able to maximize the reuse of your existing translation memory when you send your DITA documents for translation:
Ensure your translation service provider uses a tool that supports TMX (Translation Memory eXchange). This will ensure you can migrate your TM between CAT tools that support the industry standard for TM interchange. This is important not only to free you from dependence on a single translation service provider, but also to allow you to fine-tune your segmentation rules so that they better match the DITA-based XML source documents you'll be sending for translation.
Provided the structure of the DITA-based content has not changed radically compared to the legacy documents, the TM software should fully match most sentences. As long as the legacy TM aligns with the DITA source at the sentence level, the translation software should be able to fully leverage matching. Good CAT tools will break the DITA block elements down into sentence-level segments, which will ensure better matching of the legacy TM. Usually, the DITA content is transformed into XLIFF, which can handle segments at the block or sentence level.
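To make the idea of sentence-level segmentation concrete, here is a minimal Python sketch that breaks DITA block elements into sentences. It is only an illustration: the function name block_segments, the choice of block elements, and the naive boundary rule (sentence punctuation followed by whitespace and a capital letter) are assumptions made for this example. Real CAT tools use configurable segmentation rules and keep inline markup as placeholders rather than dropping it.

```python
import re
import xml.etree.ElementTree as ET

# Naive boundary rule (an assumption for this sketch): ".", "!" or "?"
# followed by whitespace and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def block_segments(dita_xml, block_tags=("p", "shortdesc")):
    """Yield (block tag, sentence) pairs, ignoring inline markup."""
    root = ET.fromstring(dita_xml)
    for elem in root.iter():
        if elem.tag in block_tags:
            # Flatten inline elements and normalize whitespace.
            text = " ".join("".join(elem.itertext()).split())
            for sentence in _BOUNDARY.split(text):
                if sentence:
                    yield elem.tag, sentence


sample = (
    "<topic id='t1'><title>Starting</title><body>"
    "<p>Press <uicontrol>Start</uicontrol>. The unit powers on.</p>"
    "</body></topic>"
)
segments = list(block_segments(sample))
```

Because the legacy TM was most likely segmented at the sentence level too, segments produced this way are the units that have a chance of matching it exactly.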
Inline elements may not match at all, or may only fuzzy match. If the TM is preprocessed to prepare it for the DITA-based translation project, then inline elements should fully match. Note that a good TM engine should help you recover 70% of the inline tags, which is the main area where matching is prone to fail.
If text entities and/or conrefs are used as containers for reusable text, then these items may not fully match (only fuzzy match at best). However, since each of these items needs to be translated only once, and should at least fuzzy match, it should not result in significant translation expense.
If you tweak the TMX (exported from the legacy TM) to better align it with the new DITA content, you should realize an improvement of 10-20% on TM matching. Whether this is worth the effort and expense depends on the size of the DITA documents to be translated. The idea is to export the TM to TMX, process the TMX to better align the sentence-level segmentation with your DITA content, and then import the TMX back into your TM. Your TM will then be better aligned with your DITA content, which will result in more accurate matching.
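The export, process, and re-import cycle can be sketched in Python as follows. This is a hedged illustration, not a description of any particular tool: the function resegment_tmx and its splitting rule are assumptions made for this example. It splits a translation unit into sentence-level units only when every language variant contains the same number of sentences (and more than one), and it leaves units containing inline tags untouched.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Naive boundary rule (an assumption): sentence punctuation plus whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    return [s for s in _BOUNDARY.split(text.strip()) if s]


def resegment_tmx(tmx_text):
    """Return TMX text in which multi-sentence units are split per sentence."""
    root = ET.fromstring(tmx_text)
    body = root.find("body")
    new_units = []
    for tu in list(body):
        segs = [tuv.find("seg") for tuv in tu.findall("tuv")]
        # Only handle plain-text segments; skip units with inline tags.
        plain = all(s is not None and len(s) == 0 and s.text for s in segs)
        parts = [split_sentences(s.text) for s in segs] if plain else []
        same = parts and all(len(p) == len(parts[0]) for p in parts)
        n = len(parts[0]) if same else 0
        if n > 1:
            for i in range(n):
                unit = copy.deepcopy(tu)
                for tuv, sentences in zip(unit.findall("tuv"), parts):
                    tuv.find("seg").text = sentences[i]
                new_units.append(unit)
        else:
            new_units.append(tu)  # keep the unit unchanged
    body[:] = new_units
    return ET.tostring(root, encoding="unicode")


legacy_tmx = (
    '<tmx version="1.4"><header/><body><tu>'
    '<tuv xml:lang="en"><seg>Open the lid. Press start.</seg></tuv>'
    '<tuv xml:lang="es"><seg>Abra la tapa. Pulse inicio.</seg></tuv>'
    "</tu></body></tmx>"
)
aligned_tmx = resegment_tmx(legacy_tmx)
```

Units whose sentence counts differ between languages are deliberately kept whole, because splitting them automatically would pair the wrong sentences; those are better reviewed by a translator.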
One area where things can go wrong is if the legacy content does not use matched tags. Unmatched tags can result from conditional text marked up in legacy tools (such as FrameMaker), or when block elements contain several sentences that share a common format marker (for example, a paragraph containing several sentences marked as bold; the first sentence contains only an opening bold tag, and the last sentence contains only a closing bold tag). In the case of unmatched tags in a segment, you can expect only fuzzy matching.
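A quick way to estimate how many legacy segments are affected is to scan them for unbalanced tags. The Python sketch below assumes segments are available as strings with angle-bracket tags; the helper name unmatched_tags is hypothetical, and real TM formats often encode inline markup differently (for example as placeholder elements), so treat this as an illustration of the check rather than a ready-made tool.

```python
import re

# Captures: optional "/" for a closing tag, the tag name, and an optional
# trailing "/" for a self-closing tag.
_TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")


def unmatched_tags(segment):
    """Return (tags opened but never closed, closing tags with no opener)."""
    open_stack, stray_closers = [], []
    for match in _TAG.finditer(segment):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # e.g. <br/> needs no partner
        if not closing:
            open_stack.append(name)
        elif open_stack and open_stack[-1] == name:
            open_stack.pop()
        else:
            stray_closers.append(name)
    return open_stack, stray_closers


# A bold paragraph split into sentence segments, as described above.
first = unmatched_tags("<b>Disconnect the power cord.")
last = unmatched_tags("Then wait ten seconds.</b>")
balanced = unmatched_tags("Press <b>Start</b> to begin.")
```

Segments for which either list is non-empty are the ones that can be expected to fuzzy match only.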
After the first translation of the DITA content has been completed, the problem with unmatched tags should disappear.
When your DITA content is ready to be translated for the first time, do the following:
Export the DITA documents as XLIFF.
Import the XLIFF files into your CAT tool.
Run the translation against the TM.
You will get perfect matching on the sentences, and fuzzy matching on the tags. You can expect 100% matching on the plain text and 70% recovery on the tags. Depending on the algorithm used to measure quality, this means you will achieve about 80 to 95% matching overall.
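The difference between text matching and tag matching can be illustrated with a small Python sketch. It uses the similarity ratio from the standard difflib module as a stand-in for a CAT tool's match score, which is an assumption: commercial tools use their own scoring algorithms, so the absolute numbers will differ. The function name match_scores and the sample segments are invented for this example.

```python
import re
from difflib import SequenceMatcher

_TAG = re.compile(r"<[^<>]+>")


def match_scores(tm_source, new_source):
    """Return (score including tags, score on plain text), each in 0..1."""
    with_tags = SequenceMatcher(None, tm_source, new_source).ratio()
    plain = SequenceMatcher(
        None, _TAG.sub("", tm_source), _TAG.sub("", new_source)
    ).ratio()
    return with_tags, plain


# Same sentence, but the legacy bold tag became a DITA uicontrol element.
legacy = "Press <b>Start</b> to begin."
dita = "Press <uicontrol>Start</uicontrol> to begin."
with_tags, plain = match_scores(legacy, dita)
```

In this example the plain-text score is a perfect match while the score that includes tags is lower, which mirrors the situation described above: the text is fully leveraged and only the tags need attention from the translator.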
Once the translator has completed the translation, the TM should be exported as a TMX file. This TMX will now correctly tag the DITA block elements as well as correctly segment the sentences, and should therefore be used as the TM for the next DITA-based translation project. For future localization projects, the new TMX will yield perfect matching at the segmentation level used for translation (block or sentence).
While the above discussion emphasizes working at the sentence level, it should be noted that, although the sentence level generally provides better matching, working at the block level improves the quality of the translation. For example, you may need three sentences in Spanish to translate two English sentences. The resulting Spanish translation will read better if the paragraph is translated as a block instead of as isolated sentences.