[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: FW: [dita-translation] Please review best practice for reusing legacy TM
Best Regards,
Gershon
---
Gershon L Joseph
Member, OASIS DITA and DocBook Technical Committees
Director of Technology and Single Sourcing
Tech-Tav Documentation Ltd.

-----Original Message-----
From: David Walters [mailto:waltersd@us.ibm.com]
Sent: Monday, July 10, 2006 11:30 PM
To: gershon@tech-tav.com
Cc: bhertz@sdl.com; 'Bryan Schnabel'; Charles Pau; christian.lieske@sap.com; Dave A Schell; dita-translation@lists.oasis-open.org; dpooley@sdl.com; fsasaki@w3.org; 'Howard.Schwartz'; ishida@w3.org; 'Jennifer Linton'; KARA@CA.IBM.COM; mambrose@sdl.com; pcarey@lexmark.com; rfletcher@sdl.com; 'Munshi, Sukumar'; tony.jewtushenko@productinnovator.com; ysavourel@translate.com
Subject: Re: [dita-translation] Please review best practice for reusing legacy TM

I have made some comments in this attached file.
(See attached file: translationBestPractice_daw.html)

David
Translation Tool Development and iSeries Globalization Support
EMail: waltersd@us.ibm.com
Phone: (507) 253-7278, T/L: 553-7278, Fax: (507) 253-1721
CHKPII: http://w3-03.ibm.com/globalization/page/2011
TM file formats: http://w3-03.ibm.com/globalization/page/2083
TM markups: http://w3-03.ibm.com/globalization/page/2071

From: "Gershon L Joseph" <gershon@tech-tav.com>
Sent: 07/10/2006 01:34 PM
Reply-to: <gershon@tech-tav.com>
To: <dita-translation@lists.oasis-open.org>, <mambrose@sdl.com>, <pcarey@lexmark.com>, <rfletcher@sdl.com>, <bhertz@sdl.com>, <ishida@w3.org>, <tony.jewtushenko@productinnovator.com>, <christian.lieske@sap.com>, "'Jennifer Linton'" <jennifer.linton@comtech-serv.com>, "'Munshi, Sukumar'" <Sukumar.Munshi@lionbridge.com>, Charles Pau/Cambridge/IBM@Lotus, <dpooley@sdl.com>, <fsasaki@w3.org>, <ysavourel@translate.com>, Dave A Schell/Raleigh/IBM@IBMUS, "'Bryan Schnabel'" <bryan.s.schnabel@tek.com>, "'Howard.Schwartz'" <Howard.Schwartz@trados.com>, <KARA@CA.IBM.COM>
Subject: [dita-translation] Please review best practice for reusing legacy TM

Hi all,

Please could you review the attached document and discuss comments on the list. We would like to finalize this document during next Monday's SC meeting.

Thanks in advance for your help in making this best practice document useful to the wider DITA community.

Best Regards,
Gershon
---
Gershon L Joseph
Member, OASIS DITA and DocBook Technical Committees
Director of Technology and Single Sourcing
Tech-Tav Documentation Ltd.
office: +972-8-974-1569
mobile: +972-57-314-1170
http://www.tech-tav.com

(See attached file: translationBestPractice.html)

Title: Best Practice for Leveraging Legacy Translation Memory when Migrating to DITA
Gershon Joseph, Tech-Tav Documentation Ltd.
Rodolfo Raya, Heartsome Holdings Pte. Ltd.

Many organizations have previously translated content that was authored in non-DITA tools (such as Word and FrameMaker). When such an organization migrates its legacy content into a new DITA authoring environment, what should it do about its legacy translation memory? This legacy translation memory (TM) was created at large financial cost and cannot simply be thrown away because a new authoring architecture is being adopted. This article describes best practices that help organizations reuse their legacy TM in future translation projects authored in DITA, in order to minimize the expense of translating DITA-based content.

Before we get into the details, let's define the terms used in the localization industry so that subsequent sections will be better understood.

CAT: Computer Aided Translation, software that helps the translator translate the source content. CAT tools usually leverage Translation Memory to match sentences and inline phrases that were previously translated. In addition, some CAT tools use Machine Translation to translate glossary and other company-specific terms (extracted from a terminology database).

Matching: The level of accuracy with which CAT tools can match content being translated against the TM. The levels of matching are defined as follows:

  Fuzzy matching: The source segment being matched is similar, but not identical, to the source-language segment in the TM.

  Leveraged matching: The source segment being matched is identical to the matched segment, but the context is not known.

  Exact matching: The source segment being matched is identical to the matched segment and comes from exactly the same context.

MT: Machine Translation, a technology that translates content directly from the source without human intervention. Used in isolation, MT usually generates an unusable translation. However, when integrated into a CAT tool to translate specific terminology, MT is a useful technology.

TM: Translation Memory, a technology that reuses translations previously stored in the database used by the translation tool. TM preserves translation output for reuse in subsequent translations.

TMX: Translation Memory eXchange, an industry-standard format for exchanging TM between CAT tools.

XLIFF: XML Localisation Interchange File Format, a document format used to exchange translatable content between CAT tools.

If you keep the following points in mind, you should be able to maximize reuse of your existing translation memory when you send your DITA documents for translation:
· Ensure your translation service provider uses a tool that supports TMX (Translation Memory eXchange). This ensures you can migrate your TM between CAT tools that support the industry standard for TM interchange. This is important not only to free you from dependence on a single translation service provider, but also to let you fine-tune your segmentation rules to better match the DITA-based XML source documents you'll be sending for translation.
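To see why TMX support buys you this portability: a TMX memory is plain XML that any conforming tool can parse. The sketch below reads a minimal TMX 1.4 memory with Python's standard XML parser; the translation unit shown is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A minimal TMX 1.4 memory with one translation unit (illustrative;
# a real export carries many units and richer header metadata).
TMX = """<tmx version="1.4">
  <header creationtool="legacy-cat" creationtoolversion="1.0"
          segtype="sentence" o-tmf="tm" adminlang="en"
          srclang="en" datatype="xml"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press the power button.</seg></tuv>
      <tuv xml:lang="de"><seg>Druecken Sie den Netzschalter.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# xml:lang expands to this namespaced attribute name in ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
root = ET.fromstring(TMX)

# Every TMX-capable CAT tool can read this structure back in: each <tu>
# is one translation unit, each <tuv> one language variant of it.
pairs = [(tuv.get(XML_LANG), tuv.findtext("seg"))
         for tu in root.iter("tu") for tuv in tu.iter("tuv")]
print(pairs)
```

Because the structure is this simple and standardized, moving a memory between tools (or hand-tuning it, as discussed below) never locks you into one vendor.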
· Provided the structure of the DITA-based content has not changed radically compared to the legacy documents, the CAT software should achieve exact matching on most segments in the TM.
· Inline elements may not match at all, or may only yield fuzzy matches. If a CAT tool is used to preprocess the TM to prepare it for the DITA-based translation project, then segments containing inline elements should yield exact matches.
· If conrefs are used as containers for reusable text, these items may not match exactly (a fuzzy match at best). However, since each of these items needs to be translated only once, and should at least fuzzy match, this should not result in significant translation expense. For best practices on using conref elements in DITA documents that need to be translated, please see XREF TO CONREF BEST PRACTICE.
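To make the conref point concrete, here is a toy same-file conref resolution in Python; the topic content is invented, and a real DITA processor (such as the DITA Open Toolkit) resolves conrefs across files with full validation, unlike this simplified sketch.

```python
import copy
import xml.etree.ElementTree as ET

# A toy DITA topic: the second <note> pulls its content in by conref.
# (Illustrative only -- real conref processing spans files and DTDs.)
TOPIC = """<topic id="t1">
  <body>
    <note id="warn1">Disconnect the power cord first.</note>
    <note conref="#t1/warn1"/>
  </body>
</topic>"""

root = ET.fromstring(TOPIC)

# Resolve same-file conrefs: replace each referencing element with a
# copy of the element whose id matches the fragment after the "/".
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
for parent in root.iter():
    for i, child in enumerate(list(parent)):
        ref = child.get("conref")
        if ref:
            target = by_id[ref.rsplit("/", 1)[-1]]
            clone = copy.deepcopy(target)
            clone.attrib.pop("id", None)
            parent.remove(child)
            parent.insert(i, clone)

print(ET.tostring(root, encoding="unicode"))
```

The reused text exists once in the source, so it is translated once; every published copy then carries that single translation.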
· When text entities are used as containers for reusable text, it is preferable to use a CAT tool that extracts translatable text from the XML files using an XML parser. The XML parser will insert the content of the text entities into the source text that the translator uses as a reference. This allows the translator to check that the translated segments flow correctly in the target language. If text entities are translated separately from the context where they are used, there may be grammatical inconsistencies in the final text when the translated DITA files are published.
· You can export the legacy TM to a TMX file. The TMX file is an XML file, which can be manipulated to better align the translation segments with the DITA markup. The modified TMX file can then be converted back into a TM. This new TM will provide more exact matches against your DITA content than the legacy TM will. When tuning your legacy TM, take the following into account:

  o Unmatched tags: Unmatched tags can result from conditional text marked up in legacy tools (such as FrameMaker), or from block elements that contain several sentences sharing a common format marker (for example, a paragraph containing several sentences marked as bold, where the first sentence contains only an opening bold tag and the last sentence contains only a closing bold tag).

  o Segmentation rules: The segmentation rules used for translating legacy material may not be well suited to XML documents. For example, your legacy Word- or FrameMaker-based segmentation rules may include a rule to terminate a segment after a colon, to separate a procedure title from the steps. Since DITA uses markup to indicate where the procedure title ends and the steps begin, this segmentation rule can be discarded.

When your DITA content is ready to be translated for the first time, do the following:

1. Export the DITA documents to XLIFF.
2. Import the XLIFF files into your CAT tool.
3. Run the translation against the TM.

You should get exact matching on plain text segments and fuzzy matching on segments that contain inline tags. It may be possible to automatically recover about 70% of the inline tags.

Once the translator has completed the translation, the TM should be exported as a TMX file. This TMX file will now correctly tag the DITA block elements as well as correctly segment the sentences, and should therefore be used as the TM for the next DITA-based translation project. For future localization projects, the new TMX should yield exact matching at the segmentation level used for translation (block or sentence).
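Because TMX is plain XML, the tuning described above can be done with ordinary XML tooling. A minimal sketch, assuming the legacy segmentation rule ended segments at the colon separating a procedure title from its steps; the file content is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Legacy TM as TMX: the old segmentation rule ended a segment at the
# colon separating a procedure title from its steps (invented example).
TMX = """<tmx version="1.4">
  <header creationtool="legacy-cat" creationtoolversion="1.0"
          segtype="sentence" o-tmf="tm" adminlang="en"
          srclang="en" datatype="xml"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>To install the printer:</seg></tuv>
      <tuv xml:lang="fr"><seg>Pour installer l'imprimante :</seg></tuv>
    </tu>
  </body>
</tmx>"""

root = ET.fromstring(TMX)

# In DITA the procedure title lives in its own <title> element without
# the colon, so strip trailing colons (and the French space before the
# colon) from every segment to turn near-misses into exact matches.
for seg in root.iter("seg"):
    if seg.text:
        seg.text = seg.text.rstrip(": ")

segs = [seg.text for seg in root.iter("seg")]
print(segs)
```

After a tweak like this, the modified TMX can be imported back into the CAT tool as the tuned TM.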
· Note that, in general, although sentence-level segmentation provides better matching, working with segmentation at the block level improves the quality of the translation. For example, you may need three sentences in Spanish to translate two English sentences. The resulting Spanish translation will read better if the paragraph is translated as a block instead of as isolated sentences.

· If the best practices discussed above are followed, the first translation of the DITA content can include new content. There is no need to translate the DITA content immediately after migration to DITA before adding new content to the documents.
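The DITA-to-XLIFF round trip in the numbered steps above can also be sketched in miniature. The snippet below parses a minimal XLIFF 1.2 file such as an exporter might produce for one topic; the element and attribute names follow the XLIFF 1.2 specification, but the file content itself is invented.

```python
import xml.etree.ElementTree as ET

# A minimal XLIFF 1.2 document for one DITA topic (content invented).
XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="topic.dita" source-language="en" target-language="de"
        datatype="xml">
    <body>
      <trans-unit id="1">
        <source>Installing the printer</source>
        <target>Installieren des Druckers</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
root = ET.fromstring(XLIFF)

# A CAT tool works through the trans-units, filling each <target> from
# TM matches or fresh translation; here we just read the pairs back out.
units = [(tu.get("id"),
          tu.findtext("x:source", namespaces=NS),
          tu.findtext("x:target", namespaces=NS))
         for tu in root.findall(".//x:trans-unit", NS)]
print(units)
```

Once every target is filled, the XLIFF is converted back to DITA and the accumulated translations are exported as the new TMX memory described above.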