Subject: XLIFF 2.0 Comments
Dear XLIFF TC Member,

Please find attached comments/observations/questions/ideas concerning the XLIFF 2.0 working draft dated April 16, 2013 (http://docs.oasis-open.org/xliff/xliff-core/v2.0/csprd01/xliff-core-v2.0-csprd01.html). Please feel free to contact me for clarifications if anything is unclear.
One of the important benefits of a standardized localization interchange format is to ensure interoperability between the different processes and tools employed in the localization supply chain, which certainly may vary from use case to use case. Although the actual requirements on what interoperability means -- syntactically and semantically -- and the associated needs and demands of a specific (enterprise) case may differ in varying degrees, there should be consensus about a common set of features that need to be addressed for a seamless interchange of data and metadata. Reading the draft with this assumption in mind, I hope that the proposed modifications and additions help to accomplish the goal of interoperability. I apologize in advance if replying to my comments requires summarizing discussions that have presumably already taken place.
Best regards, Jörg --*Prof. Dr. Jörg Schütz* *|* bioloom group *|* Bahnhofstr. 12 *|* D-66424 Homburg *|* Fon +49-6841-756-338 *|* Mobile +49-170-801-9982 *|* firstname.lastname@example.org
*bioloom group* *|* Vertreten durch / Represented by: Prof. Dr. Jörg Schütz *|* Sitz / Register: Homburg *|* USt-IdNr. / Tax-Id.: DE261087278 *|* Web: www.bioloom.de
1. General Overview

In this section, I give a general impression -- a brief snapshot of where critical elements have been identified in the specification. The subsequent sections then go into further detail.

In general, the specification lacks examples -- sometimes only the obvious cases are exemplified, and not those where guidance would be appropriate or even necessary, e.g. one full-fledged XLIFF example -- as well as a rationale for the various design decisions, especially those that deviate from the previous XLIFF version 1.2, and a brief overview of the XLIFF evolution. XLIFF 2.0 adds complexity by unnecessarily inflating the XML structure (i.e. group-unit-segments), by spreading information into modules that could easily be maintained at the core level, and by supporting a very specific view on dealing with and handling inline codes, i.e. there is apparently some bias towards RTF inline codes. An example of the module case is that the <alt-trans> element of version 1.2 has been moved to the Translation Candidates module (TCM). TCM also assumes a very specific view on translation candidates, particularly their assessment (rating, etc.) -- see the attributes similarity, matchQuality, and matchSuitability. A different granularity could nevertheless be maintained by attributes from any namespace, which, however, would not contribute to general seamless interoperability.

Regarding possible extensions, XLIFF 2.0 is very restrictive compared to version 1.2, which on the one hand is good for maintaining interoperability, but on the other hand lacks flexibility for certain use cases. Extension points with either the Metadata module element <mda:metadata> or a custom XML namespace are only allowed at the three element levels <file>, <group> and <unit>; with the <mda:metadata> element we have in addition <segment>, <ignorable> and <mtc:match> from TCM.
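To illustrate the extension point just described, a minimal sketch of a Metadata module extension at the <unit> level might look as follows (the namespace URI follows my reading of the draft; category and type values as well as the content are purely illustrative):

```xml
<unit id="u1">
  <!-- Metadata module extension point at the <unit> level;
       category and type values are illustrative, not normative -->
  <mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
    <mda:metaGroup category="review">
      <mda:meta type="reviewer">example-reviewer</mda:meta>
    </mda:metaGroup>
  </mda:metadata>
  <segment>
    <source>Press OK to continue.</source>
    <target>OK drücken, um fortzufahren.</target>
  </segment>
</unit>
```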
Custom attributes are permitted for the elements <file>, <group>, <unit>, <segment>, <ignorable>, <source>, <target>, <note>, <mrk>, and <sm>. In summary, XLIFF 2.0 on the one hand extends the previous version 1.2 in terms of expressiveness, because it adds capabilities that make it a package-like container format (mostly through modules), and on the other hand delimits customization in a very strict way and forces certain assumptions that are quite application specific.

2. Details on Core

2.1 Processing Requirements

It should be mentioned that XML processing instructions (PIs) must be preserved by tools that process XLIFF 2.0 but cannot handle these PIs.

2.2 Structure and Structural Elements

A comparison with the previous version 1.2 would support the general understanding of the new design and its overall rationale. This comparison should also include the relationship between the core elements and the module elements, and the general approach chosen to distribute data and metadata (in the broad sense) between these elements, including possible best or good practices.

The annotation elements <sm> and <em> are just specialized cases of the <mrk> element. Since they do not add any additional value to the inline annotation markup, they could be subsumed under <mrk>; a justification of their existence would therefore be most valuable. It is unclear whether, for the "term" type, the 'ref' attribute could be used to establish a relationship with entries in the Glossary module. The Glossary module does not have a mechanism, e.g. an attribute such as 'termId', or even an element, that allows for dereferencing (see also section 3.2).

All other defined inline elements add structural complexity to the format, and they could easily be replaced by only two inline code types: one standalone -- which includes the <cp> case of Unicode characters that are invalid in XML -- and one with a start and end marker.
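For illustration, the two proposed inline code types roughly correspond to the draft's standalone <ph> and paired <pc> elements; a sketch of how content could be marked up with just these two (identifiers and content are illustrative):

```xml
<segment>
  <source>
    <!-- standalone code, e.g. a line break or variable placeholder -->
    Press <ph id="1"/> to continue.
  </source>
</segment>
<segment>
  <source>
    <!-- paired code with start and end marker, e.g. formatting -->
    <pc id="2">Warning:</pc> do not power off the device.
  </source>
</segment>
```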
The need for the different introduced types is unclear, and the exemplification through RTF code is not very helpful because it represents a very specific application case. Inline codes should simply help to process the content (either by human or machine) and trigger appropriate translations, including possible markup within the content. The existing attributes would certainly apply to these two inline code types. In addition, the content of the elements <originalData> and <data> would be simplified too, if they are actually needed -- remember that their content might be used differently by tools, and might therefore lead to incompatibilities. Therefore, these elements might be candidates for the Resource Data module to actually guarantee interoperability (see also section 3.5).

2.3 Attributes

It might be more appropriate to maintain consistency between the abbreviations used for target (language and directionality), i.e. tgt vs. trg. In the case of directionality we might even abandon the source/target distinction and just use the attribute 'dir'. The attributes 'state' (no customization) and 'subState' (for customization) could be collapsed into one state attribute with pre-defined ('xlf:' namespace) and customized values. The attributes 'canCopy', 'canDelete', 'canOverlap', and 'canReorder' used in conjunction with inline codes are helpful because they add value to the processing (human and machine), and should therefore be retained if the previous suggestion of using just two inline codes were adopted.

3. Details on Modules

3.1 Translation Candidates Module

The Translation Candidates module is a replacement for the <alt-trans> element in XLIFF 1.2, and provides a means to maintain alternative translations (in particular from translation automation) for the translatable content. The module is not very restrictive in its attribute selection, and might therefore be hijacked for arbitrary customization purposes.
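For reference, a translation candidate entry under this module might look like the following sketch (the namespace URI follows my reading of the draft; the assessment attribute values and content are illustrative):

```xml
<mtc:matches xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0">
  <!-- similarity and matchQuality values are illustrative, not normative -->
  <mtc:match ref="#m1" similarity="85.0" matchQuality="74.0">
    <source>Press OK to continue.</source>
    <target>OK drücken, um fortzufahren.</target>
  </mtc:match>
</mtc:matches>
```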
An exception, however, is the attribute 'type', for which standard values are provided. Because of the stated processing requirements, these standard values should be further explained and justified. This module particularly lacks a contextual reference (before/after; previous/subsequent), which would certainly be very helpful for human and machine processing (even in fully automated cases). The Resource Data module might be a place for such contextual information, but only the Metadata module is a permitted element in this module.

3.2 Glossary Module

The Glossary module is a very simple incarnation of a bilingual terminology resource (source and target language of the <xliff> element) that offers neither a mechanism to relate the <term> entries to <source> and <target> content nor any other means to accomplish such a relationship by, for example, a term or even a concept identifier. Variations or synonyms are also not foreseen, and always require a new entry. The only attribute that is required is 'source' on the <definition> element, which is certainly very bizarre in this context. The module as it is defined in the specification is useless because it only provides an isolated data bag.

3.3 Format Style Module

The Format Style module offers a mechanism to support the generation of a simple HTML preview. Although limited, e.g. the embedding of images is not allowed, it might add value to the human translation process. A more sophisticated example should be provided in any case.

3.4 Metadata Module

The Metadata module is a very simple container format for customized data that should support the processing of the content data. An example should be provided to illustrate the relationship with the content data.

3.5 Resource Data Module

The Resource Data module is yet another data container that can specifically refer to external data, and might also present certain contextual information (see Section 3.1).
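As a sketch, such a reference to external data might look like this (element and attribute names follow my reading of the draft; the namespace URI, file path, and media type are illustrative):

```xml
<res:resourceData xmlns:res="urn:oasis:names:tc:xliff:resourcedata:2.0">
  <!-- href and mimeType values are illustrative -->
  <res:resourceItem mimeType="image/png" context="yes">
    <res:source href="resources/dialog-screenshot.png"/>
  </res:resourceItem>
</res:resourceData>
```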
However, for employing this module to provide guidance to the translator or to the processing tool, it might be misplaced under the <file> element; it could certainly also be useful at the <unit> and <segment> level to provide preceding and subsequent contextual content information (see Section 3.1). In addition, further examples should be provided to clarify the purpose and rationale of this module.

3.6 Change Tracking Module

The Change Tracking module permits adding processing information to the content elements <segment> and <unit>, and provides a useful means for maintaining and curating lifecycle information, including provenance. This module is a good example of how to establish references between different information elements, which are missing in other modules, the Glossary module in particular (see Section 3.2).

3.7 Size Restriction Module

The Size Restriction module provides means to encode size restrictions based on so-called restriction profiles (<profiles>). A <normalization> element specifies with two attributes ('general' and 'storage') how to normalize the content that should be processed. In both cases only the normalization forms C and D as specified by the Unicode Consortium are supported (values being "none", "nfc", and "nfd"). This module is yet another good example of a well-defined and extensible module (through the provision of additional profiles).

3.8 Validation Module

The Validation module defines a container format for a set of validation rules that should be applied to the translated content (<target> elements) of an XLIFF file. Rules are simple test cases that are applied to the associated content, and sometimes relate <source> and <target> content as well as normalization (see section 3.7). The execution of the tests should or can be automated.
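A minimal sketch of such rules might look as follows (the namespace URI and rule attribute names follow my reading of the draft; the tested strings are illustrative):

```xml
<val:validation xmlns:val="urn:oasis:names:tc:xliff:validation:2.0">
  <!-- each rule is a simple test applied to the <target> content;
       the values "OK" and "TODO" are illustrative -->
  <val:rule isPresent="OK" occurs="1"/>
  <val:rule isNotPresent="TODO"/>
</val:validation>
```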
The defined processing requirements -- or better, rule definition requirements -- however, limit the overall flexibility of the module, and therefore the module description should provide additional clarification and justification.

4. Conclusion

XLIFF 2.0 is a conglomerate of a file-based and a container-based interchange format that, on the one hand, represents translatable content (the bilingual file aspect), and on the other hand, provides processing support and maintenance data (the module aspect). As such it extends the previous version 1.2 with a kind of transport layer that is useful to track the entire lifecycle of localizable and translatable content. However, the specification is partly based on biased assumptions about possible use cases, e.g. the handling of inline codes, and it omits elements that would be very useful and helpful in processing the translatable content, for example terminological references (i.e. term-to-glossary) and contextual information (i.e. preceding/subsequent content), as well as more task-related meta information. One solution might be a strict separation of concerns: a bilingual content part that would include preview, terminology, and metadata related to change tracking, restrictions and validation, and a processing/task/transport part that would include task-related information and external resources. For the latter part, the ISO/TS 11669 (Translation Projects) specifications could be an additional source of inspiration. Last but not least, the work of the Interoperability Now! initiative (http://code.google.com/p/interoperability-now/) should be taken as yet another source of inspiration. In particular, their work on xliff:doc, i.e. the bilingual content part, is entirely based on XLIFF 1.2 (100% compatible), and their package format TIPP resembles some of the processing/task/transport part requirements. Obviously, XLIFF 2.0 does not maintain any backward compatibility with version 1.2.