Comment on XLIFF 2.0 30 day draft

Dear XLIFF TC Members,

Below, please find my comments on the XLIFF Version 2.0 Committee Specification Draft 01, dated April 16 2013 and found at http://docs.oasis-open.org/xliff/xliff-core/v2.0/csprd01/xliff-core-v2.0-csprd01.html. Thank you for providing this opportunity to comment.

My viewpoint in reading this document is as a tool implementor, and so my concerns are related to the questions:

Is the meaning of the specification clear?
If I were tasked with implementing this specification in one or more tools, would I be able to do so?
Would I expect such an implementation to behave consistently with implementations made independently by other people?

I will offer some general thoughts, followed by more specific notes.

General Thoughts

The document is severely lacking in examples, which makes it difficult in some cases to even guess what use cases it was trying to satisfy. The purpose of a number of minor features is non-obvious and needs to be better justified.
The complexity of XLIFF 2.0 is much greater than that of XLIFF 1.2, as is the scope of problems it is attempting to solve. I would strongly urge the TC to pursue multiple independent implementations of the format before finalizing the specification, in order to flush out problems that are not immediately apparent from reading it.
What is XLIFF 2.0 meant to be? The original vision of XLIFF seemed (to me) to be a normalization layer for translatable content - a lingua franca that would allow reliable interchange between organizations. In previous versions, this vision was hamstrung by well-documented problems with features being inconsistently implemented. XLIFF 2.0 addresses some of these concerns, but it has also become more ambitious. I think there is something of an identity crisis lurking in this document: XLIFF 2.0 is not sure if it's meant to be a normalization/interchange format, or a complete representation of all the available/useful metadata that can be abstracted out of a source document. The union of those two things is a very large format. XLIFF 2.0 is a very large format, but I'm not sure it's large enough to do everything it attempts. My main worry is that a lot of functionality has been added, but incompletely, and this means that the only problems it will solve completely will be relatively simple ones.

Specific Issues

Basic structure

The <file> element (2.2.2.2) is described as a "container for localization material extracted from a single document/source." This language is actually less restrictive than the language in XLIFF 1.2 ("…single extracted original document.") Unfortunately, even in XLIFF 1.2 this was not implemented consistently. Some tools adopt the concept of a sub-file "page" unit (e.g., a single worksheet from an Excel document, a single PowerPoint slide, or a single page from a multi-page document in Word, InDesign, etc.). Some implementations map each such page to its own <file> element, while others map a single <file> element to the entire document. This inconsistency will continue with XLIFF 2.0.
The intended use of the <ignorable> element is not clear from its definition in section 2.2.2.7.
The notes element (2.2.2.9) allows formatting style information (@fs:fs, @fs:subFs). Why? My understanding of the purpose of <note> was to allow for comment data made during the localization lifecycle (i.e., after text had been extracted from the native source, and before the translated target document was created), but the fs markup implies that it may also carry notes from the underlying content. It's also possible that the fs attributes are intended to allow richer text in <note> content, but this seems like a strange way to go about it.

SLR Module

Why aren't sizeInfo (H.1.4.9) and equivStorage (H.1.4.8) named consistently? They perform a similar function along different axes (size vs storage). Consider renaming sizeInfo to equivSize, or alternatively renaming equivStorage to storageInfo.
Can sizeInfoRef (H.1.4.10) point to data outside the XLIFF document itself? It seems that the intention is for it to always point to content within a local <data> element, but it is unclear.
SLR data is schematically valid in places where its meaning is not obvious. For example, it could be attached to a <group> element that contains a mix of segment content and <ignorable> elements. Are the size/storage requirements of <ignorable> content counted towards the overall totals for the <group>?
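
To make the <ignorable> question concrete, here is a minimal sketch of the two readings an implementor could arrive at. Everything in it is an assumption made for illustration: the fragment omits namespaces and all SLR attributes, the function name group_size is hypothetical, and plain character count stands in for whatever size or storage measure a real profile would define.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: namespaces and SLR attributes are omitted,
# and the layout is simplified for the sake of the example.
FRAGMENT = """
<group id="g1">
  <unit id="u1">
    <segment><source>Hello</source></segment>
    <ignorable><source>   </source></ignorable>
    <segment><source>World</source></segment>
  </unit>
</group>
"""


def group_size(group, count_ignorable):
    """Sum the character length of <source> text under a group.

    Character count is a stand-in for a real size/storage measure.
    """
    total = 0
    for element in group.iter():
        is_segment = element.tag == "segment"
        is_ignorable = element.tag == "ignorable"
        if is_segment or (count_ignorable and is_ignorable):
            source = element.find("source")
            if source is not None and source.text:
                total += len(source.text)
    return total


group = ET.fromstring(FRAGMENT)
print(group_size(group, count_ignorable=False))  # segments only
print(group_size(group, count_ignorable=True))   # segments plus ignorable
```

The two calls give different totals for the same <group>, so two tools that each pick one reading would disagree about whether a restriction on the group has been met.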

Metadata Module

No use case is provided for this, and there are no processing expectations. Is this data to be maintained during processing?

Validation module

Examples are needed.
The processing requirements for validation (I.1.2.2) include, "When <validation> occurs at the <group> level, rules must be applied to all <target> elements within the scope of <group>, except where overrides are specified at the <unit> level." What about in the case where <group> elements are nested? Can a nested <group> override the validation rules of a parent <group>?
The operation of the rule override mechanism is not obvious. In particular, I'm not sure how the disabled attribute (I.1.3.6) is meant to be used. For example, suppose there are multiple "mustLoc" rules that are defined in a given group's validation data. How would a nested group or unit disable one of those rules? Is the intention that the entirety of the rule should be reproduced, with the addition of the disabled attribute? I think this is the only possible way, since "disabled" offers no other way to reference a specific rule.
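
Here is a minimal sketch of the interpretation I arrived at, in which a nested scope disables an inherited rule by restating it in full. This is my guess at the intent, not something the draft states. The function name effective_rules, the dictionary representation of rules, the "yes" value for disabled, and the mustLoc values are all assumptions made for illustration.

```python
def effective_rules(inherited, local):
    """Combine inherited rules with the rules of a nested scope.

    Assumed interpretation: a local rule marked disabled cancels an
    inherited rule only when every other attribute matches exactly,
    because a rule carries no identifier that could be referenced.
    """
    def identity(rule):
        # A rule's identity is all of its attributes except "disabled".
        return frozenset((k, v) for k, v in rule.items() if k != "disabled")

    cancelled = {identity(r) for r in local if r.get("disabled") == "yes"}
    kept = [r for r in inherited if identity(r) not in cancelled]
    added = [r for r in local if r.get("disabled") != "yes"]
    return kept + added


# Placeholder rule values; the real attribute syntax is not reproduced here.
group_rules = [
    {"mustLoc": "Acme"},
    {"mustLoc": "Widget"},
]
unit_rules = [
    {"mustLoc": "Widget", "disabled": "yes"},  # full rule restated to disable it
]

print(effective_rules(group_rules, unit_rules))
```

Under this reading, any difference between the restated rule and the original (an extra attribute, a change in case) silently fails to disable anything. Applying the same function once per nesting level would also answer the nested <group> question, but only if the specification says that is the intent.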
Regarding the occurrences attribute (I.1.3.3):

The use of double quotes as an escaping mechanism is an unusual choice given that " is a problematic character in XML attribute values.
The value space is sufficiently complex that it may be better to just use an explicit XML schema. This would be more verbose, but would simplify implementations because it would remove the need for a one-off occurrence parser. Additionally, using this attribute both to require occurrences (e.g. "(foo)(1)") and to require that things not occur (e.g. "(foo)(0)") seems like a semantically tricky overloading of this arbitrary syntax. A real schema would make the desired behaviors more explicit.
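
For a sense of what the "one-off occurrence parser" involves, here is a minimal sketch that handles only the shape of the two examples quoted above. The grammar is an assumption inferred from those examples: a single "(text)(count)" pair with no escaping. The function names parse_occurrence and check_occurrence are hypothetical, and treating the count as an exact match is my guess rather than the draft's wording.

```python
import re

# Assumed shape: exactly one "(text)(count)" pair. The quoting and
# escaping rules are deliberately not handled in this sketch.
PAIR = re.compile(r"\(([^()]*)\)\((\d+)\)")


def parse_occurrence(value):
    """Return (text, count) for a value such as "(foo)(1)"."""
    match = PAIR.fullmatch(value)
    if match is None:
        raise ValueError(f"cannot parse occurrences value: {value!r}")
    return match.group(1), int(match.group(2))


def check_occurrence(target_text, value):
    """True when the text occurs exactly `count` times (assumed semantics)."""
    text, count = parse_occurrence(value)
    return target_text.count(text) == count


print(parse_occurrence("(foo)(1)"))
print(check_occurrence("foo bar", "(foo)(1)"))  # required and present
print(check_occurrence("foo bar", "(foo)(0)"))  # forbidden but present
```

Even this reduced form needs its own pattern and error handling, and it does not yet handle text that itself contains parentheses or quotes. Every implementor would have to write and test such a parser independently.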

Regarding the mustLoc attribute (I.1.3.4):

Similar to comments about occurrences, the overloading of this attribute to mean both "must contain" and "must not contain" seems unnecessarily complicated. Why not just split this into @mustLoc and @mustNotLoc, or similar? This would also simplify implementations that would no longer need to special-case the parsing of these attribute values.

Matches module

The value of an optional id attribute in <match> is dubious. A required id might provide value, but an optional id provides roughly as much value as none at all to a consumer.
I'm not sure I disagree, but the bifurcation between similarity and matchQuality attributes strikes me as odd. I understand the different cases in which they might be used, but what on earth is a tool meant to do with both of them? This is exacerbated because matchQuality has no prescribed meaning. It might be an MT confidence score (which would be useful), or it might not. In practice, I feel like 99% of the time these two values will be the same, and the other 1% of the time, they will be different -- in which case the meaning is ambiguous.

Change Tracking Module

This is mostly a matter of preference, but I don't like the ad-hoc referencing mechanism used to attach revision data. I would prefer to see a more robust system based on RDF or something similar.
No processing restrictions are given for the nid attribute. It is strange that, for example, appliesTo could specify "note" while nid is absent. In this case, it would not be clear which note the revision refers to.
The stated purpose of the @checksum attribute is to detect changes in the revision data from non-compliant parsers. In that case, why not have checksums on the source content itself (or match proposals, etc)? It seems strange to only place this protection here.
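
To illustrate that the same protection would be just as cheap to apply to source content, here is a minimal sketch. The function name content_checksum and the sample strings are hypothetical, and SHA-256 is an arbitrary choice for illustration, not something taken from the specification.

```python
import hashlib


def content_checksum(text):
    """Hex digest of the UTF-8 encoding of a piece of content.

    SHA-256 is an arbitrary choice made for this sketch.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


source = "Press the Start button."
recorded = content_checksum(source)

# A tool that later rewrites the source, even trivially, is detectable.
altered = "Press the start button."
print(content_checksum(source) == recorded)   # unchanged content
print(content_checksum(altered) == recorded)  # modified content
```

If guarding against non-compliant tools is worth the cost for revision data, the same argument seems to apply at least as strongly to the source text.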

Glossary Module

I am not a term expert, but I am concerned that this schema is overly simplistic. There is no way to correlate term entries with segment content. The per-term metadata is very limited; in particular, term variations are not supported.

Please contact me with any questions.

Thanks,

Chase Tingley <chase@spartansoftwareinc.com>

Spartan Software Inc.

San Francisco, CA
