xliff message

Subject: RE: [xliff] 2.0 Validations Module Proposal
From: "Estreen, Fredrik" <Fredrik.Estreen@lionbridge.com>
To: Ryan King <ryanki@microsoft.com>, Yves Savourel <ysavourel@enlaso.com>, "xliff@lists.oasis-open.org" <xliff@lists.oasis-open.org>
Date: Wed, 28 Nov 2012 15:13:02 +0000
Hi Ryan, Yves,

I have been thinking about this proposal for some time now and have a few suggestions and comments. First, I'd like to point out that I'm really in favor of a validation feature in the standard. But to make it part of the standard we need to ensure that it provides an interoperable baseline and works well with the other features in the standard. I have added to the discussion inline below and put some additional thoughts at the end.

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: den 20 november 2012 03:06
> To: xliff@lists.oasis-open.org
> Subject: RE: [xliff] 2.0 Validations Module Proposal
> 
> > [ryanki] Good question. Maybe an attribute should be added to allow a
> > user to define the regex language to use?
> 
> That one step toward better interoperability and at the same time, possibly
> toward less:
> Tools can identify what is the regex syntax, but now they have to implement
> more than one syntax.
> I don't have an answer, just thinking aloud.
> 


I agree that we need to specify an existing flavor (or possibly a set of flavors) as the one used in the standard and, if several are allowed, provide a way to specify which one is in use. The other option is to go the long route of actually specifying what features and behavior we require of the regex processor and the provided syntax. The latter will give a more independent standard but will most likely be harder on implementers unless we restrict ourselves to a small common subset of regex.

> 
> > [ryanki] Since maxLength is just another type of validation, we would
> > advocate replacing it with the more general <validations> module.
> 
> That's a good argument.
> Based on the module proposal Fredrik just posted it seems a complex one.
> maybe maxLength is just one of the many profiles?
> just thinking aloud here too.
>

Given the difference in design between the two proposals, I think it would be quite difficult to merge them into a single coherent proposal as they stand. That could of course change. My other concern is that keeping modules small promotes acceptance and implementation of them by tool vendors. It is in my opinion much more likely that a small and simple module will be implemented completely and correctly (or at all) than that a large and complex one will. I really believe that the chance of getting both implemented is better if they are split into manageable pieces. But I'm not strictly against trying to merge them if the result turns out to be a coherent feature supporting both current proposals.
 
> 
> > [ryanki] If you have the following source “Hello Microsoft”
> > the tendency would be to use <mrk> to annotate it, or similarly, if I
> > have “Hello %s”, the tendency might be to use <ph> to encode it.
> > However, both cases introduce markup into my source that I may have to
> > normalize during recycling to get a 100% match. So having a noLoc rule
> > is a way to provide a “cleaner, no post-processing needed”
> > source for recycling.
> 
> But now you have also a "don't translate" information decoupled from the
> segment that tools have to carry along with it. In many use cases having the
> inline markup is simpler and easier to work with (e.g. send the text to MT,
> etc.) Just thinking aloud here too.
> 

I agree with Yves here that the rule type as proposed seems to overlap with the translate=no markup in the inline content. Since validation should be separate from the translation process, which I'll expand on below, I think it is more an issue of name and description than a technical one. I would propose renaming "noLoc" to "mustExist" and describing it as validating that certain text or a certain text pattern exists in the target. Then it can be used not only for the current use case but also for ensuring a specific localization, in addition to no localization. The ambiguity of the name is removed as well.
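
To make the two use cases concrete, here is a minimal Python sketch of what a "mustExist" check could look like. The function name, its parameters and the sample strings are hypothetical and not part of any proposal; Python's `re` syntax is used only because no regex flavor has been agreed on.

```python
import re

def must_exist(target, text, is_regex=False):
    """Return True if `text` (a literal or a regex) occurs in the target."""
    # A literal is escaped so that characters such as '%' or '.' are
    # matched as-is rather than interpreted as regex syntax.
    pattern = text if is_regex else re.escape(text)
    return re.search(pattern, target) is not None

# Use case 1: text that must survive unlocalized.
no_loc_ok = must_exist("Hallo Microsoft", "Microsoft")

# Use case 2: a specific localization that must be present.
specific_ok = must_exist("Klicken Sie auf OK", r"\bOK\b", is_regex=True)
```

The same check serves both cases, which is the point of the rename: the rule only asserts presence in the target and says nothing about how the text got there.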

> cheers,
> -yves
> 

When reading the proposal I could not find any dependence of the validation on source content. In essence the expressions are only tested against the target. I think it would be useful to be able to place common rules at a higher level of the document and predicate them with a pattern matched against the source. Like {if matches(source, "foo") then assert(matches(target, "bar"))}.
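
The predicate idea above could be sketched roughly as follows in Python. The function name and parameters are hypothetical, and Python's `re` flavor is only a stand-in for whatever regex flavor the module ends up specifying.

```python
import re

def check_predicated_rule(source, target, source_pattern, target_pattern):
    """Return True if the rule passes or does not apply, False if it fails."""
    # The rule only applies when the source matches the predicate.
    if re.search(source_pattern, source) is None:
        return True
    # Predicate met: the target must now match the expected pattern.
    return re.search(target_pattern, target) is not None

applies_and_passes = check_predicated_rule("say foo", "sag bar", "foo", "bar")
applies_and_fails = check_predicated_rule("say foo", "sag baz", "foo", "bar")
does_not_apply = check_predicated_rule("say hi", "sag baz", "foo", "bar")
```

Because a rule that does not apply simply passes, such a rule could be declared once at a higher level and be silently skipped for every segment whose source does not match.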

In the same vein of allowing more rules to be placed at a higher level, one could consider whether the multiplicity of matches should be specified for the expression, or whether expression writers should instead write expressions in such a way that the expression can only match a predefined number of times. The latter can quickly become complicated but is not impossible. Just think about matching the word "hello" exactly three times in an arbitrary string, rejecting it if there are four "hello" in it. If rules are predicated by source matches, do we want the multiplicity of the source match to be communicated to the expected match multiplicity on the target?
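
As an illustration of the trade-off, the Python sketch below contrasts the two approaches for the "hello exactly three times" case. The helper names and the idea of an explicit count argument are assumptions for illustration only, and the expression uses Python `re` syntax (negative lookahead), which not every regex flavor supports.

```python
import re

# Approach 1: the multiplicity is encoded in the expression itself.
# "(?:(?!hello).)*" consumes any text that does not start a new "hello",
# so the whole string must contain exactly three occurrences.
EXACTLY_THREE = re.compile(
    r"(?:(?!hello).)*(?:hello(?:(?!hello).)*){3}", re.DOTALL
)

def exactly_three_by_regex(text):
    return EXACTLY_THREE.fullmatch(text) is not None

# Approach 2: the expression stays simple and the multiplicity is
# specified separately (here as a hypothetical `count` argument).
def matches_n_times(text, pattern, count):
    return len(re.findall(pattern, text)) == count

three = "hello a hello b hello"
four = "hello a hello b hello c hello"
```

Both functions accept `three` and reject `four`, but the first expression is already hard to read for a single fixed word, which suggests that a separate multiplicity value would be easier on rule writers.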

An additional detail on the matching is that I do not see any information in the proposal about how it will relate to inline elements. Regex alone will not be able to take elements into account. I think SRX has some support for this, though I am not sure how off the top of my head. At a minimum we need to specify what the expressions should match against: the whole text of the target with inline elements removed, each substring delimited by inline elements, or the text after some transform that replaces the inline elements with text. I think allowing validation rules to take tagging into account would be useful and in the spirit of how the XLIFF format is intended to be used, so it would be good if we could find a solution that supports it. One possibility I have thought a little about is to allow match expressions to be specified using XPath instead of, or in addition to, regular expressions. Using XPath 2.0 would allow including regular expressions in it. I don't have any clear details on how this would work.
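
The three matching options listed above give different results for the same expression, as the following Python sketch tries to show. It is only an illustration: the sample target, the function names and the "[pc]...[/pc]" marker notation are invented here, and namespaces are left out for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical target content with two inline elements.
target = ET.fromstring(
    '<target>Click <pc id="1">OK</pc> to <ph id="2"/>continue</target>'
)

def text_without_inline(elem):
    """Option 1: whole text with inline elements removed."""
    return "".join(elem.itertext())

def text_runs(elem):
    """Option 2: each substring delimited by inline elements."""
    runs = [elem.text] if elem.text else []
    for child in elem:
        runs.extend(text_runs(child))
        if child.tail:
            runs.append(child.tail)
    return runs

def text_with_markers(elem):
    """Option 3: inline elements replaced by textual markers."""
    parts = [elem.text or ""]
    for child in elem:
        inner = text_with_markers(child)
        tag = child.tag
        parts.append(f"[{tag}]{inner}[/{tag}]" if inner else f"[{tag}/]")
        parts.append(child.tail or "")
    return "".join(parts)

pattern = re.compile(r"OK to")
hit_plain = pattern.search(text_without_inline(target)) is not None
hit_runs = any(pattern.search(run) for run in text_runs(target))
hit_markers = pattern.search(text_with_markers(target)) is not None
```

Here the expression "OK to" matches only under the first option, because in the other two the inline element boundary falls inside the would-be match. A rule written against one interpretation can therefore silently fail under another, which is why the specification would need to pick one.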

Since rules can be placed at the <segment> level, I think it would be good to define how a processor that supports this module should handle the rules when merging or splitting segments. The core rules for processors not supporting this module would likely not be able to produce good results unless the requirement is to drop the validation rules, at least not without the source predicate bit. With that bit the rules could most probably live at the <unit> level to start with, though. It might also be that the rules would become too complex, in which case no special rules should be provided for processors supporting this module and they should only use the core processing requirements for this task.

Lastly I think it should be clear that the validation is not supposed to replace or extend the normal behavior of placeable entities in the unit content provided by inline element markup. It should purely be a validation feature and there should not be any expectation that tools lock text or make specific text placeable in UI or other processes.

Regards,
Fredrik Estreen