

Subject: RE: [xliff] 2.0 Validations Module Proposal


Hi Ryan, Fredrik, all,

While using an XPath expression may have the nice effect of eliminating the problem of the different regex engines/syntaxes, it has:

a) the (minor) drawback of still having to deal with potentially several versions of XPath, as Ryan noted;

and b) the bigger drawback of being very difficult to implement when not working directly on the XLIFF documents.
We often seem to forget that XLIFF is an *exchange* format and that most tools on both ends read the document into a structure/database that has nothing to do with XML. Sure, one can convert the entry back into an XML fragment and apply the validation, but that is a lot of work.

I can't think of a good solution for this, but I'm also not sure XPath is a better solution than regex: it solves some issues but brings new ones.

cheers,
-ys

-----Original Message-----
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Ryan King
Sent: Thursday, November 29, 2012 12:56 PM
To: Estreen, Fredrik; Yves Savourel; xliff@lists.oasis-open.org
Subject: RE: [xliff] 2.0 Validations Module Proposal

Fredrik, thanks for the constructive feedback. There seems to have been a misconception that the original examples given were comprehensive, or were the ones that should necessarily be defined, but they weren't. They only served as examples of the direction we might take. The original intent was to provide a validations module that would 1) define a standard set of rules and descriptions that tool implementers could build business logic around and apply to the string or substring indicated by the match expression (defining those standard rules and which elements they affect would have been a wider committee activity), and 2) in the absence of a standard rule and description, allow a "custom rule", which would essentially be nothing more than the pass/fail of the match expression, with the action taken left to the user agent.

In the meantime, we have looked at what the current inline markup in the spec offers, and it does cover much of our current validation needs (though we still have the issue of normalizing strings for recycling, which we will need to deal with). Using XPath also sounds promising and easier for authors to write and maintain, plus it has the distinction of being a W3C standard (though we may need to think about 1.0 vs. 2.0, since 1.0 is in wider use). So, in the interest of time and simplicity, here is how we propose to move forward in this version of the standard: allow users to store XPath expressions, an identifier (e.g. a rule name), and possibly a description in the module. Tool implementers could evaluate the success/failure of the match and take some action for that particular identifier, according to their tool's business logic, based on the description. The module can live at the <file>, <group>, or <unit> level, and because rules may indeed become complex, only standard core processing rules would apply.

<val:validations>
  <validation rule="mustExist" desc="Match string should exist in both source and target.">{XPath Expression}</validation>
</val:validations>
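
As a purely illustrative sketch of what a stored rule might look like (the expression, rule name, and description below are made up, and assume XPath 1.0 with the <source> and <target> elements reachable from the evaluation context):

<val:validations>
  <validation rule="mustExist" desc="The product name must appear in both source and target.">contains(source, 'Microsoft') and contains(target, 'Microsoft')</validation>
</val:validations>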

It will be up to further discussion and debate whether we want to define a "standard" set of rules and descriptions in the module, ones that don't overlap with the current inline markup, or simply allow users to define their own set as a first implementation of the module.

Can we move forward with a vote on this one to approve or not approve?

Thanks,
Ryan

-----Original Message-----
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Estreen, Fredrik
Sent: Wednesday, November 28, 2012 7:13 AM
To: Ryan King; Yves Savourel; xliff@lists.oasis-open.org
Subject: RE: [xliff] 2.0 Validations Module Proposal

Hi Ryan, Yves,

I have been thinking about this proposal for some time now and have a few suggestions and comments. First, I'd like to point out that I'm really in favor of a validation feature in the standard. But to make it part of the standard we need to ensure it provides an interoperable baseline and works well with the other features in the standard. I have added to the discussion inline below and put some additional thoughts at the end.

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org]
> On Behalf Of Yves Savourel
> Sent: den 20 november 2012 03:06
> To: xliff@lists.oasis-open.org
> Subject: RE: [xliff] 2.0 Validations Module Proposal
>
> > [ryanki] Good question. Maybe an attribute should be added to allow 
> > a user to define the regex language to use?
>
> That's one step toward better interoperability and, at the same time, 
> possibly a step toward less:
> Tools can identify which regex syntax is used, but now they have to 
> implement more than one syntax.
> I don't have an answer, just thinking aloud.
>


I agree that we need to specify either an existing flavor (or possibly a set of them) as the one used in the standard, and if several are allowed, provide a way to specify which one is in use. The other option is to go the long route of actually specifying what features and behavior we require of the regex processor and the syntax we provide. The latter would give a more independent standard but would most likely be harder on implementers, unless we restrict ourselves to a small common subset of regex.
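
Just to sketch the first option (the attribute name and value here are invented for the example, not a proposal):

<validation rule="custom" regexFlavor="ecmascript" desc="The target must contain a four-digit year.">[0-9]{4}</validation>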

>
> > [ryanki] Since maxLength is just another type of validation, we 
> > would advocate replacing it with the more general <validations> module.
>
> That's a good argument.
> Based on the module proposal Fredrik just posted it seems a complex one.
> maybe maxLength is just one of the many profiles?
> just thinking aloud here too.
>

Given the difference in design between the two proposals I think it would be quite difficult to merge them into a coherent single proposal as is. That could of course change. My other concern is that keeping modules small promotes acceptance and implementation of them by tool vendors. It is in my opinion much more likely that a small and simple module will be implemented completely and correctly (or at all) than that a large and complex one will. I really believe that the chance of getting both implemented is bigger if they are split into manageable pieces. But I'm not strictly against trying to merge it if it turns out to result in a coherent feature supporting both current proposals.

>
> > [ryanki] If you have the following source “Hello Microsoft”
> > the tendency would be to use <mrk> to annotate it, or similarly, if 
> > I have “Hello %s”, the tendency might be to use <ph> to encode it.
> > However, both cases introduce markup into my source that I may have 
> > to normalize during recycling to get a 100% match. So having a noLoc 
> > rule is a way to provide a “cleaner, no post-processing needed”
> > source for recycling.
>
> But now you also have "don't translate" information decoupled from 
> the segment that tools have to carry along with it. In many use cases 
> having the inline markup is simpler and easier to work with (e.g. sending 
> the text to MT, etc.). Just thinking aloud here too.
>

I agree with Yves here that the rule type as proposed seems to overlap with the translate=no markup in the inline content. Since validation should be separate from the translation process, which I'll expand on below, I think this is more an issue of name and description than a technical one. I would propose renaming "noLoc" to "mustExist", described as validating that certain text or a text pattern exists in the target. Then it can be used both for the current use case and for assuring a specific localization in addition to no localization, and the ambiguity of the name is removed.
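
To illustrate, the renamed rule could cover both cases with the same semantics (the content below is made up, and the expressions are assumed to be regular expressions matched against the target):

<validation rule="mustExist" desc="The brand name must be left untranslated.">Microsoft</validation>
<validation rule="mustExist" desc="The agreed translation of the product name must be used.">Produktname</validation>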

> cheers,
> -yves
>

When reading the proposal I could not find any dependency of the validation on source content. In essence the expressions are only tested against the target. I think it would be useful to be able to place common rules at a higher level of the document and predicate them on a pattern matched against the source, like {if matches(source, "foo") then assert(matches(target, "bar"))}.
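
In module syntax that might look something like this (the rule name and the srcMatch attribute are invented for the example):

<validation rule="conditionalExist" srcMatch="foo" desc="If the source matches 'foo', the target must match 'bar'.">bar</validation>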

In the same vein of allowing more rules to be placed at a higher level, one could consider whether the multiplicity of matches should be specified for the expression, or whether expression writers should instead write expressions in such a way that they can only match a predefined number of times. The latter can quickly become complicated, but it is not impossible: just think about matching the word "hello" exactly three times in an arbitrary string, rejecting it if there are four occurrences of "hello" in it. If rules are predicated on source matches, do we want the multiplicity to be communicated between the source match and the expected match multiplicity on the target?
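
To make the complexity concrete: a rule accepting a target only if it contains "hello" exactly three times could be written as a single anchored expression, roughly like this (sketch only),

<validation rule="custom" desc="The target must contain 'hello' exactly three times.">^(?:(?!hello).)*(?:hello(?:(?!hello).)*){3}$</validation>

whereas an explicit multiplicity attribute on the rule would keep the expression itself trivial.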

An additional detail on the matching is that I do not see any information in the proposal about how it relates to inline elements. Regex alone will not be able to take elements into account; I think SRX has some support for this, though I'm not sure how off the top of my head. At a minimum we need to specify what the expressions match against: the whole text of the target with inline elements removed, each substring delimited by inline elements, or some transform that replaces the inline elements with text. Since allowing validation rules to take tagging into account would be useful, and in the spirit of how the XLIFF format is intended to be used, it would be good if we could find a solution that supports it. One possibility I have thought a little about is to allow match expressions to be specified using XPath instead of, or in addition to, regular expressions. Using XPath 2.0 would allow including regular expressions in it. I don't have any clear details on how this would work.
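
As a very rough sketch of the XPath 2.0 idea (the evaluation context, with <target> accessible as a node, is an assumption on my part):

<validation rule="custom" desc="The target text, ignoring inline markup, must contain a version number.">matches(string-join(target//text(), ''), 'v[0-9]+\.[0-9]+')</validation>

Here the structural part (selecting the text nodes under <target>) and the lexical part (the regex inside matches()) are combined in one expression.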

Since rules can be placed at the <segment> level, I think it would be good to define how a processor merging or splitting segments should handle the rules if it supports this module. The core rules for processors not supporting this module would likely not produce good results, unless the requirement is to drop the validation rules, at least not without the source predicate bit. With that bit, the rules could most probably live at the <unit> level to start with. It might also be that such rules would become too complex, and that no special rules for processors supporting this module should be provided; they would simply use the core processing requirements for this task.

Lastly, I think it should be clear that the validation is not supposed to replace or extend the normal behavior of placeable entities in the unit content provided by inline element markup. It should be purely a validation feature, and there should be no expectation that tools lock text or make specific text placeable in the UI or other processes.

Regards,
Fredrik Estreen



