xliff message

Subject: RE: [xliff] 2.0 Validations Module Proposal
From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: Fri, 30 Nov 2012 05:57:52 -0700
Hi David, Kevin, all,


> Well, this was extensively discussed on ITS, but the use 
> case was just a subset of the validations we would cover.
> ITS 2.0 is able to specify allowed characters using a regexp.
> @Yves, would you care to summarize on what regexp set the group consolidated?

Currently the syntax for the ITS Allowed Character feature is a regular expression as defined in the Character Class of XML Schema.

But, as David noted, the context for this ITS feature and the validation feature we are talking about here are different: Allowed Characters needs only a limited regular expression to work, while for the validation case we need a full-fledged regular expression mechanism.

In Allowed Characters a subset of the current syntax could be used that would a) be enough to work in all use cases, and b) be common to most regular expression engines. The fact that ITS is not using that solution is not understandable to me. But that is irrelevant in our case.
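
To make the "subset" idea concrete, here is a minimal sketch in Python. The definition of the subset (a single bracket expression of literal characters and ranges, with optional negation, and no escapes or class subtraction) and the function names are assumptions for illustration, not what ITS specifies.

```python
import re

# Assumed "portable subset": one bracket expression made of literal
# characters and ranges, with an optional leading ^ for negation.
# Escapes such as \p{...} and class subtraction are rejected.
_PORTABLE_CLASS = re.compile(r"\[\^?(?:[^\]\[\\^-](?:-[^\]\[\\^-])?)+\]")


def is_portable_class(pattern: str) -> bool:
    """Return True if the pattern stays inside the assumed subset."""
    return _PORTABLE_CLASS.fullmatch(pattern) is not None


def uses_only_allowed(text: str, char_class: str) -> bool:
    """Check that every character of text matches the character class."""
    if not is_portable_class(char_class):
        raise ValueError(f"not in the portable subset: {char_class!r}")
    return re.fullmatch(f"(?:{char_class})*", text) is not None


print(uses_only_allowed("Hello World 42", "[a-zA-Z0-9 ]"))
print(uses_only_allowed("Hello!", "[a-zA-Z0-9 ]"))
```

A class written this way should be read the same way by most engines, which is the point of restricting the syntax.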

In the validation feature case, using XPath or XQuery has the advantage of being a single common syntax, but has, in my opinion, the same disadvantages as any other regular expression engine: it may not be available in all programming languages, and it may be difficult to implement when the tools are working outside of XLIFF, which is *just an exchange format*.


> ...Despite having XLIFF represented in a proprietary DB structure, the XLIFF processor 
> must be able to recreate the XLIFF no later than for hand back. 

That is incorrect: most modern software tools work based on components. In the case of a validation component, that component is likely to not know anything about XLIFF, because importing the XLIFF document into the system's own document model is the job of another component.


> ...If it is too cumbersome for them to generate the XLIFFs or XLIFF fragments 
> for validation purposes on the fly, nothing prevents them to interpret the 
> XPath/XQuery validation rule into an Regex friendly to their particular platform.

Indeed, one can do anything in programming. But there is a big difference between 'nothing prevents you from doing...' and doing it.
To illustrate this: our tool set will not have a conformant implementation of the ITS Allowed Characters feature if it ends up using today's choice of syntax.
No matter how much a standard must look at the users' interest, ultimately the implementers decide how far they are willing to go to accommodate interoperability.


> ...There are also tools that use XLIFF as its native processing format,
> and for them obviously the XPath or XQuery methods are the simplest.

I recall explicitly stating long ago for the record that the TC members must never forget that XLIFF is *only* a *tool neutral exchange* format.
It is not intended to be used as the native representation of any tool.
Some tools do use it natively. Great: more power to them. But that is irrelevant for the TC.
Arguing that a feature is better done one way because it's easier for the tools that use XLIFF natively is, in my opinion, a big misstep: it demonstrates that one's judgment for anything XLIFF is not completely in line with its main goal: being an exchange format. It's very hard to not make that misstep. But that's ok as long as someone raises the red flag when that happens.

This said, using XPath or XQuery because a tool works with XML (not specifically XLIFF) is a valid argument.

The main issue is that with the validation feature we are drifting away from XLIFF being a mere representation of the source/target content to storing metadata that cannot really be tool agnostic. It runs against the "tool neutral" aspect of XLIFF. That is where the borderline between tool-specific extensions and modules becomes blurred. I'm not sure there are good solutions for this besides identifying the type of data used, in this case the regular expression syntax.

Regards,
-yves








XQuery would be more SQL-friendly, so it would make it easier for an SQL-based XLIFF processor to transform the validation rules into a format that it can use natively.

Cheers
dF


Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
cellphone: +353-86-0222-158 
facsimile: +353-6120-2734
mailto: david.filip@ul.ie


On Fri, Nov 30, 2012 at 10:50 AM, Dr. David Filip <David.Filip@ul.ie> wrote:
Well, this was extensively discussed on ITS, but the use case was just a subset of the validations we would cover.
ITS 2.0 is able to specify allowed characters using a regexp.
@Yves, would you care to summarize on what regexp set the group consolidated?
I guess that the allowed characters are an important use case and it would be good to be consistent here.

Cheers
dF




On Fri, Nov 30, 2012 at 1:34 AM, Kevin O'Donnell <kevinod@microsoft.com> wrote:
Hi Yves,

Thanks for your input; you make a valid point with (b) below. Indeed, we recognize that XPath could prove troublesome when dealing with validation outside of the XLIFF document (which is a typical scenario).

Given this, from our perspective, RegEx would be a more suitable choice for this proposal, as originally stated. Of course, it has some limitations, but we believe it's important to have a degree of certainty with the validation module, to have consistent implementation and support for validation in a wide variety of tools. If we leave the choice of rule engine open, we risk having no consistent support for rules. That said, it's not an easy choice to decide which RegEx engine(s) to officially support. We need some time to research the appropriate solution - perhaps others have opinions on worthy selections?

Thanks,
Kevin.


-----Original Message-----
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Yves Savourel
Sent: Thursday, November 29, 2012 12:59 PM
To: xliff@lists.oasis-open.org
Subject: RE: [xliff] 2.0 Validations Module Proposal

Hi Ryan, Fredrik, all,

While using an XPath expression may have the nice effect of eliminating the problem of the different Regex engines/syntaxes, it has:

a) the (minor) drawback of still having to deal with potentially several versions of XPath, as Ryan noted.

and b) it is very difficult to implement when not working directly on the XLIFF documents.
It seems we often tend to forget that XLIFF is an *exchange* format and most tools on both ends will read the document into a structure/database that has nothing to do with XML. Sure, one can convert the entry back into an XML fragment and apply the validation, but that is a lot of work.
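
To give a rough idea of what "converting back" involves, here is a minimal Python sketch. The row layout, the namespace-free fragment, and the function names are assumptions for illustration; a real tool's internal model would differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical row as a tool might store it internally; the field
# names are assumptions for illustration only.
row = {"id": "u1", "source": "Hello %s", "target": "Bonjour %s"}


def row_to_unit(row: dict) -> ET.Element:
    """Rebuild a minimal, namespace-free <unit> fragment from a row."""
    unit = ET.Element("unit", id=row["id"])
    segment = ET.SubElement(unit, "segment")
    ET.SubElement(segment, "source").text = row["source"]
    ET.SubElement(segment, "target").text = row["target"]
    return unit


def rule_matches(unit: ET.Element, path: str) -> bool:
    """Evaluate a path expression against the rebuilt fragment."""
    return unit.find(path) is not None


unit = row_to_unit(row)
# ElementTree only supports a small XPath subset; a full XPath 1.0 or
# 2.0 rule would need a third-party engine on top of this.
print(rule_matches(unit, "./segment/target"))
print(rule_matches(unit, "./segment[target='Bonjour %s']"))
```

Even this toy version needs a serializer for the internal model plus an XPath engine, and it ignores inline markup and namespaces entirely.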

I can't think of a good solution for this, but I'm also not sure XPath is a better solution than regex: it solves some issues but brings new ones.

cheers,
-ys

-----Original Message-----
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Ryan King
Sent: Thursday, November 29, 2012 12:56 PM
To: Estreen, Fredrik; Yves Savourel; xliff@lists.oasis-open.org
Subject: RE: [xliff] 2.0 Validations Module Proposal

Fredrik, thanks for the constructive feedback. It seems that there has been some misconception that the original examples given were comprehensive, or the ones that should necessarily be defined, but they weren't. They only served as examples of the direction we might take. The original intent was to provide a validations module that would 1) define a standard set of rules and descriptions that tool implementers could build business logic around to apply to the string or substring indicated by the match expression (defining those standard rules and what elements they affect would have been a wider committee activity) and 2) in the absence of a standard rule and description, allow a "custom rule", which would essentially be nothing more than the pass/fail of the match expression, with the action taken left to the user agent.

In the meantime, we have looked at what the current inline markup in the spec offers and it does cover much of our current validation needs (though we still have the issue of normalizing strings for recycling, which we will need to deal with). Using XPath also sounds promising and easier for authors to write and maintain, plus it has the distinction of being a W3C standard (but we may need to think about 1.0 vs. 2.0 since 1.0 is in wider use). So, in the interest of time and simplicity, here is how we propose to move forward in this version of the standard: allow users to store XPath expressions, an identifier (e.g. a rule name), and possibly a description in the module. Tool implementers could evaluate the success/failure of the match and take some action for that particular identifier according to their tool's business logic based on the description. The module can live at the <file>, <group>, or <unit> level, and because rules may indeed become complex, only standard core processing rules would apply.

<val:validations>
  <validation rule="mustExist" desc="Match string should exist in both source and target.">{XPath Expression}</validation>
</val:validations>
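
As a rough illustration of how a tool might consume such a module, here is a minimal Python sketch. The namespace URI, the prefixed inner element, the sample unit, and the function name are all assumptions; the proposal does not define them. ElementTree only implements a small XPath subset, so the stored expression here is deliberately trivial.

```python
import xml.etree.ElementTree as ET

# The namespace URI is a placeholder; the proposal does not define one.
VAL_NS = "urn:example:validation"

doc = f"""
<unit id="u1" xmlns:val="{VAL_NS}">
  <val:validations>
    <val:validation rule="mustExist"
        desc="Match string should exist in both source and target.">./segment/target</val:validation>
  </val:validations>
  <segment>
    <source>Hello</source>
    <target>Bonjour</target>
  </segment>
</unit>
"""


def evaluate_rules(unit: ET.Element) -> dict:
    """Map each rule identifier to the pass/fail result of its expression."""
    results = {}
    for rule in unit.iterfind(f"{{{VAL_NS}}}validations/{{{VAL_NS}}}validation"):
        expression = (rule.text or "").strip()
        results[rule.get("rule")] = unit.find(expression) is not None
    return results


unit = ET.fromstring(doc)
print(evaluate_rules(unit))
```

What the tool does with a failed rule is left to its own business logic, as described above.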

It will be up to further discussion and debate if we want to define a "standard" set of rules and descriptions in the module, which don't overlap with current inline markup, or simply allow users to define their own set as a first implementation of the module.

Can we move forward with a vote on this one to approve or not approve?

Thanks,
Ryan

-----Original Message-----
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Estreen, Fredrik
Sent: Wednesday, November 28, 2012 7:13 AM
To: Ryan King; Yves Savourel; xliff@lists.oasis-open.org
Subject: RE: [xliff] 2.0 Validations Module Proposal

Hi Ryan, Yves,

I have been thinking about this proposal for some time now and have a few suggestions and comments. First I'd like to point out that I'm really in favor of a validation feature in the standard. But to make it part of the standard we need to ensure it provides an interoperable baseline and works well with the other features in the standard. I have added to the discussion inline below and put some additional thoughts at the end.

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org]
> On Behalf Of Yves Savourel
> Sent: den 20 november 2012 03:06
> To: xliff@lists.oasis-open.org
> Subject: RE: [xliff] 2.0 Validations Module Proposal
>
> > [ryanki] Good question. Maybe an attribute should be added to allow
> > a user to define the regex language to use?
>
> That's one step toward better interoperability and at the same time,
> possibly toward less:
> Tools can identify what is the regex syntax, but now they have to
> implement more than one syntax.
> I don't have an answer, just thinking aloud.
>


I agree that we need to specify an existing flavor (or possibly a set of flavors) as the one used in the standard and, if several are allowed, provide a way to specify which is in use. The other option is to go the long route of actually specifying what features and behavior we require of the regex processor and the provided syntax. The latter will give a more independent standard but will most likely be harder on the implementers unless we restrict ourselves to a small common subset of regex.
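
A minimal Python sketch of what declaring the flavor would mean for a tool. The flavor identifiers and the function name are hypothetical; nothing in the proposal defines them.

```python
import re
from typing import Callable, Dict, Optional

# Flavor identifiers are hypothetical; the proposal does not define any.
# A tool registers one compiler per flavor it actually implements.
COMPILERS: Dict[str, Callable[[str], re.Pattern]] = {
    "python-re": re.compile,
}


def compile_rule(expression: str, flavor: str) -> Optional[re.Pattern]:
    """Compile the expression if this tool implements the declared flavor."""
    compiler = COMPILERS.get(flavor)
    if compiler is None:
        # Flavor not implemented: the rule cannot be evaluated here.
        return None
    return compiler(expression)


print(compile_rule(r"\d+", "python-re") is not None)
print(compile_rule(r"\d+", "some-other-flavor") is None)
```

The declaration tells the tool which syntax is in use, but every additional allowed flavor is another entry each implementer has to supply.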

>
> > [ryanki] Since maxLength is just another type of validation, we
> > would advocate replacing it with the more general <validations> module.
>
> That's a good argument.
> Based on the module proposal Fredrik just posted it seems a complex one.
> maybe maxLength is just one of the many profiles?
> just thinking aloud here too.
>

Given the difference in design between the two proposals I think it would be quite difficult to merge them into a coherent single proposal as is. That could of course change. My other concern is that keeping modules small promotes acceptance and implementation of them by tool vendors. It is in my opinion much more likely that a small and simple module will be implemented completely and correctly (or at all) than that a large and complex one will. I really believe that the chance of getting both implemented is bigger if they are split into manageable pieces. But I'm not strictly against trying to merge it if it turns out to result in a coherent feature supporting both current proposals.

>
> > [ryanki] If you have the following source “Hello Microsoft”
> > the tendency would be to use <mrk> to annotate it, or similarly, if
> > I have “Hello %s”, the tendency might be to use <ph> to encode it.
> > However, both cases introduce markup into my source that I may have
> > to normalize during recycling to get a 100% match. So having a noLoc
> > rule is a way to provide a “cleaner, no post-processing needed”
> > source for recycling.
>
> But now you have also a "don't translate" information decoupled from
> the segment that tools have to carry along with it. In many use cases
> having the inline markup is simpler and easier to work with (e.g. send
> the text to MT,
> etc.) Just thinking aloud here too.
>

I agree with Yves here that the rule type as proposed seems to overlap with the translate=no markup in the inline content. Since validation should be separate from the translation process, which I'll expand on below, I think it is more of an issue with name and description than a technical one. I would propose to rename "noLoc" to "mustExist" and describe it as validating that certain text or a text pattern exists in the target. Then it can be used for the current use case but also for assuring specific localization in addition to no localization. And the ambiguity of the name is removed.

> cheers,
> -yves
>

When reading the proposal I could not find any dependence of the validation on source content. In essence the expressions are only tested against the target. I think it would be useful to be able to place common rules on a higher level of the document and predicate them with a pattern matched against source. Like {if matches(source, "foo") then assert(matches(target, "bar"))}.
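
The predicated rule above could be sketched in Python as follows. The class and field names are assumptions for illustration, and Python's `re` stands in for whichever regex flavor would be chosen.

```python
import re
from dataclasses import dataclass


@dataclass
class PredicatedRule:
    name: str
    source_pattern: str  # the rule applies only if this matches the source
    target_pattern: str  # must then match the target


def check(rule: PredicatedRule, source: str, target: str) -> bool:
    """Return True if the rule passes or does not apply to this source."""
    if re.search(rule.source_pattern, source) is None:
        return True  # predicate not met: nothing to assert
    return re.search(rule.target_pattern, target) is not None


rule = PredicatedRule("foo-needs-bar", "foo", "bar")
print(check(rule, "some foo text", "un texte bar"))
print(check(rule, "some foo text", "un texte"))
print(check(rule, "unrelated", "sans rapport"))
```

Because the rule only fires when the source matches, it can sit at a higher level of the document without producing false failures on unrelated entries.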

In the same vein of allowing more rules to be placed on a higher level, one could consider whether the multiplicity of matches should be specified for the expression or whether expression writers should instead write expressions in such a way that the expression can only match a predefined number of times. The latter can quickly become complicated but is not impossible. Just think about matching the word "hello" exactly three times in an arbitrary string, rejecting it if there are four "hello" in it. If rules are predicated by source matches, do we want the multiplicity to be communicated between the source match and the expected match multiplicity on the target?
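
The two options for the "hello" case can be compared in a short Python sketch. The function name is an assumption, and Python's `re` syntax is used only for illustration.

```python
import re


def count_ok(pattern: str, text: str, expected: int) -> bool:
    """Option 1: simple pattern, multiplicity specified separately."""
    return len(re.findall(pattern, text)) == expected


# Option 2: the multiplicity is encoded in the expression itself.
# Each repetition skips characters that do not start "hello", then
# consumes one "hello"; the tail must not contain another one.
EXACTLY_THREE_HELLO = re.compile(
    r"(?s)\A(?:(?:(?!hello).)*hello){3}(?:(?!hello).)*\Z"
)

three = "hello, hello and hello"
four = "hello hello hello hello"
print(count_ok("hello", three, 3), count_ok("hello", four, 3))
print(bool(EXACTLY_THREE_HELLO.match(three)), bool(EXACTLY_THREE_HELLO.match(four)))
```

Both give the same answers, but the second expression is much harder to write and read, and it relies on lookahead, which not every regex flavor supports.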

An additional detail on the matching is that I do not see any information in the proposal about how it will relate to inline elements. Regex alone will not be able to take elements into account. I think SRX has some support for it, though I'm not sure how off the top of my head. At a minimum we need to specify what the expressions should match against: the whole text of the target with inline elements removed, each substring delimited by inline elements, or the text after some transform that replaces the inline elements with text. I think allowing validation rules to take tagging into account would be useful and in the spirit of how the XLIFF format is intended to be used, so it would be good if we could find a solution that supports them. One possibility I have thought a little about is to allow match expressions to be specified using XPath instead of or in addition to regular expressions. Using XPath 2.0 would allow including regular expressions in it. I don't have any clear details on how this would work.
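
The three candidate inputs for matching can be sketched in Python. The sample target, the marker character, and the function names are assumptions for illustration; the inline elements are shown without namespaces.

```python
import xml.etree.ElementTree as ET

target = ET.fromstring(
    '<target>Click <pc id="1">here</pc> to open <ph id="2"/> now.</target>'
)

MARK = "\uFFFC"  # OBJECT REPLACEMENT CHARACTER, an arbitrary choice


def text_without_inline(elem: ET.Element) -> str:
    """Whole text with inline tags dropped (their text content is kept)."""
    return "".join(elem.itertext())


def text_runs(elem: ET.Element) -> list:
    """Substrings delimited by inline elements, in document order."""
    return list(elem.itertext())


def text_with_markers(elem: ET.Element) -> str:
    """Whole text with each inline tag replaced by a marker character."""
    parts = [elem.text or ""]
    for child in elem:
        if child.text is None and len(child) == 0:
            parts.append(MARK)  # standalone element
        else:
            parts.append(MARK + text_with_markers(child) + MARK)
        parts.append(child.tail or "")
    return "".join(parts)


print(repr(text_without_inline(target)))
print(text_runs(target))
print(repr(text_with_markers(target)))
```

A rule such as "open now" would match none of the three inputs here, while "Click here" would match the first only, which is why the matching input needs to be specified.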

Since rules can be placed on the <segment> level I think it would be good to define how a processor merging or splitting segments should handle the rules if it supports this module. The core rules for processors not supporting this module would likely not be able to produce good results unless the requirement is to drop the validation rules, at least not without the source predicate bit. With that bit the rules could most probably live at the <unit> level to start with though. It might also be that the rules would become too complex, in which case no special rules for processors supporting this module should be provided and they should only use the core processing requirements for this task.

Lastly I think it should be clear that the validation is not supposed to replace or extend the normal behavior of placeable entities in the unit content provided by inline element markup. It should purely be a validation feature and there should not be any expectation that tools lock text or make specific text placeable in UI or other processes.

Regards,
Fredrik Estreen


