RE: [xliff] 2.0 Validations Module Proposal

Thanks Yves for the comments and feedback. See our response inline.

From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Yves Savourel
Sent: Friday, November 16, 2012 6:01 AM
To: xliff@lists.oasis-open.org
Subject: RE: [xliff] 2.0 Validations Module Proposal

Hi Ryan, all,

I think a validation module would be quite nice to have.

It would allow catching many issues where they really need to be caught: when translating.

A few notes of things to possibly consider:

- What regular expression syntax should the module use? ICU? .NET? Perl? Java? XSD? ECMA? Other?
For interoperability purposes it is quite important to have a well-defined way to write the regexes.
I don’t have an answer. It’s just that there are precedents in SRX and ITS that demonstrate the problem is not easy to solve.

[ryanki] Good question. Maybe an attribute should be added to allow a user to define the regex language to use?

- I notice the maxLength rule. How would this fit with the proposal that Fredrik put forward about length and size restrictions?
See https://wiki.oasis-open.org/xliff/XLIFF2.0/Feature/Length%20and%20Size%20Restrictions
Or with the ITS Storage Size data category that would be in an ITS module.
Somehow we would have to make sure there is one way to check one thing.

[ryanki] Since maxLength is just another type of validation, we would advocate replacing it with the more general <validations> module.

- Maybe the ‘custom rule’ could be defined with a clearer PR. For example, the case of the email pattern doesn’t tell you if there is a problem. Maybe a more generic way to work with custom patterns could be to check whether a pattern matches the same number of occurrences in the source and in the target. For the email example, it would mean a red flag if the email is not found in the target.
One could have more sophisticated options too, like having a pattern for both the source and the target.
Checker tools like XBench, QADistiller, etc. have put a lot of thought into this. It would be nice to have equivalence.

[ryanki] Along with “well-known” rules like noLoc, maxLength, minLength, etc., there should be a generic one called matchStatus (or something similar), where only the success or failure of the match can be acted upon. Your example of source-target comparison should probably be specified as one of the well-known rules. That leaves “true” custom rules to be defined with an x- prefix, so that tools that don’t know anything beyond the “well-known” set can safely ignore them.
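
The source-target comparison discussed above could be sketched roughly as follows. This is only an illustration, not part of the proposal: it assumes Python's re syntax (the regex flavor is still an open question in this thread), and the function name count_mismatch and the EMAIL pattern are hypothetical.

```python
import re

# Hypothetical email pattern, written for Python's re module.
EMAIL = r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,4}\b"


def count_mismatch(pattern: str, source: str, target: str) -> bool:
    """Return True when the pattern occurs a different number of times
    in the source and in the target (a "red flag")."""
    source_hits = sum(1 for _ in re.finditer(pattern, source))
    target_hits = sum(1 for _ in re.finditer(pattern, target))
    return source_hits != target_hits


source = "Contact me at someCompany: user@somecompany.com"
good_target = "Kontaktieren Sie mich unter someFirma: user@somecompany.com"
bad_target = "Kontaktieren Sie mich unter someFirma"

print(count_mismatch(EMAIL, source, good_target))
print(count_mismatch(EMAIL, source, bad_target))
```

A tool could raise the flag only when the counts differ, which is the behavior the email example asks for: the address is present in the source, so its absence from the target is reported.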

- It seems noLoc would be very similar to <mrk id='1' translate='no'>...</mrk>. A rationale to justify both methods would be nice to offer to the implementers.

[ryanki] If you have the following source “Hello Microsoft” the tendency would be to use <mrk> to annotate it, or similarly, if I have “Hello %s”, the tendency might be to use <ph> to encode it. However, both cases introduce markup into my source that I may have to normalize during recycling to get a 100% match. So having a noLoc rule is a way to provide a “cleaner, no post-processing needed” source for recycling.

That’s all I have for now.

-yves

From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Ryan King
Sent: Thursday, November 15, 2012 5:01 PM
To: xliff@lists.oasis-open.org
Subject: [xliff] 2.0 Validations Module Proposal

In anticipation of closing down on 2.0, we have two new proposals for modules. In this mail, we are proposing the first of the two, a Validation module.

Validating localized target data is a very important part of the business of outsourcing localization, especially when the extracted source content comes from software. Typically, there is a plethora of tools that content providers and localization suppliers use to perform a multitude of validations. There is a strong desire in the industry to bring some consistency to this space, but there are currently no accepted standards or interchange formats that facilitate this activity. We would like to propose a Validation module that would help with standardizing this crucial activity.

The basic idea would be to define a small set of standard validation rules and standard descriptions for them that tool developers could consistently build business logic around. How a rule is applied to a string or sub-string would be done using regular expressions. These would all be contained in a Validations module.

Here’s a draft of the Module for comment:

Validations Module
The target text of a document can be verified against various validation rules. The Validations Module should be able to store a list of pre-defined validation rules, along with a description about how to process the target text using those rules, to perform specific verifications.

Module Specification

Module Namespace

The namespace for the Validations module is: urn:oasis:names:tc:xliff:validations:2.0

Module Elements

The elements defined in the Validations module are: <validations>, <validation>, and <matchExpression>.

Tree Structure

Legend:

1 = one
+ = one or more
? = zero or one

<validations> +
|
+---<validation> +
    |
    +---<matchExpression> 1

validations

Collection of validations to be applied by a validation engine

Contains:

- One or more <validation> elements

Parents:

Attributes:

- name

validation

Specifies a validation rule, and a description and regular expression, which define how to apply that validation rule to the target text.

Contains:

- One <matchExpression> element

Parents:

Attributes:

- id, rule, desc

matchExpression

A regular expression used to match the target text or substring to which the validation rule is applied.

Contains:

- A regular expression

Parents:

Attributes:

- none
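
As a rough illustration of how a tool might read the three elements described above, here is a minimal Python sketch using the standard library's ElementTree. The sample document, the val: prefix on every element, and the attribute values are assumptions made for the sketch; only the namespace URI, the element names, and the attribute names come from the draft.

```python
import xml.etree.ElementTree as ET

VAL_NS = "urn:oasis:names:tc:xliff:validations:2.0"

# Hypothetical sample; the id, name and desc values are made up.
SAMPLE = r"""
<val:validations xmlns:val="urn:oasis:names:tc:xliff:validations:2.0" name="sample">
  <val:validation id="v1" rule="noLoc" desc="Match string shouldn't be localized">
    <val:matchExpression>\bsomeCompany\b</val:matchExpression>
  </val:validation>
</val:validations>
"""


def load_validations(xml_text: str) -> list:
    """Return one dict per <validation> with its id, rule, desc and expression."""
    root = ET.fromstring(xml_text.strip())
    ns = {"val": VAL_NS}
    rules = []
    for node in root.findall("val:validation", ns):
        expr = node.find("val:matchExpression", ns)
        rules.append({
            "id": node.get("id"),
            "rule": node.get("rule"),
            "desc": node.get("desc"),
            # A missing <matchExpression> is treated as an empty expression here.
            "expression": (expr.text or "").strip() if expr is not None else "",
        })
    return rules


for rule in load_validations(SAMPLE):
    print(rule["id"], rule["rule"], rule["expression"])
```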

Module Attributes

The attributes defined in the Validations module are: name, id, rule, and desc.

name
Name – The user-defined name of a named validations element.

Value description: NMTOKEN.

Default value: undefined

Used in: <validations>.

id

Identifier - A character string used to identify a <validation> element.

Value description: NMTOKEN.

Default value: undefined

The value must be unique within the <validations> element.

Used in: <validation>.

rule

Validation Rule - Indicates the rule that a validation engine should apply to the target text.

Value description: A paired value with desc. See table below.

Default value: undefined

Used in: <validation>

desc

Validation description – indicates how a specific rule should be applied to the target text.

Value description: A paired value with rule. See table below.

Default value: undefined

Used in: <validation>.

Possible values for rule and desc attributes (format and number of rules TBD):

Rule	Description
maxLength:100	Match string can’t be longer than # of chars specified.
minLength:10	Match string can’t be shorter than # of chars specified.
noLoc	Match string shouldn’t be localized
Etc.	Etc.
Any custom rule	Any custom description

Examples in XLIFF:

Using the following segment as an example

<segment>
<source> Contact me at someCompany: user@somecompany.com</source>
<target> Kontaktieren Sie mich unter someFirma: user@somecompany.com</target>
</segment>

maxLength:100
. Matches “Kontaktieren Sie mich unter someFirma: user@somecompany.com”.

The match succeeds, so the validation business logic checks whether the string is less than 100 chars. That check also succeeds, and the business logic then takes the appropriate action.

<val:validations>

</validation>

</val:validations>
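
The two-step logic of the maxLength example (match first, then compare the length) might look like this in Python. It is a sketch only: the name:limit parsing follows the rule format shown in the table, which is still marked TBD, and the function name check_max_length, the catch-all expression ".+", and the handling of a failed match are all assumptions.

```python
import re


def check_max_length(rule: str, expression: str, target: str) -> bool:
    """Apply a rule such as "maxLength:100" to the text matched by expression."""
    name, _, limit = rule.partition(":")
    if name != "maxLength":
        raise ValueError(f"not a maxLength rule: {rule!r}")
    match = re.search(expression, target)
    if match is None:
        # Assumption: nothing matched, so there is nothing to measure.
        return True
    # Per the table, the match string can't be longer than the limit.
    return len(match.group(0)) <= int(limit)


target = "Kontaktieren Sie mich unter someFirma: user@somecompany.com"
print(check_max_length("maxLength:100", ".+", target))
print(check_max_length("maxLength:10", ".+", target))
```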

noLoc

\bsomeCompany\b doesn’t match “someCompany” in the target text.

Validation business logic takes the appropriate action for the match failure.

<val:validations>

<matchExpression>\bsomeCompany\b</matchExpression>

</validation>

</val:validations>
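
A minimal Python sketch of the noLoc check, under the reading given above that the rule fails when the expression is not found in the target. The function name no_loc_ok is hypothetical, and Python's re syntax is assumed.

```python
import re


def no_loc_ok(expression: str, target: str, flags: int = 0) -> bool:
    """Return True when the protected string is still present in the target."""
    return re.search(expression, target, flags) is not None


expression = r"\bsomeCompany\b"
translated = "Kontaktieren Sie mich unter someFirma: user@somecompany.com"
untouched = "Kontaktieren Sie mich unter someCompany: user@somecompany.com"

print(no_loc_ok(expression, translated))
print(no_loc_ok(expression, untouched))
```

Case sensitivity matters here: with a case-insensitive match, the expression would also find “somecompany” inside the email address, and the localized “someFirma” would go unnoticed.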

Rules not defined in the Module can still be defined using the same mechanisms, though user agents that support the Validation Module may or may not have a built-in implementation for them. An example might be to check if the target text contains a valid email address.

validEmail
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b matches “user@somecompany.com”.

Validation business logic takes appropriate action for the match success.

<val:validations>

</validation>

</val:validations>
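
One practical detail of the validEmail expression above: it only lists upper-case letters, so whether it matches “user@somecompany.com” depends on the engine running case-insensitively. The sketch below shows this with Python's re module; the choice of Python and the function name find_email are assumptions for illustration.

```python
import re

# The validEmail expression as written in the proposal.
VALID_EMAIL = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b"


def find_email(target: str, flags: int = re.IGNORECASE):
    """Return the first email-like substring in target, or None."""
    match = re.search(VALID_EMAIL, target, flags)
    return match.group(0) if match else None


target = "Kontaktieren Sie mich unter someFirma: user@somecompany.com"

print(find_email(target))           # case-insensitive match
print(find_email(target, flags=0))  # case-sensitive: no match
```

This is one more reason to pin down the regex syntax and matching options in the module, since the same expression succeeds or fails depending on the flags a tool applies.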

Please let us know your opinion on this proposal.

Thanks,

Microsoft Corporation

(Ryan King, Kevin O'Donnell, Uwe Stahlschmidt, Alan Michael)
