
Subject: Removing annotations (and IDs again)
From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: Tue, 12 Nov 2013 07:14:36 -0700
Hi all,

While trying to implement removal of annotations I've run into a set of issues.

We have the following processing requirement in the specification:

[[
When a Modifier removes an <mrk> element or a pair of <sm> / <em> elements and the ref attribute is present, it MUST check whether
or not the URI referenced by the ref attribute is within the same <unit> as the removed element. If it is and no other element has a
reference to the referenced element, the Modifier MUST remove the referenced element.
]]

--- Value of the fragment identifier

The first issue is how to find the resource identified by the URI within the unit. In practice, to point to a resource that is within
the same document, we have used URIs made of only the # and the fragment identifier (e.g. ref="#m1").

I have not yet been able to find a formal definition of fragment identifier for XML, but various documents point to similar
answers: in XML a fragment identifier seems to follow the HTML convention that it should be the value of an id or name attribute and
"...must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ('-'), underscores
('_'), colons (":"), and periods ('.')."

That said, RFC 3986, which defines the generic URI syntax, leaves the definition of the fragment identifier to each MIME type
(http://tools.ietf.org/html/rfc3986#section-3.5):

"Individual media types may define their own restrictions on or structures within the fragment identifier syntax for specifying
different types of subsets, views, or external references that are identifiable as secondary resources by that media type."

So we could define a specific fragment identifier syntax for XLIFF if we need to, in order to allow NMTOKEN values and deal with
uniqueness issues. This way, finding a resource from within the document, as well as from a full URI, would be unambiguous.

I think this also affects modules like Resource Data, Glossary, etc. Anywhere we have IDs.
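
To make the mismatch concrete, here is a minimal Python sketch that checks a candidate ID against the HTML-style convention quoted above. The pattern is transcribed from that quotation only; the function name is made up for illustration, and nothing here is defined by XLIFF.

```python
import re

# Pattern transcribed from the HTML-style convention quoted above:
# a letter, then any number of letters, digits, '-', '_', ':' and '.'.
HTML_STYLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_:.-]*")

def is_html_style_fragment(value):
    """Return True if `value` fits the quoted convention (hypothetical helper)."""
    return HTML_STYLE_NAME.fullmatch(value) is not None

# "m1" fits the convention, but a purely numeric NMTOKEN such as "1" does not.
fits = is_html_style_fragment("m1")
numeric_fits = is_html_style_fragment("1")
```

An id such as '1' is a legal NMTOKEN but fails this check, which is one reason an XLIFF-specific fragment syntax may be needed.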


--- How to find the identifier

Assuming we have resolved the identifier value issues, we still need to find the identifier among the elements in the unit. There
are currently no rules that force a specific name for the attribute/element that holds the ID value. It's likely to be an attribute
named id, but nothing prevents modules and extensions from using something else.

A possible solution here would be to enforce the use of an attribute named id. So even the core, unaware of extension namespaces,
could search properly for the IDs.
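
As a rough sketch of what such a rule would allow, the following Python resolves a fragment-only ref inside a unit by looking only at an unqualified id attribute. It assumes the "attribute named id" rule is in force and that the reference has the "#value" form. The sample unit and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample unit: an annotation pointing to a note in the same unit.
UNIT_XML = """
<unit xmlns="urn:oasis:names:tc:xliff:document:2.0" id="1">
 <notes><note id="n1">Check this term</note></notes>
 <segment>
  <source><mrk id="m1" type="comment" ref="#n1">text</mrk></source>
 </segment>
</unit>
"""

def resolve_in_unit(unit, ref):
    """Return the element in `unit` whose id matches a '#fragment' ref, or None."""
    if not ref.startswith("#"):
        return None  # not a fragment-only reference; outside this sketch
    fragment = ref[1:]
    # Assumption: every ID lives in an unqualified attribute named "id",
    # whatever namespace the element itself belongs to.
    for elem in unit.iter():
        if elem.get("id") == fragment:
            return elem
    return None

unit = ET.fromstring(UNIT_XML)
target = resolve_in_unit(unit, "#n1")
```

Without the naming rule, the loop would have no way of knowing which attribute to test on module or extension elements.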


--- How to know if the resource is used elsewhere

The part "... and no other element has a reference to the referenced element" seems rather difficult to execute. The referenced
element may be referenced by anything (module, extension) using very different means. I don't see how a core processor could detect
such references.

Here again we may have to limit what an extension could do. But this seems very restrictive. Maybe this clause is simply too
complicated to implement. But this would make the whole processing requirement just about impossible to implement too.
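
The difficulty can be shown with a small Python sketch. It assumes a core processor that only knows the core ref attribute. The x:info element and its x:link attribute are an invented extension that points to the same note by other means.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit: two annotations reference note n1 through the core ref
# attribute, and a made-up extension element references it through x:link.
UNIT_XML = """
<unit xmlns="urn:oasis:names:tc:xliff:document:2.0"
 xmlns:x="urn:example:ext" id="1">
 <notes><note id="n1">Check this term</note></notes>
 <x:info x:link="n1"/>
 <segment>
  <source><mrk id="m1" type="comment" ref="#n1">text</mrk> and
  <mrk id="m2" type="comment" ref="#n1">more</mrk></source>
 </segment>
</unit>
"""

# Assumption: the only reference attribute this core processor knows about.
KNOWN_REF_ATTRS = ("ref",)

def other_referrers(unit, removed, fragment):
    """List elements other than `removed` that point to '#fragment'
    through one of the known reference attributes."""
    wanted = "#" + fragment
    return [e for e in unit.iter()
            if e is not removed
            and any(e.get(a) == wanted for a in KNOWN_REF_ATTRS)]

unit = ET.fromstring(UNIT_XML)
m1 = next(e for e in unit.iter() if e.get("id") == "m1")
found = other_referrers(unit, m1, "n1")
found_ids = [e.get("id") for e in found]
```

Here the second annotation is detected, but the extension reference is not. If m2 were absent, the processor would conclude that the note can be removed and silently break the extension.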


--- The case of pointing to the annotation

The processing requirement talks about using the ref attribute of the annotation to point to another element, but, as the discussion
about matches underlined (cspr02 comment 111), we could also have elements pointing to the annotation.

The problem is the same: we have no way to know which attribute/element to look at.

A possible solution for this would be to introduce a core attribute used only in modules/extensions that has the semantics of
pointing to an annotation. For example, imagining the match module would use a pointer to the annotation rather than the reverse:

<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
 xmlns:xlf="urn:oasis:names:tc:xliff:document:2.0">
 ...
<unit id='1'>
 <segment>
  <source><mrk id='m1' type='m:match'>text</mrk></source>
 </segment>
 <m:matches xmlns:m='urn:oasis:names:tc:xliff:matches:2.0'>
  <m:match id='1' xlf:ref='m1'>
   <source>text</source>
   <target>texte</target>
  </m:match>
 </m:matches>
</unit>
 ...
</xliff>
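
A minimal Python sketch of how a core processor could use such an attribute follows. It is only an illustration: the xlf:ref attribute is the hypothetical one proposed above, the document is a cut-down parseable variant of the example with the elided parts left out, and the function name is made up.

```python
import xml.etree.ElementTree as ET

CORE_NS = "urn:oasis:names:tc:xliff:document:2.0"
# Hypothetical core attribute with "points to an annotation" semantics.
POINTER_ATTR = "{%s}ref" % CORE_NS

DOC = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
 xmlns:xlf="urn:oasis:names:tc:xliff:document:2.0">
<unit id='1'>
 <segment>
  <source><mrk id='m1' type='m:match'>text</mrk></source>
 </segment>
 <m:matches xmlns:m='urn:oasis:names:tc:xliff:matches:2.0'>
  <m:match id='1' xlf:ref='m1'>
   <source>text</source>
   <target>texte</target>
  </m:match>
 </m:matches>
</unit>
</xliff>"""

def pointers_to(unit, annotation_id):
    """Return every element in `unit`, in any namespace, whose hypothetical
    core pointer attribute names the given annotation."""
    return [e for e in unit.iter() if e.get(POINTER_ATTR) == annotation_id]

root = ET.fromstring(DOC)
unit = root.find("{%s}unit" % CORE_NS)
hits = pointers_to(unit, "m1")
```

Because the attribute is in the core namespace, the search needs no knowledge of the matches module or of any other extension.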


--- Conclusion

a) I think we need to look closer at the ref value, define some fragment identifier behavior for the XLIFF MIME type, and register
that MIME type. We need to be able to answer: what is the URI for a given XLIFF element?

b) The processing requirement seems extremely difficult to implement. A core processor should still be able to remove annotations,
and that may result in inoperable modules/extensions.

Thoughts?
-yves