Subject: URI in XLIFF2

From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: Thu, 21 Nov 2013 06:43:02 -0700

Hi all,

Here are some thoughts about the XLIFF 2.0 URIs to continue the discussion on that topic.

We have IDs in <file>, <group>, <unit>, <segment>/<ignorable>, <pc>/<ph>/<sc>/<mrk>/<sm>, <note>, and <data>.

- The IDs of <file> must be unique within the document, with the caveat that additional <file> elements can be added to the document
later on. This led us to tentatively say that maybe the file's id should be a UUID.
Note that this is not that easy to implement: for example, I don't think XSLT has a way to create UUIDs. It would have to rely on
a third-party extension for this. That may be the case in other programming languages too.

- The IDs of <group> must be unique within the <file>

- The IDs of <unit> must be unique within the <file>. Those may or may not share the same ID scope as the groups.

- The IDs of <note> must be unique within the <file> (<note> can be at the <file>, <group> or <unit> level). So creating a new note
means using a UUID or knowing all notes' IDs in that given file.

- The IDs for <segment>/<ignorable> must be unique within the <unit>

- The IDs for <pc>/<ph>/<sc>/<mrk>/<sm> must be unique within the <unit> (with the usual source/target caveat). We have a tentative
agreement that those elements could share the <segment>/<ignorable> ID scope (I'll refer to it as "segOrInlineIDs").

- The IDs for <data> must be unique within the <originalData>
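
To make the scopes above concrete, here is a minimal Python sketch that checks some of them on a small document. It only follows the rules as listed in this email (not necessarily what the final specification says), it covers only the file, group, unit, note and segment/ignorable scopes, and the sample document and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "{urn:oasis:names:tc:xliff:document:2.0}"

# Hypothetical sample: the note id 'n1' is used twice in the same <file>,
# while the segment id 's1' is reused in two different units (allowed).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en">
 <file id="f1">
  <notes><note id="n1">file note</note></notes>
  <unit id="u1">
   <notes><note id="n1">unit note</note></notes>
   <segment id="s1"><source>Hello</source></segment>
  </unit>
  <unit id="u2">
   <segment id="s1"><source>World</source></segment>
  </unit>
 </file>
</xliff>"""


def duplicate_ids(scope, tags):
    """Return the ids used more than once by the given elements under `scope`."""
    wanted = {NS + t for t in tags}
    counts = Counter(
        e.get("id") for e in scope.iter()
        if e.tag in wanted and e.get("id") is not None
    )
    return sorted(i for i, n in counts.items() if n > 1)


def check_scopes(root):
    """List (scope id, element kind, duplicated id) for each violated scope."""
    problems = [("document", "file", i) for i in duplicate_ids(root, ["file"])]
    for f in root.iter(NS + "file"):
        # group, unit and note ids are each checked per <file>
        for tag in ("group", "unit", "note"):
            problems += [(f.get("id"), tag, i) for i in duplicate_ids(f, [tag])]
        # segment and ignorable share one scope per <unit>
        for u in f.iter(NS + "unit"):
            dups = duplicate_ids(u, ["segment", "ignorable"])
            problems += [(u.get("id"), "segment/ignorable", i) for i in dups]
    return problems


print(check_scopes(ET.fromstring(SAMPLE)))
```

The inline-code and <data> scopes are left out to keep the sketch short; they would follow the same pattern with <unit> and <originalData> as the scope.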

A few additional constraints:

- Our "segOrInlineIDs" can be duplicated: one in the source, the other in the target. The URI should be able to indicate which one
it points to.

- The Match module brings an additional headache to the <unit>. A match has its own <source>, <target> and <data> elements, so
we'll have to somehow distinguish them from the "main" ones.

- Various modules (and obviously any extension) can use references as well. The Glossary is an example of this. Currently the
definition of id for glossentry does not offer scope information
(http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html#gls_id). We'll have to resolve this somehow.


=== Using levels

A first potential solution is the one David described: using a hierarchy of IDs with a separator.

The separator is a separate question. It simply needs to be a character allowed in a URI but not allowed in an NMTOKEN. #, /, ~,
etc. would work. We just have to pick one. I'll use / throughout this email.
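
A quick way to screen separator candidates is sketched below in Python. It is restricted to ASCII (NMTOKEN also allows many non-ASCII characters), and it assumes the fragment grammar of RFC 3986; the function name is made up for this example.

```python
import string

# ASCII characters that may appear in an NMTOKEN (XML NameChar, ASCII subset).
NMTOKEN_ASCII = set(string.ascii_letters + string.digits + ".-_:")

# ASCII characters allowed unescaped in a URI fragment per RFC 3986:
# unreserved, sub-delims, and ":" "@" "/" "?".
FRAGMENT_ASCII = set(
    string.ascii_letters + string.digits + "-._~" + "!$&'()*+,;=" + ":@/?"
)


def usable_separator(ch):
    """True if `ch` can appear in a fragment but never inside an NMTOKEN id."""
    return ch in FRAGMENT_ASCII and ch not in NMTOKEN_ASCII


for candidate in "/~=.:#":
    print(candidate, usable_separator(candidate))
```

Under these assumptions '/', '~' and '=' pass, while '.' and ':' fail because they can occur in an NMTOKEN. A literal '#' also fails, because RFC 3986 does not allow it unescaped inside a fragment, so it may be a less convenient choice than the others.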

For example: #fileID/groupOrUnitID/segOrInlineID

A first, and I think show-stopper, issue: the notes can appear at different levels, so it's not really possible to use them in such
a hierarchy.


=== Using prefixes

Another potential solution could be to use prefixes along with a more flexible hierarchy.

For example: most IDs in the fragment would be represented like this: <prefixLetter>=IdValue

f for files
g for groups
u for units
n for notes
d for data
non-prefixed value would be the segOrInlineIDs

For example:

#f=550e8400-e29b-41d4-a716-446655440000/g=id1
-> the group id='id1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/u=id1
-> the unit id='id1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/n=id1
-> the note id='id1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/u=u1/s1
-> the segment id='s1' in the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/u=u1/m1
-> the annotation id='m1' in the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/u=u1/1
-> the code id='1' in the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/u=u1/d=d1
-> the data id='d1' in the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

We could maybe resolve the source/target issue with a final '~s' or '~t' after the segOrInlineID value. The ~ would make it possible to
distinguish it from the ID value. For example:

#f=550e8400-e29b-41d4-a716-446655440000/u=u1/m1~s
-> the annotation id='m1' in the source content of the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

#f=550e8400-e29b-41d4-a716-446655440000/u=u1/s1~t
-> the target element in the segment id='s1' of the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'

You could even imagine this working for the unit:

#f=550e8400-e29b-41d4-a716-446655440000/u=u1~t
-> the whole target content of the unit id='u1' anywhere in the file id='550e8400-e29b-41d4-a716-446655440000'. Not sure if it's
really useful or even needed, because it doesn't always correspond to a single physical element. Just a thought.

So far it seems it would work.
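
As a rough illustration of the parsing involved, here is a minimal Python sketch for the absolute form described above. It assumes '/' as the separator, the prefix letters f, g, u, n and d, and an optional trailing '~s' or '~t'. The function name and the returned structure are made up for this example and are not part of any proposal.

```python
# Prefix letters from the list above; a part without a prefix is a segOrInlineID.
PREFIXES = {"f": "file", "g": "group", "u": "unit", "n": "note", "d": "data"}


def parse_fragment(fragment):
    """Split a fragment into ([(kind, id), ...], side), side being None, 'source' or 'target'."""
    if not fragment.startswith("#"):
        raise ValueError("fragment must start with '#'")
    body = fragment[1:]
    side = None
    if body.endswith("~s") or body.endswith("~t"):
        side = "source" if body[-1] == "s" else "target"
        body = body[:-2]
    steps = []
    for part in body.split("/"):
        prefix, sep, value = part.partition("=")
        if sep:
            # Prefixed part such as 'u=u1'
            if prefix not in PREFIXES or not value:
                raise ValueError("bad part: %r" % part)
            steps.append((PREFIXES[prefix], value))
        else:
            # Non-prefixed part: segment, ignorable, inline code or annotation id
            if not part:
                raise ValueError("empty part in %r" % fragment)
            steps.append(("segOrInline", part))
    return steps, side


print(parse_fragment("#f=550e8400-e29b-41d4-a716-446655440000/u=u1/m1~s"))
```

Since an NMTOKEN cannot contain '/', '=' or '~', splitting on those characters is unambiguous here.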

--- Now for relative fragment:

We could imply any missing part by the location of the reference attribute.

#n=n123
-> the note id='n123' in the current <file>

#u=234/10~s
-> the source inline code or segment/ignorable id='10' in the unit id='234' in the current file.

We would have invalid values for the cases where the position of the reference attribute does not provide the proper context. For
example: #10~s used at a group level would not be valid as there is no unit context.

I think it would be relatively easy to implement for most applications. But the solution requires relatively complex parsing of
the fragment. Bryan will have to see if XSLT can support such a mechanism.
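
The resolution of relative fragments could be sketched as follows in Python. This assumes the location of the reference attribute is available as a small dictionary holding the enclosing file id and, when there is one, the enclosing unit id. It also assumes that non-prefixed IDs and d= parts need a unit. The names are hypothetical.

```python
def resolve(fragment, context):
    """Expand a relative fragment to its absolute form, using the ids in `context`.

    `context` may hold 'file' and 'unit': the ids of the elements that enclose
    the attribute carrying the reference.
    """
    if not fragment.startswith("#"):
        raise ValueError("fragment must start with '#'")
    # Keep any trailing '~s' / '~t' aside and put it back at the end.
    body, tilde, side = fragment[1:].partition("~")
    parts = body.split("/")
    file_parts = [p for p in parts if p.startswith("f=")]
    rest = [p for p in parts if not p.startswith("f=")]
    if not file_parts:
        if "file" not in context:
            raise ValueError("no file context for %r" % fragment)
        file_parts = ["f=" + context["file"]]
    needs_unit = any("=" not in p or p.startswith("d=") for p in rest)
    has_unit = any(p.startswith("u=") for p in rest)
    if needs_unit and not has_unit:
        if "unit" not in context:
            raise ValueError("no unit context for %r" % fragment)
        rest.insert(0, "u=" + context["unit"])
    return "#" + "/".join(file_parts + rest) + tilde + side


print(resolve("#n=n123", {"file": "f1"}))
print(resolve("#u=234/10~s", {"file": "f1"}))
print(resolve("#10~s", {"file": "f1", "unit": "u7"}))
```

With only a file in the context (for example at the group level), resolve("#10~s", ...) raises a ValueError, which matches the invalid case described above.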


--- Modules

Now there is the issue of the modules.

A possible option is to require two things:

- a) any ID in a module must be set using the attribute id in the module/extension namespace (An even simpler alternative would be to
require xml:id)

- b) any ID value in a module to be a UUID

We could then use a special prefix for it:

#f=550e8400-e29b-41d4-a716-446655440000/m=47ab0064-d9d4-4ef9-9805-c3ad88f0bae6
-> the module/extended element id='47ab0064-d9d4-4ef9-9805-c3ad88f0bae6' anywhere in the file
id=550e8400-e29b-41d4-a716-446655440000

This guarantees that even the core can find such an ID and that it can be referenced uniquely within each file.

That's not pretty and it comes with the issue of generating UUIDs in some programming languages. But I can't think of another
solution so far.
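
For languages that do ship a UUID generator, the cost is small. The Python sketch below uses the standard uuid module; the two helper names and the 'm' prefix usage are only illustrations of the idea above.

```python
import uuid


def new_module_id():
    """Return a random (version 4) UUID as a string, usable as an id value."""
    return str(uuid.uuid4())


def module_fragment(file_id, module_id):
    """Build a fragment of the form #f=<fileID>/m=<moduleID>."""
    return "#f=%s/m=%s" % (file_id, module_id)


file_id = new_module_id()
entry_id = new_module_id()
print(module_fragment(file_id, entry_id))
```

One thing to keep in mind: a UUID often starts with a digit, which is fine for an NMTOKEN but not for xml:id, whose values must be NCNames. So the xml:id alternative would likely need a letter or underscore in front of the UUID.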

--- Matches

The solution above still does not handle using <source>/<target>/<data> in matches. Technically you could have the same IDs used in
the match elements and in the unit where these matches are.


Still thinking...
Cheers,
-ys
