
Subject: Comments on Fragment Identification
From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: Sun, 1 Dec 2013 05:44:48 -0700
Hi all,

Two things in this email:

1. A few comments on the new section David created
2. Another solution



1) ===== Comments on the new proposed section "3 Fragment Identification":


-- (minor) Maybe the title could be more specific: "URI Fragment identifiers"


-- Maybe we could start the section with something other than:

[[
XLIFF Module fragment identification prefixes are specified in the resective modules.
]]

Maybe something like:

[[
Because XLIFF documents do not follow the usual behavior of XML documents when it comes to element identifiers, this specification
defines how applications must interpret the fragment identifiers in URIs pointing to XLIFF documents.
]]

-- We need a MIME type for XLIFF.
I believe David started the request, but I'm not sure. David?


-- I'm not sure I understood everything correctly in this section because there are no examples to illustrate the definitions.


-- For internal references, if I understand correctly the statement "Only referencing within the lowermost of the enclosing <unit>
or <file> is allowed":
This means the proposal allows only for very limited internal references, for example one cannot point from a mrk-ref to an element
outside the unit where that mrk is located.
If that's true, it's a show-stopper to me: it drastically reduces what you can do with annotations, for example.


-- I'm not sure what we define for the internal reference:

One case starts with '#', the other starts with a module prefix (which all seem to start with '/').

So, as far as I can tell, we would have: ref="##id" and ref="#/ref#id" (since I assume "the fragment identifying string" means the
part after the # in a URI). Is that correct?

If my assumption is not correct, then surely the only other possible interpretation is that the syntax is ref="#id" and
ref="/ref#id1". But that can't be right: the second case would be interpreted as a fragment identifier equal to "id1".
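The last point can be checked with a generic URI parser. The following is a minimal sketch using java.net.URI; the class
and method names (RefFragmentCheck, fragmentOf) are hypothetical and only serve the illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class RefFragmentCheck {
    // Returns the fragment that a generic URI parser extracts from a reference string.
    static String fragmentOf(String ref) throws URISyntaxException {
        return new URI(ref).getFragment();
    }

    public static void main(String[] args) throws URISyntaxException {
        // "/ref" is parsed as the path, so only "id1" is left as the fragment.
        System.out.println(fragmentOf("/ref#id1")); // prints id1
        // A reference made only of a fragment parses as expected.
        System.out.println(fragmentOf("#id"));      // prints id
    }
}
```

In other words, a generic parser treats the "/ref" part as a path, not as part of the fragment identifier.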


-- I've noticed that the proposal says: "IRI of the referenced document with the xlf extension". We should not limit the document to
xlf extensions. That's the recommended one, but one can use anything.


-- I don't think using # as a separator is wise.
It is already used to separate the fragment identifier from the rest of the URI.

It also seems to cause problems: if I do the following in Java:

assertEquals("file1#unit1", new URI("http://www.test.net/file.xlf#file1#unit1").getFragment());

I get the following exception: java.net.URISyntaxException: Illegal character in fragment at index 34:
http://www.test.net/file.xlf#file1#unit1

So # as separator inside the fragment looks really bad to me.

I think / would work better as it's a traditional separator for parts/path.
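As a rough comparison of the two separators, here is a small sketch using java.net.URI. The class and method names
(SeparatorCheck, parses) are hypothetical, and the URIs are just the test values used above.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SeparatorCheck {
    // Returns true if java.net.URI accepts the string, false if it throws URISyntaxException.
    static boolean parses(String uri) {
        try {
            new URI(uri);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        String withHash = "http://www.test.net/file.xlf#file1#unit1";
        String withSlash = "http://www.test.net/file.xlf#file1/unit1";

        System.out.println(parses(withHash));  // prints false
        System.out.println(parses(withSlash)); // prints true
        System.out.println(new URI(withSlash).getFragment()); // prints file1/unit1
    }
}
```

With '/' the whole "file1/unit1" string comes back as the fragment, so the application can split it itself.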


-- It seems that for modules/extensions the ID can be set in an id attribute, or in a name attribute.
a) Why allow two attributes?
b) Also, I have not seen any new PR that requires all modules/extensions to use id (or name) attributes for their ID values.
c) And I have not seen any new PR that requires the id values of extensions to use a character set compatible with a URI fragment
(e.g. NMTOKEN)


-- The file id attribute is to be unique per document. But that doesn't cover the use case of bundling several <file> into a single
document after extraction.
When a clash occurs: can the tool modify those file IDs? (no PR prevents it)
Or should we use UUID values for file id?
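To make the bundling case concrete, here is a minimal sketch of what a tool might do if it were allowed to rewrite clashing
file ids and if UUIDs were acceptable values. Both are open questions, not something the draft states. The class and method
names (FileIdBundler, resolveClashes) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

class FileIdBundler {
    // Takes the <file> ids in bundling order; keeps the first use of each id
    // and replaces any later clash with a random UUID string.
    static List<String> resolveClashes(List<String> fileIds) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String id : fileIds) {
            String candidate = id;
            // Set.add returns false when the id is already taken.
            while (!seen.add(candidate)) {
                candidate = UUID.randomUUID().toString();
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        // The third id clashes with the first one and gets replaced.
        System.out.println(resolveClashes(List.of("f1", "f2", "f1")));
    }
}
```

Note that any reference to a rewritten id would have to be updated as well, which is part of why the question matters.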


-- By "If the fragment to be identified is within an XLIFF Module's data," I assume you mean "... within a module element" (not sure
what "module's data" is)


-- There seems to be no definition of the rules to build a module/extension prefix. The text says that module prefixes are defined
in each module specification, but we need more than that: identification must work also for custom extensions since many may become
modules.
Once again: modules and extensions should be treated equally from the core viewpoint.


-- The proposal has no provision for distinguishing source from target for the inline elements.
We can point to an inline element with an id='abc', but we don't know if it's the one in source or target.


-- The distinction between internal and external is very strange.

It means you can have this: "myFile.xlf#id1" and "#id1" and the two "id1" point to different places.

It also means you have different valid identifiers depending on their internal/external status. For example "myFile.xlf#f1#g1#i1" is
valid but "#f1#g1#i1" is not (if I understand correctly).

That makes things quite confusing, and I've also never seen any fragment identifier making such a distinction.
I think it's important to keep the same syntax and semantics for all the fragments, whether or not they are part of a full or
relative URI.
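For comparison, generic URI resolution makes no such distinction. The sketch below uses java.net.URI; the base URI is a
made-up location for myFile.xlf, and the class and method names (SameDocumentResolution, resolve) are hypothetical.

```java
import java.net.URI;

class SameDocumentResolution {
    // Resolves a (possibly relative) reference against the URI of the document that contains it.
    static URI resolve(String base, String ref) {
        return URI.create(base).resolve(ref);
    }

    public static void main(String[] args) {
        // Assumed location of the document holding both references.
        String base = "http://www.test.net/myFile.xlf";

        System.out.println(resolve(base, "#id1"));           // prints http://www.test.net/myFile.xlf#id1
        System.out.println(resolve(base, "myFile.xlf#id1")); // prints http://www.test.net/myFile.xlf#id1
    }
}
```

Both references end up as the same absolute URI, so a reader would normally expect them to identify the same fragment.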


-- The 3 levels of IDs force unnatural scopes for IDs:

For example:
- The <note> elements have different scope if they are in or out of a unit;
- <data>'s id is in the same scope as inline codes/markers;
- We force <group> and <unit> to share the same Id space.

All this is very restrictive and will cause a lot of overhead in implementations where the object model of the extracted document
may be very different from XLIFF, so accessing existing IDs to create new objects can work very differently than in XLIFF.

We have to remember that XLIFF is not a processing format, just an exchange one.



2) ===== Other solution

I started to propose a different solution a while back in this email:
https://lists.oasis-open.org/archives/xliff/201311/msg00131.html

I don't like it very much, but it seems better than the proposal currently in the draft.

I'm not sure about the source/target flag and would like to hear feedback on that.

I'm not sure how to deal with modules/extensions differently than what's outlined in the email.

I'm still not sure what the right solution is for the <file> ids: should they be a UUID or not.

I think we should express whatever fragment identifier syntax we choose in a clear ABNF-like notation rather than statements.

I think we should try to offer a regular expression to validate whatever we come up with.



Below is an attempt at a more formal definition.
Note that it doesn't have provision for modules or source/target at this point.

fragId =  withFile / withGroupOrUnitOrNote / inlineOrDataPart

withFile = filePart 1*("/" withGroupOrUnitOrNote)

filePart = "f=" fileId

fileId = value of the id attribute of one of the <file> elements in the document

withGroupOrUnitOrNote = notePart / groupPart / withUnit 

notePart = "n=" noteId

noteId = value of the id attribute of one of the <note> elements in the parentFile

parentFile = the <file> element identified by filePart when available, otherwise the <file> element where the fragment identifier is
used

groupPart = "g=" groupId

groupId = value of the id attribute of one of the <group> elements in the parentFile

withUnit = unitPart 1*("/" inlineOrDataPart)

unitPart = "u=" unitId

unitId = value of the id attribute of one of the <unit> elements in the parentFile

inlineOrDataPart = inlineId / dataPart

inlineId = value of the id attribute of one of the <segment>, <ignorable>, <mrk>, <sm>, <pc>, <sc>, <ec>, or <ph> elements in the
parentUnit

parentUnit = the <unit> element identified by unitPart when available, otherwise the <unit> element where the fragment identifier is
used.

dataPart = "d=" dataId

dataId = value of the id attribute of one of the <data> elements in the parentUnit
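
As a starting point for a validating regular expression, here is a minimal Java sketch that transcribes the rules above
literally. It only checks syntax, not whether the ids exist in the document. The class name (FragIdValidator) and the
character set allowed for id values ([A-Za-z0-9._-]) are assumptions made for the illustration; the real set would depend on
what is decided for id values (e.g. NMTOKEN).

```java
import java.util.regex.Pattern;

class FragIdValidator {
    // Assumed character set for id values; not defined by the grammar itself.
    private static final String ID = "[A-Za-z0-9._-]+";

    // inlineOrDataPart = inlineId / dataPart
    private static final String INLINE_OR_DATA = "(?:d=" + ID + "|" + ID + ")";

    // withUnit = unitPart 1*("/" inlineOrDataPart)
    private static final String WITH_UNIT = "u=" + ID + "(?:/" + INLINE_OR_DATA + ")+";

    // withGroupOrUnitOrNote = notePart / groupPart / withUnit
    private static final String GROUP_UNIT_NOTE = "(?:n=" + ID + "|g=" + ID + "|" + WITH_UNIT + ")";

    // withFile = filePart 1*("/" withGroupOrUnitOrNote)
    private static final String WITH_FILE = "f=" + ID + "(?:/" + GROUP_UNIT_NOTE + ")+";

    // fragId = withFile / withGroupOrUnitOrNote / inlineOrDataPart
    private static final Pattern FRAG_ID =
        Pattern.compile(WITH_FILE + "|" + GROUP_UNIT_NOTE + "|" + INLINE_OR_DATA);

    // Returns true if the whole fragment string matches the fragId rule.
    static boolean isValid(String fragment) {
        return FRAG_ID.matcher(fragment).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("f=f1/u=u1/d=d1")); // prints true
        System.out.println(isValid("u=u1/abc"));       // prints true
        System.out.println(isValid("f=f1#u=u1"));      // prints false
    }
}
```

Read literally, the 1*(...) repetitions require at least one trailing part, so "f=f1" or "u=u1" alone is rejected by this
sketch. If that is not the intent, the "+" quantifiers would need to be relaxed.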


There are examples of the fragments in the initial email:
https://lists.oasis-open.org/archives/xliff/201311/msg00131.html


cheers,
-yves