Hi Fredrik, all,
I would be against promoting the idea that units could be used as the way to represent segmentation.
That would open the door to the same mess we have in 1.2 where, despite having a standard way to represent segmentation, some tools would use another way.
Since we have modules at the segment level, we have to define PRs for the tools that do not support them. Therefore,
to me, there is no reason to forbid extension there. Segment-level properties are by nature tied to the segmentation: if you change it, it makes sense that those properties don’t apply anymore.
If the properties relate to the content of the segment, then a best practice may be to promote the use of
<mrk> attached to an element that lives at the unit level: that data is safe during re-segmentation.
Looking at the current draft and some proposed modules, more and more data is attached to or placed in the <segment> and <ignorable> elements. I think this is a bad design in the face of
re-segmentation. Any data placed as a descendant of those nodes MUST have processing requirements stating how a tool should handle it if it performs re-segmentation. This obviously extends to attributes on <source> and <target> as well.
Re-segmentation is an arbitrary sequence of joins and splits of these nodes. I personally feel that it will be close to impossible to specify processing requirements for anything
that is not fully defined in the core specification for these nodes. Modules that do not allow customization could have module-specific PRs defined in the core, but that seems to overload the core with module details.
To limit the scope let’s consider <metaHolder> in <segment> and a generic tool without knowledge of the data stored in the <metaHolder>.
If we split the segment I see the following possible processing requirements:
a. Remove the meta holder
b. Keep the meta holder on the left hand <segment>
c. Keep the meta holder on the right hand <segment>
d. Copy the meta holder so it exists on both <segment>s
e. Forbid re-segmentation
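
To make the split options concrete, here is a rough Python sketch. It assumes a purely hypothetical model in which a segment is a (text, meta) pair and the <metaHolder> content is a plain dict; none of these names come from the draft, and the rule letters simply mirror the list above.

```python
from copy import deepcopy

# Hypothetical model: a segment is (text, meta), where meta stands in for
# the <metaHolder> content as a plain dict, or None when there is none.

def split_segment(text, meta, pos, rule):
    """Split one segment at character offset pos under split rule a-e."""
    left, right = text[:pos], text[pos:]
    if rule == "a":   # remove the meta holder
        return [(left, None), (right, None)]
    if rule == "b":   # keep it on the left-hand segment
        return [(left, meta), (right, None)]
    if rule == "c":   # keep it on the right-hand segment
        return [(left, None), (right, meta)]
    if rule == "d":   # copy it so it exists on both segments
        return [(left, deepcopy(meta)), (right, deepcopy(meta))]
    if rule == "e":   # forbid re-segmentation
        raise ValueError("re-segmentation is forbidden for this segment")
    raise KeyError(f"unknown split rule: {rule!r}")
```

Note that a generic tool can apply any of these mechanically, but it has no way to know which one preserves the meaning of the data.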
If we join two adjacent segments both having <metaHolder>:
a. Remove both meta holders
b. Keep the meta holder from the left hand <segment>
c. Keep the meta holder from the right hand <segment>
d. Merge the two meta holders and recursively resolve key conflicts:
    a. Keep the left hand side value in the resultant metaHolder
    b. Keep the right hand side value in the resultant metaHolder
    c. Duplicate the keys and keep both values
    d. Concatenate the values from both sides into one key/value
e. Forbid re-segmentation
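
The join options, including the conflict handling for the merge case, could be sketched along the same lines. Again this is only an illustration under assumptions: segments are hypothetical (text, meta) pairs, the <metaHolder> content is modeled as a (possibly nested) dict, and "duplicate the keys" is approximated by storing both values in a list, since a dict cannot hold the same key twice.

```python
def merge_meta(left, right, conflict):
    """Merge two metadata dicts; `conflict` selects how clashing keys resolve."""
    merged = dict(left)
    for key, r_val in right.items():
        if key not in merged:
            merged[key] = r_val
            continue
        l_val = merged[key]
        if isinstance(l_val, dict) and isinstance(r_val, dict):
            merged[key] = merge_meta(l_val, r_val, conflict)  # recurse into nested data
        elif conflict == "left":      # keep the left hand side value
            pass
        elif conflict == "right":     # keep the right hand side value
            merged[key] = r_val
        elif conflict == "both":      # keep both values (list stands in for duplicate keys)
            merged[key] = [l_val, r_val]
        elif conflict == "concat":    # concatenate into one key/value
            merged[key] = f"{l_val}{r_val}"
        else:
            raise ValueError(f"unknown conflict mode: {conflict!r}")
    return merged

def join_segments(left, right, rule, conflict="left"):
    """Join two adjacent (text, meta) segments under join rule a-e."""
    (l_text, l_meta), (r_text, r_meta) = left, right
    text = l_text + r_text
    if rule == "a":   # remove both meta holders
        return (text, None)
    if rule == "b":   # keep the left-hand one
        return (text, l_meta)
    if rule == "c":   # keep the right-hand one
        return (text, r_meta)
    if rule == "d":   # merge and resolve key conflicts
        return (text, merge_meta(l_meta, r_meta, conflict))
    if rule == "e":   # forbid re-segmentation
        raise ValueError("re-segmentation is forbidden for these segments")
    raise KeyError(f"unknown join rule: {rule!r}")
```

Even in this toy form the merge case needs a second parameter, which is the kind of extra behavioral switch the core would have to carry.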
For the core we need to choose ONE rule for split and ONE rule for join. It is unlikely that the chosen rule will be good for anything but a subset of the use cases. Or we could
add behavioral attributes and more complex PRs to <metaHolder>.
If we extend this to custom schema extensions, the merge cases likely become unavailable, as an XML tree merge of unknown content would be likely to cause schema violations. To allow
selection of the PRs to apply, we would need to add a mechanism to the core that an extension can use to select how its data should be treated.
The end result is that any application relying on extension data here, or on modules with third-party customization, would have to cope with the data being lost or inaccurate. Or we
severely limit the ability of processors to re-segment the contents of <unit>. Or, finally, we end up with a much more complex core. None of these seems good to me.
My proposal is to not allow third-party customizable data (either as full extensions or customizable modules) on these elements, and to try to limit the use of these elements to hold
module elements and attributes.
If XLIFF extraction needs to have metadata or module information on what the initial extractor considers a segment, it could put what it considers THE segment into
a <unit> and attach the metadata at that level. If subsequent tools perform re-segmentation, the data at the <unit> level would still apply unchanged to the sum of the segments in the <unit>.
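
As a small illustration of why the unit level is safe, here is a Python sketch under the assumption that a unit is modeled as a dict holding unit-level metadata plus a list of segment strings (a hypothetical model, not the XLIFF schema). Re-segmentation only rearranges the segment list, so no split or join rule is needed for the metadata.

```python
def resegment(unit, boundaries):
    """Re-split a unit's text at the given character offsets.

    `unit` is a hypothetical dict: {"meta": {...}, "segments": [str, ...]}.
    The unit-level metadata is carried over untouched.
    """
    text = "".join(unit["segments"])
    cuts = [0, *sorted(boundaries), len(text)]
    segments = [text[a:b] for a, b in zip(cuts, cuts[1:])]
    return {"meta": unit["meta"], "segments": segments}

unit = {"meta": {"origin": "extractor"}, "segments": ["One. Two. ", "Three."]}
finer = resegment(unit, [5, 10])
```

Here `finer` holds three segments instead of two, while the metadata still describes the same total content.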