Subject: Data associated with <segment>
Looking at the current draft and some proposed modules, more and more data is being attached to or placed in the <segment> and <ignorable> elements. I think this is a bad design in the face of re-segmentation. Any data placed as a descendant of those nodes MUST have processing requirements specifying how a tool should handle it when the tool performs re-segmentation. This obviously extends to attributes on <source> and <target> as well.
Re-segmentation is an arbitrary sequence of joins and splits of these nodes. I personally feel that it will be close to impossible to specify processing requirements for anything on these nodes that is not fully defined in the core specification. Modules that do not allow customization could have module-specific PRs defined in the core, but that seems to overload the core with module details.
To limit the scope, let’s consider a <metaHolder> in a <segment> and a generic tool that has no knowledge of the data stored in the <metaHolder>.
If we split the segment I see the following possible processing requirements:
a. Remove the meta holder
b. Keep the meta holder on the left hand <segment>
c. Keep the meta holder on the right hand <segment>
d. Copy the meta holder so it exists on both <segment>s
e. Forbid re-segmentation
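To make the split options concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the Segment class, the split_segment function and the policy names are hypothetical, and the opaque <metaHolder> is modelled as a plain dict that the tool does not interpret.

```python
from dataclasses import dataclass
from typing import Optional

SPLIT_POLICIES = {"remove", "left", "right", "copy", "forbid"}


@dataclass
class Segment:
    source: str
    # Hypothetical stand-in for an opaque <metaHolder>; None means "absent".
    meta: Optional[dict] = None


def split_segment(seg, pos, policy):
    """Split seg.source at pos and place the metadata according to policy."""
    if policy not in SPLIT_POLICIES:
        raise ValueError(f"unknown split policy: {policy}")
    if policy == "forbid" and seg.meta is not None:
        # Option e: a segment that carries metadata may not be split.
        raise ValueError("segment carries metadata; split is not allowed")

    left = Segment(seg.source[:pos])
    right = Segment(seg.source[pos:])

    if seg.meta is not None:
        if policy == "left":        # option b
            left.meta = seg.meta
        elif policy == "right":     # option c
            right.meta = seg.meta
        elif policy == "copy":      # option d: each side gets its own copy
            left.meta = dict(seg.meta)
            right.meta = dict(seg.meta)
        # "remove" (option a): neither side keeps the metadata
    return left, right
```

Note that the tool has to pick the policy blindly: nothing in the data tells it whether the metadata describes the first half, the second half, or the segment as a whole.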
If we join two adjacent segments both having <metaHolder>:
a. Remove both meta holders
b. Keep the meta holder from the left hand <segment>
c. Keep the meta holder from the right hand <segment>
d. Merge the two meta holders and recursively resolve key conflicts
    a. Keep the left hand side value in the resultant metaHolder
    b. Keep the right hand side value in the resultant metaHolder
    c. Duplicate the keys and keep both values
    d. Concatenate the values from both sides into one key/value
e. Forbid re-segmentation
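The merge option, with its four ways of resolving a key conflict, can be sketched in Python as follows. This is a rough illustration and not a proposed PR: merge_meta and the conflict policy names are hypothetical, the meta holders are modelled as nested dicts, "keep both values" is represented as a list, and the single-space separator used for concatenation is an arbitrary choice.

```python
CONFLICT_POLICIES = {"left", "right", "both", "concat"}


def merge_meta(left, right, on_conflict):
    """Merge two dict-modelled meta holders, resolving key conflicts recursively."""
    if on_conflict not in CONFLICT_POLICIES:
        raise ValueError(f"unknown conflict policy: {on_conflict}")

    merged = dict(left)  # shallow copy, the inputs are not modified
    for key, value in right.items():
        if key not in merged:
            merged[key] = value  # no conflict: take the right hand entry
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            # Both sides hold a nested group under this key: recurse into it.
            merged[key] = merge_meta(merged[key], value, on_conflict)
        elif on_conflict == "left":
            pass  # keep the left hand side value
        elif on_conflict == "right":
            merged[key] = value  # keep the right hand side value
        elif on_conflict == "both":
            merged[key] = [merged[key], value]  # keep both values under the key
        else:  # "concat"
            merged[key] = f"{merged[key]} {value}"  # one concatenated value
    return merged
```

Even in this toy model the four policies give four different results for the same input, which is the point: a generic tool has no way of knowing which one the producer of the data expected.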
For the core we need to choose ONE rule for split and ONE rule for join. It is unlikely that the chosen rule will suit more than a subset of the use cases. Alternatively, we could add behavioral attributes and more complex PRs to <metaHolder>.
If we extend this to custom schema extensions, the merge options likely become unavailable, since an XML tree merge of unknown content would be likely to cause schema violations. To allow a choice of which PRs apply, we would need to add a mechanism to the core that extensions can use to select how their data should be treated.
The end result is that any application relying on extension data here, or on modules with third-party customization, would have to cope with the data being lost or inaccurate. Or we severely limit the ability of processors to re-segment the contents of <unit>. Or, finally, we end up with a much more complex core. None of these seems good to me.
My proposal is to not allow third-party customizable data (either as full extensions or as customizable modules) on these elements, and to try to limit the use of these elements for holding module elements and attributes.
If an XLIFF extractor needs to attach metadata or module information to what it considers a segment, it could put what it considers THE segment into a <unit> and attach the metadata at that level. If subsequent tools perform re-segmentation, the data at the <unit> level would still apply unchanged to the sum of the segments in the <unit>.
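As a small illustration of the unit-level approach, the Python sketch below splits a segment inside a <unit> and checks that the metadata attached to the <unit> is untouched. The XML is a simplified assumption (no namespaces, and the <meta type="..."> child is made up for the example), and split_first_segment is a hypothetical helper, not part of any tool.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free unit; the content of <metaHolder> is illustrative only.
UNIT_XML = """<unit id="u1">
  <metaHolder><meta type="origin">paragraph 3</meta></metaHolder>
  <segment><source>First sentence. Second sentence.</source></segment>
</unit>"""


def split_first_segment(unit, pos):
    """Split the first <segment> of unit at character offset pos of its source."""
    seg = unit.find("segment")
    source = seg.find("source")
    text = source.text
    source.text = text[:pos]

    # Create the right hand segment and insert it directly after the original.
    new_seg = ET.Element("segment")
    ET.SubElement(new_seg, "source").text = text[pos:]
    unit.insert(list(unit).index(seg) + 1, new_seg)


unit = ET.fromstring(UNIT_XML)
meta_before = ET.tostring(unit.find("metaHolder"))
split_first_segment(unit, 16)
meta_after = ET.tostring(unit.find("metaHolder"))

# The unit-level metadata is byte-for-byte the same after re-segmentation.
assert meta_before == meta_after
```

The split never has to look at the metadata, so no split or join rule is needed for it.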