Subject: Sub-flows example, etc.
Hello everyone,

I'm CCing the TC list because the examples are whole documents, which may interest them. But this is mostly for the in-line markup SC.

Based on our last SC call, I've tried to work a bit on the sub-flow mechanism. It's now implemented in the latest build of the experimental extractor. (See http://www.opentag.com/okapi/wiki/index.php?title=Rainbow_TKit_-_XLIFF_2.0 for details.)

I've also attached a set of outputs for an example file. The example file is an HTML document that:
- has a character invalid in XML to try out <cp>.
- has a few title and alt attributes to try out sub-flows.
- has been segmented and has some formatting codes overlapping segments to try <sc>/<ec> vs <pc>.
- has also been pre-translated with Google Translate so we have some <matches> entries.

I've generated the output using the three inline code storage forms:
- example.html_1.xlf = no original data
- example.html_2.xlf = original data are within the inline codes
- example.html_3.xlf = original data are outside the content

Note that the generated documents don't validate against the latest schema, since they have a few elements that the TC has not discussed yet (like <originalData>) that are related to the inline markup but reside outside it. All this is obviously just experimental.

A few notes:

--- We may have sub-flow information that is out of context. For example, we have several <match> elements that have sub-flows. In the example they come from an MT translation of the source, so the sub-flows are in-context; but in other use cases they may come from anything (TMs, alignment, etc.) and therefore the sub-flow information may be irrelevant (i.e. point to non-existing units). Should we keep it or not?

--- We started to talk about having a standard way of dealing with the skeleton of the codes that have sub-flows. But while coding this implementation I realized this may not be a great idea.
Different tools may have very different ways to do this, and forcing a specific way could significantly affect their overall extraction mechanism. Okapi's filters are a good example of that: you'll see that the <b> code in unit "tu2" has a placeholder that refers to the unit where the title text is ("tu3"). But in the unit "tu5" there is no direct reference to the two sub-flows in the code's skeleton. That's because we group things together and have intermediary levels in some cases. It allows more flexibility in how we access the extracted data in our model. I'm sure other tools would have similar issues working with one mandatory way of coding references.

That's all for now.

Have a good first weekend of Fall (or Spring for a few),
-yves