© 2000 Peter P. Jones, Wrox Press Ltd. All rights reserved. 2000/11/06.
The following outlines a proposal for storing metadata about a Topic Map within the topic map itself. Although the mechanism can be extended to support whatever metadata is thought fit, this proposal is specifically concerned with two sub-proposals:
1) A way to represent the information of an XML Namespace for a topic map document;
2) A way to represent the changes made to the content of an XML Topic Map for the purposes of avoiding ID value clashes in packaging topic maps within an xtmdoc topic map interchange document.
The second of these is presented as an alternative to the mechanism proposed by Kal Ahmed of Ontopia in the XTMDoc Processing section of this document. Ahmed's proposal has a number of key merits (the list is not necessarily exhaustive):
a) Gathering the required information for packaging is fast.
b) The mechanism for altering the document to avoid ID clashes is extremely simple, as is the manner in which that change is recorded.
c) Removing the changes is in principle very fast too.
d) None of the above stages requires any processing by a Topic Map processor.
Potential detractors from Ahmed's proposal are:
a) It prefixes the ID values in a Topic Map interchange syntax document in blanket fashion, and there is no way to confirm that the re-mapping succeeded other than by hoping. For example, imagine that my topic map contains sets of ID values like these:
— ...id="myid-NNNNNNN" where N is a numeral.
— ...id="pre-myid-NNNNNNN".
Let us also suppose that Ahmed's method does not check all ID values first, so that it could have chosen the prefix "pre-" arbitrarily. Owing to a system hiccough, the process of prefixing the values fails before it touches the second set. This has two results. First, the prefixing of the first set might now have produced ID clashes with the second set, so the problem of clashes is still there. Secondly, I have no way to tell whether the packaging process failed or at which point it failed (unless I have a log file that I have to go and check manually, assuming the system hiccough didn't ruin the log file writing process too; effectively the log file contains crucial metadata as well).
If Ahmed's method checks all ID values first in order to avoid the above, then why not just re-map the specific values that clash?
b) Packaging using Ahmed's method requires more syntax to be added to the XTM interchange format. Let's suppose that I now want to convert this XTM document to an ISO 13250 one without putting it through a full Topic Map processor, say, using Architectural Forms processing. Where does the information in Ahmed's new syntax go in an ISO 13250 document?
Additionally, the results of producing or processing this syntax are not exposed to the user at either end of the Topic Map interchange process for assessment, re-use, fault-finding and so forth. Re-use of this information might, as I see it, turn out to be critical. Imagine a situation where I want to replicate the contents of several topic maps from separate servers into one big topic map on a single mirror site. Each of the separate topic maps is at a given location (a URL, for the sake of argument) and each has its own set of IDs. When I move these maps to the mirror server I want the latter part of each URL, the part that points at an ID and so locates a place in a topic map, to remain valid, so that a straightforward redirecting script need only rewrite the front part of the URL to point at the new mirror server. Ahmed's proposal seems to suggest that after interchange of the xtmdoc topic map package, the topic map processor would simply give all the topics and associations in the merged map new IDs, discarding the re-mapping information as being needed only for the interchange package. There is also an issue here in that information about the original namespace of the information in the topic map documents is not preserved for roll back of transactions and so forth (sub-proposal (1) above is the part of my proposal that deals with this).
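The failure described under detractor (a) can be made concrete with a small sketch. This is only an illustration of blanket prefixing as I have characterised it, not Ahmed's actual implementation; the function names, the `fail_after` parameter and the sample ID values are all assumptions invented for the example.

```python
def blanket_prefix(ids, prefix, fail_after=None):
    """Prefix every ID in order. If fail_after is given, stop prefixing at
    that position to mimic a process that dies part-way through."""
    result = []
    for position, value in enumerate(ids):
        if fail_after is not None and position >= fail_after:
            result.append(value)  # untouched: the process never reached it
        else:
            result.append(prefix + value)
    return result


def duplicates(ids):
    """Return the sorted ID values that occur more than once."""
    seen, clashes = set(), set()
    for value in ids:
        if value in seen:
            clashes.add(value)
        seen.add(value)
    return sorted(clashes)


# Hypothetical IDs following the two patterns given above.
original_ids = [
    "myid-0000001", "myid-0000002",          # first set
    "pre-myid-0000001", "pre-myid-0000002",  # second set
]

# The arbitrarily chosen prefix "pre-" is applied, but the run is
# interrupted before the second set is touched.
interrupted = blanket_prefix(original_ids, "pre-", fail_after=2)
clashes_after_failure = duplicates(interrupted)

# A run that completes produces no clashes, which is why the failure is
# invisible unless something checks for it.
clashes_after_success = duplicates(blanket_prefix(original_ids, "pre-"))
```

Nothing in the interrupted output itself says that the run was incomplete; the clash only shows up if the ID values are checked again afterwards.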
I propose an alternative to Ahmed's approach to (2). It has merits and it has drawbacks too. I will endeavour to outline what these are without bias so that people can make their own choices.
In order to deal with the issues of namespaces and packaging (particularly ID clashes in that process) I propose that a structure akin to the following (this structure is only a prototype, and I leave it to the AG to suggest significant optimisations where necessary) should be agreed on as the basis for storing certain metadata about the processes involved in the interchange of topic maps.
As you can see in the diagram the 'Namespaces' Public Topic plays a role in an association that has (in this diagram, at least) two topics attached, one for information about the XML Namespace of a document, and another to deal with ID clash re-mapping information. I will deal with the ID clash re-mapping mechanism first.
As I have outlined in mails to the XTM-AG mailing list, the process for generating this information works in a manner like the following (again, I leave it to the committee/implementers to suggest optimisations). Let's imagine that there are three topic maps to be packaged into one xtmdoc interchange document.
Each of the documents is scanned by the packaging processor, and the originating URL and the IDs within the document are noted. When all the documents have been scanned, ID clashes are determined, and for each clash one of the two clashing IDs is re-mapped to a completely new value. (Note that this new value must be ugly enough to make it clear that a value has been re-mapped at that point in the document.) The metadata about the ID clash is then recorded in a topic map structure as shown in the diagram, with Public Topic Identifiers for the relevant topics. The original value of the ID is stored under the "ID Clash Re-mappings Old Values" topic, and the new value is stored under the "ID Clash Re-mappings New Values" topic. A topic is also created that stores the location within the topic map at which the change was made as its identity (see below as to why). This extra topic map information would be written into each relevant topic map document's syntax within the interchange xtmdoc package.
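The scanning and re-mapping steps just described might be sketched as follows. This is a rough illustration under several assumptions of my own: each document is reduced to a list of ID values keyed by its originating URL, the "ugly" replacement value takes the made-up form REMAPPED-ID-CLASH-NNNNNN, and the location of a change is written as the URL plus the original ID as a fragment. The record produced here stands in for the topic map structure in the diagram; it is not that structure's syntax.

```python
import itertools


def package(documents):
    """documents maps an originating URL to the list of ID values found in
    that document. Returns the re-mapped documents and one record per change."""
    counter = itertools.count(1)
    # Scan every document first and note all ID values in use.
    reserved = {value for ids in documents.values() for value in ids}
    used = set()
    remapped, records = {}, []
    for url, ids in documents.items():
        out = []
        for value in ids:
            if value not in used:
                # No clash: the ID is preserved exactly as it was.
                used.add(value)
                out.append(value)
                continue
            # Clash: pick a conspicuous new value that is not in use anywhere.
            new_value = f"REMAPPED-ID-CLASH-{next(counter):06d}"
            while new_value in reserved or new_value in used:
                new_value = f"REMAPPED-ID-CLASH-{next(counter):06d}"
            used.add(new_value)
            out.append(new_value)
            records.append({
                "location": f"{url}#{value}",  # where the change was made
                "old": value,                  # "ID Clash Re-mappings Old Values"
                "new": new_value,              # "ID Clash Re-mappings New Values"
            })
        remapped[url] = out
    return remapped, records


# Three hypothetical topic maps to be packaged together.
documents = {
    "http://example.org/a.xtm": ["t1", "t2"],
    "http://example.org/b.xtm": ["t2", "t3"],
    "http://example.org/c.xtm": ["t1", "t4"],
}
remapped, records = package(documents)
```

In this example only two of the six IDs change; the other four pass through untouched, and the two records carry everything needed to undo or audit the changes.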
Ahmed raised the issue of how these values in separate topics were to be synchronised. I suggested a mechanism that used scopes, but at that time I didn't have a clear solution. I propose now that the location at which the change was made be reified as a topic and used to scope the two values (see T-Loc in the diagram; this reified topic can then also be referred to as being in the XML Namespace for that document, of which more later). This does create a certain amount of semantic overhead, as Ahmed termed it, but the overhead is no longer redundant: it preserves that extra piece of information for re-use in a look-up table for redirecting specific requests directed at the original IDs (or some such means of implementation), or for assessing atomicity of re-mapping transactions (or whatever else you want to do with it). T-Loc can also be typed as being of type 'ID Clash Re-mappings' to cement its relationship to the other constructs here, if needed.
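To show how the location can tie an old value to its new value, and how that pairing might serve as a look-up table for redirection, here is a minimal sketch. The record layout, the example URLs, the mirror address and the function names are all assumptions made for illustration; a real implementation would read this information from the topic map structure itself.

```python
from urllib.parse import urldefrag

# Hypothetical re-mapping records. Each location plays the part of T-Loc:
# it scopes one old value and one new value, keeping the pair in step.
records = [
    {"location": "http://example.org/b.xtm#t2",
     "old": "t2", "new": "REMAPPED-ID-CLASH-000001"},
    {"location": "http://example.org/c.xtm#t1",
     "old": "t1", "new": "REMAPPED-ID-CLASH-000002"},
]


def build_lookup(records):
    """Index the (old, new) pair by the location that scopes it."""
    return {r["location"]: (r["old"], r["new"]) for r in records}


def redirect(url, lookup, mirror_base):
    """Rewrite the front part of a URL to point at the mirror, swapping in
    the re-mapped ID only where the original location was changed."""
    _, fragment = urldefrag(url)
    pair = lookup.get(url)
    target_id = pair[1] if pair else fragment
    return f"{mirror_base}#{target_id}"


lookup = build_lookup(records)
mirror = "http://mirror.example.net/merged.xtm"
changed = redirect("http://example.org/b.xtm#t2", lookup, mirror)
unchanged = redirect("http://example.org/b.xtm#t3", lookup, mirror)
```

Requests for IDs that were never re-mapped need no table entry at all, which is the practical benefit of changing as few values as possible.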
The merits of this proposal as I see them are as follows:
a) It uses no new syntax and is completely in conformance with ISO 13250, so there is no architectural form processing overhead for conversion.
b) All the packaging transactions are recorded clearly alongside the original data and are exposed for future use in whatever fashion the user or implementer desires.
c) The re-mapping information facilitates the rapid interchange of topic map data between topic map servers on the Web by preserving the integrity of most of the ID data without change. Coupled with the information about its original namespace (XML Namespace, perhaps) I see this as easing transactions between servers for topic map data. It preserves more useful information than Ahmed's method by changing less data.
d) The mechanism is extensible in ways supported by existing topic map interchange syntaxes.
e) It is, as I see it, more robust than Ahmed's approach. Failures of packaging are readily detected by simple XML/SGML validation, and the point at which failure occurred could be readily ascertained.
f) How this information is used once it is within the topic map processor is unspecified — but that doesn't matter as long as it conforms to interchange syntax; some processors might make a useful feature out of doing things with it, some might just ignore it.
g) Like Ahmed's approach the packaging and unpackaging process does not require the use of a topic map processor, as long as the topic map structures used for the ID clash re-mapping information do not deviate from an agreed form — but requirement of agreement on form of structure is no different from agreeing to use a new fixed syntax in the interchange document, as in Ahmed's proposal.
The drawbacks are:
a) If the information remains in the topic map then there is some extra information to be processed by the topic map processor. As I see it, though, this extra information will not accumulate; it will simply alter in relevant ways across successive xtmdoc interchange packagings.
b) The packaging processor has to be a bit cleverer than the one for Ahmed's method.
c) How this information is used once it is within the topic map processor is unspecified — but that doesn't matter as long as it conforms to interchange syntax; some processors might make a useful feature out of doing things with it, some might just ignore it.
This sub-proposal for (1) is separate from the issues concerned with packaging and ID clashes even though, as I hinted above, the proposal for (2) could make use of it. The idea in this sub-proposal is still very much a work in progress and is not necessary for XTM v.1.0 in any case. I mention it here merely to raise awareness.
I propose that a topic link element in a topic map could be used to address all those constructs within a topic map (whether in interchange syntax format, or within the processed topic map data model) to indicate that those constructs were in a particular XML Namespace. The topic would have the Public Identifier of 'XML Namespace' as its identity and it would have the type 'Namespace'. Its basename would be the URI for the namespace. Occurrence elements would then address the various constructs and the occurrence link elements could have types that accorded with those of the constructs being addressed: 'element' for an element type, 'attribute' for an attribute type, and so forth.
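As a rough picture of the shape of such a topic, the sketch below models it as plain data. It is not interchange syntax and not a proposed data model; the class names, field names, the namespace URI and the addresses are assumptions chosen only to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    type: str     # type of the construct addressed: 'element', 'attribute', ...
    address: str  # address of the construct within the topic map


@dataclass
class NamespaceTopic:
    base_name: str                   # the URI for the namespace
    identity: str = "XML Namespace"  # Public Identifier used as identity
    type: str = "Namespace"
    occurrences: list = field(default_factory=list)

    def constructs(self, kind):
        """Addresses of all constructs of the given type in this namespace."""
        return [o.address for o in self.occurrences if o.type == kind]


# A hypothetical namespace topic addressing three constructs.
ns = NamespaceTopic(
    base_name="http://example.org/ns/my-map",
    occurrences=[
        Occurrence("element", "map.xtm#topic-1"),
        Occurrence("element", "map.xtm#assoc-7"),
        Occurrence("attribute", "map.xtm#topic-1-id"),
    ],
)
elements_in_namespace = ns.constructs("element")
```

Because the namespace information sits in occurrences rather than in prefixes on the constructs themselves, the constructs addressed are left exactly as they were.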
Using this approach there is nothing to prevent a particular construct in the topic map from being addressed by more than one XML Namespace topic. If there is overlap, at the time of serialization the user must choose which constructs he wishes to place in what namespace for a given serialization. This might be of use, for example, in those cases where the serialization is to be extracted from or embedded within a non-topic map document such as DocBook.
I have also mentioned the possibility of using this approach to supply metadata over ordinary XML documents concerning their XML Namespaces. I will not address how this would be implemented here, but note that there is nothing to prevent such metadata being attached to an XML document by means of a processing instruction to that effect (in a manner similar to that of XSL style-sheets). Name clashes for element types and attribute types could be resolved with an approach similar to that of ID Clash Re-mapping above, but using a different structure to support the data.
Merits of this approach:
a) There are no nasty prefixes on tag names that prevent straightforward DTD validation with the basic interchange DTD.
b) It allows us to provide XML Namespace information over topic maps in a suitable manner.
I haven't managed to think of any drawbacks just yet.