topicmaps-comment message

Subject: Re: [xtm-wg] An XTM test suite
From: Lars Marius Garshol <larsga@garshol.priv.no>
To: xtm-wg@yahoogroups.com
Date: 18 Feb 2001 11:36:32 +0100

* Steven R. Newcomb
| 
| I don't understand how it can be simpler than the existing XTM
| syntax.

It has to be, since:

 - there will be no mergeMap element (they have all been merged in)

 - subjectIdentity will have no topicRef child (would have caused a
   merge)

 - instanceOf, scope, parameters and roleSpec will not have
   resourceRef or subjectIndicatorRef children (will be replaced by
   topicRef elements)

There are probably more simplifications that I have forgotten right
now.

| It looks to me as though it must have more element types (such as
| ones that make topic namespaces redundantly explicit), 

What are topic namespaces?

| and that the element types that do correspond (in some sense) to XTM
| element types will necessarily have different semantics, as well.
|
| For example, the Conceptual Model clearly establishes that, under
| the covers, an occurrence is really a topic-occurrence association.
| What does this mean for the "canonical output" form? 

I believe that it should have no consequences. Representing the
relationships between topics and their occurrences as associations is
not required by the specification, and so there is no need to test if
processors actually do this.

| I believe that we must output a topic-occurrence association (note
| that I did *not* say <association>, I said "association").

Why?
 
* Lars Marius Garshol
|
| This is close to it, yes.  The idea is that in this syntax, any two
| topic maps that are logically equivalent will have the exact same
| serialized representation.
 
* Steven R. Newcomb
|
| It's a good idea, if we can make it work.

I'm glad to hear you think so. :-)
 
* Lars Marius Garshol
|
| A canonical XTM document must
| 
|  - be UTF-8-encoded
 
* Steven R. Newcomb
|
| Why this particular encoding?  What does character encoding have to
| do with it, as long as the mappings between character encodings are
| unambiguous and explicit?

Because the canonical format is easier to use if the output is
guaranteed to be byte-by-byte identical.

UTF-8 is the perfect choice for this, since it can represent all
Unicode characters directly and since it is readable even with tools
that are not Unicode-aware.
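
A minimal sketch of why a fixed encoding matters for byte-by-byte
comparison. The sample string and the helper name canonical_bytes are
illustrative assumptions, not part of any XTM proposal.

```python
def canonical_bytes(text: str) -> bytes:
    """Serialize text using the single encoding the canonical form would mandate."""
    return text.encode("utf-8")

name = "Troms\u00f8"  # sample string containing one non-ASCII character

utf8 = canonical_bytes(name)
latin1 = name.encode("iso-8859-1")

# Same characters, different bytes: a byte-level diff would report a
# mismatch even though the two outputs are logically identical.
print(utf8 == latin1)                                        # False
print(utf8.decode("utf-8") == latin1.decode("iso-8859-1"))   # True
```

With one mandated encoding, two processors' outputs can be compared
with a plain byte comparison and no decoding step.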
 
* Lars Marius Garshol
|
| - have all elements (topic, association, baseName, topicRef etc) in
| a specific order, probably based on the lexical order of IDs and
| names
 
* Steven R. Newcomb
|
| I don't see how this can work, unless we want to straitjacket the
| order in which <topicMap> elements and their contents are scanned
| and processed, and force all applications to keep a record of that
| order, even though that order has no significance.

The idea is not to reproduce the original input order, but to impose
_a_ specific order. If there is no specified order there is no hope
that the output from different processors will be identical, either.
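
The point about imposing _a_ specific order can be sketched as
follows. The dictionary layout and the sort key (subject indicators,
then base names) are assumptions made for illustration only; the
thread has not settled on what the order should be based on.

```python
def topic_sort_key(topic):
    """Build a key from the topic's content, not from its input position."""
    return (sorted(topic["subject_indicators"]), sorted(topic["base_names"]))

def canonical_order(topics):
    """Return the topics in an order that does not depend on how they were read."""
    return sorted(topics, key=topic_sort_key)

# Two hypothetical processors hold the same topics in different internal order.
processor_a = [
    {"subject_indicators": ["http://example.org/psi/norway"],
     "base_names": ["Norway", "Norge"]},
    {"subject_indicators": [], "base_names": ["Canonical XTM"]},
]
processor_b = list(reversed(processor_a))

# After sorting on content, both produce the same sequence.
print(canonical_order(processor_a) == canonical_order(processor_b))  # True
```

Neither processor has to remember the input order. Topics whose keys
compare equal would stay in input order here, so a real rule would
need a tie-breaker or a guarantee that such topics have been merged.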

| This is a very unappealing prospect: to require applications to keep
| track of nonsignificant information, incurring significant overhead
| just so their conformance to the Spec can be verified.

I agree. It has to be a goal for the canonical format to avoid this.
 
| The unique identifiers (IDs) of elements found in the
| content of <topicMap> elements cannot serve as the
| basis for imposing a canonical order, either.  
| 
| * First of all, many (perhaps most?) of the elements
|   that demand the existence of topics in the
|   application-internal representation are #IMPLIED, so
|   we won't have IDs for all of them.  What do we do
|   with the ones that don't have IDs?
|
| * Secondly, when we're merging multiple XTM documents,
|   the IDs of the elements aren't necessarily unique.
|   What do we do when two topics have the same ID?
 
Good points. That means two things: that we can't use IDs, and that the
canonical spec must specify how to assign IDs to all topics.
 
* Lars Marius Garshol
|
| - have all attributes in a specific order (and
|   possibly conform to the canonical XML specification)
 
* Steven R. Newcomb
|
| OK.  (Why only "possibly"?  Making everything totally
| deterministic is the whole point of this exercise.)

I say possibly because I haven't really thought it through. The spec
would either have to say that all canonical XTMs must conform to the
canonical XML spec, or leave it out entirely.
 
* Lars Marius Garshol
|
| - have only normalized URIs
 
* Steven R. Newcomb
|
| What constitutes "normalization" of URIs? 

I think we will have to specify it, but at least:

 - case normalization of scheme and host names
 - removal of default port numbers
 - deterministic %-escaping and absolutization
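
A rough sketch of the three steps above, using Python's standard
urllib.parse module. The function names, the table of default ports,
and the choice to decode escaped unreserved characters and upper-case
the remaining escapes are assumptions for illustration, not a proposed
rule. IPv6 host literals are not handled.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed table of default ports; a real spec would have to list these.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz0123456789-._~")

def _fix_escape(match):
    """Decode escapes of unreserved characters; upper-case all other escapes."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()

def normalize_escapes(text):
    return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, text)

def normalize_uri(uri, base=None):
    # Absolutization: resolve a relative reference against the base URI.
    if base is not None:
        uri = urljoin(base, uri)
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep any userinfo as written; only host and port are normalized.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + host
    # Drop the port when it is the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + str(parts.port)
    return urlunsplit((scheme, netloc,
                       normalize_escapes(parts.path),
                       normalize_escapes(parts.query),
                       normalize_escapes(parts.fragment)))

print(normalize_uri("HTTP://Example.ORG:80/%7Elars/a%2fb"))
# http://example.org/~lars/a%2Fb
```

This only makes the spelling of a URI deterministic. It makes no
attempt to decide whether two differently spelled URIs refer to the
same resource.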

| We must not create a conformance requirement that prevents
| application builders from competing on the basis of the amount of
| intelligence that is brought to bear on the question of whether two
| URIs actually refer, ultimately, to one and the same resource.

I agree that this should be a goal, and I think it is achievable. It
may mean, however, that test cases will have to be constructed so that
such extra intelligence does not trigger merges that would otherwise
not happen.

| One way to handle this is to support a user's ability to "dumb down"
| the URI-comparison processing to some specified level, just for
| purposes of outputting a canonical form simply for establishing
| conformance to the Spec in all other Spec-required respects.

I thought about that, too, but producing test cases that do not make
this an issue may be easier to achieve. In this case the input is
to some extent controlled, which makes things somewhat easier.
 
| This remark leads me to believe that you are thinking in terms of
| using some version of the XTM syntax as the canonical output syntax,
| as if XTM syntax were somehow the same thing as this canonical
| output idea.

Well, yes, that was the idea. Since we already have a serialization
syntax for these constructs, it seemed natural to build on it and
modify no more than necessary.

| * It would be very bad if there were any confusion
|   whatsoever about whether a particular XML element or
|   document is expressed in XTM syntax or in our
|   canonical output syntax.  The best way to avoid such
|   confusion is to avoid having element type names in
|   common between the two syntaxes.

Hmmm. This could be achieved by using different namespaces, I guess.
 
| * Having element type names in common will greatly
|   diminish our (the XTM Authoring Group's) ability to
|   communicate clearly and unambiguously among
|   ourselves.  When we say "<topic>", we really must be
|   disciplined in meaning only what that string
|   (<topic>) means at input time, because the
|   corresponding construct that appears in canonical
|   output is not exactly the same kind of thing (for one
|   example of why this is true, see the discussion of
|   topic-occurrence associations, above). If we don't
|   establish these distinctions in our discussions, we
|   will misunderstand each other, and our productivity
|   as a group will be diminished.

I find it difficult to imagine any possibility for confusion here.
Anyone saying <topic> and meaning the topic element in the canonical
syntax will just have to make that clear from the context, since this
will most likely be a rare occurrence. And even if confusion were to
occur I don't see how it could become very serious.

| * Having element type names in common will muddle our
|   thinking as individuals.  We must not allow ourselves
|   to make unconscious assumptions about the nature of
|   processed topic map information.  The structure of
|   the canonical output must reflect precisely the
|   abstract structure of the application-internal form
|   of topic map information, as it will be defined by
|   the Authoring Group.  The syntactic structure of the
|   input documents is irrelevant, and pretending that it
|   is somehow relevant will only blind and confuse us.
 
I strongly agree with everything you say in this point, except for the
first sentence, and I don't really see how it is connected to the rest
of your paragraph. How will using nearly the same syntax for
serialization and canonicalization muddle our thinking?

* Lars Marius Garshol
|
| This I don't follow. You seem to imply here that something more than
| what I propose above is needed. My problem is that I have a release
| schedule to meet and must act very quickly indeed. So if something
| radically more complex is needed I would prefer to do this first,
| and then that as a second stage.

* Steven R. Newcomb
|
| OK.  In order to walk in a particular direction, we must move by
| steps.  I would only ask that each of us tries to be objective about
| technical decisions.  That means trying not to make technical
| decisions on the basis of our own individual business objectives,
| but rather on the basis of how best to develop the industry as a
| whole.  The only thing that competitors can be expected to agree
| about is how to make the industry grow (and even that much is a
| minor miracle).  I hope there won't be too many conflicts among us,
| and that the resolution of the conflicts can be navigated in a way
| that doesn't bruise anyone economically.  Taking well-considered
| steps *together* is a good way to do that.

I agree with all of this.  It was silly of me to raise the subject at
all.  Please forget it.

| BTW, I'm voting "Yes" on XTM 1.0, although I have grave misgivings
| about Annex F, which I find misleading -- not so much by what it
| says, but by what it doesn't say.

What is it you feel it should say that it does not?  Of course, it
lacks an object model and so necessarily is only a shadow of what it
ought to be, but given that, it is acceptable, I think.

--Lars M.

