Subject: [xtm-wg] lazy processing vs. extended XLink and BOS processing
Peter --

You seem to be saying that it's OK for incredibly time-consuming and
bandwidth-consuming searches to be performed by web crawlers, but not
OK for very similar processing to be performed by topic map engines as
they assemble lookup tables for addressed nodes in the hyperdocuments
that they, together with the resources that contain their topic
occurrences, constitute. There is a huge functional difference between
(a) the indexes that a traditional web crawler assembles and (b) the
lookup table for addressed nodes in a hyperdocument, but the expense of
making and persisting the latter is only marginally greater. If you're
willing to incur the cost of web crawling, why would you object to
making really useful hyperdocument lookup tables while you're at it, so
that all the inverse relationships are also available? It's those
inverse relationships that make topic maps able to offer something
radically better than what's already available. Without that radical
improvement, I fail to see the point of our whole XTM Specification
effort.

[Steve Newcomb:]
> To make a selection from some set, you must first obtain the set
> from which you want to select.

[Peter Jones:]
> [PPJ] Can't I just know the type/properties of the set and some
> location within it. Less overhead.

It's not really so much less overhead, compared with the cost of web
crawling. You have to do a web crawl to get the type/property/location
information you're talking about. The overhead involved in doing that
is enormous. Why not make it really count? If you have to expend all
the processing and bandwidth necessary to obtain, by web crawling, the
limited set of information you seem to want to be satisfied with, it's
wasteful to fail to remember also what is addressed by what, and in
what context(s). This "inverse relationship" or "where referenced"
information is essential to understanding topic map documents.
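The "where referenced" table described above can be accumulated as a
side effect of an ordinary crawl. Here is a minimal sketch of that
idea; the data layout and all names are invented for illustration and
are not part of any topic map or crawler API:

```python
# Sketch: building a "where referenced" (inverse) lookup table as a
# side effect of a crawl. All names here are illustrative.
from collections import defaultdict

def build_inverse_table(docs):
    """docs maps each document URI to the list of URIs it addresses."""
    where_referenced = defaultdict(set)
    for source_uri, targets in docs.items():   # the ordinary crawl pass
        for target_uri in targets:
            # remember the inverse relationship: who addresses target_uri?
            where_referenced[target_uri].add(source_uri)
    return where_referenced

docs = {
    "map.xtm": ["doc-a.html", "doc-b.html"],
    "doc-a.html": ["doc-b.html"],
}
table = build_inverse_table(docs)
```

The point is that the crawler already visits every source document, so
recording the inverse direction costs only one extra table insert per
link traversed.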
The approach you seem to be espousing, if I understand it, is that we
should be able to use existing web crawler technology as a finding aid
for topic links within topic maps that appear on the web. In your
scenario (if I understand it), web crawlers should discover topic map
documents on the web, take people directly to the topic links in those
documents, and let those people use the topic links to get wherever
they're going.

Some of the problems with this approach are:

* Such users won't know about the associations in which such topics
  participate, so the bulk of the usefulness of the topic map will not
  be available to them.

* Topics that don't happen to use your query keyword as a topic name
  won't be found, even if they're exactly the right topic.

* Scopes won't have any utility for information hiding, because the
  application won't know the scopes, much less anything about the
  topics that appear in the scopes. You'll get many irrelevant hits.
  Infoglut redux.

* Public topics won't be useful hubs from which you can get to topic
  maps that contain topics with the same subject identity.

In short, the approach you propose (if I understand it) would turn
topic maps into things that are not materially different, in their
functionality, from ordinary HTML documents in which every link always
takes you from where it is to somewhere else. If that's all you want,
I suggest maybe we should create an XML DTD for documents that only do
what you're interested in doing. But please be informed that I, for
one, will implacably oppose calling such an architecture a "topic
maps" architecture, because it won't support topic maps. None of the
primary design goals of the topic maps paradigm will be met by it.
> [PPJ] In the situation I am describing (I will have another stab at
> improving the communication of this in a forthcoming mail) I
> envisage the creation of the BOS as something that takes place after
> the retrieval (see also later comments about scope and integrity in
> this mail). As I don't yet have an adequate understanding of HyTime,
> I can't judge how flexible a BOS is. How open to revision at
> run-time is it?

It's as open to revision as we want it to be. Indeed, the BOS is
whatever we want it to be; it's declared (or not declared) however we
want to declare (or not declare) it.

The concept of BOS is inescapable, however. The BOS is whatever the
application in fact decides that it is; if the application solicits
user input on the question of what to regard as being "in the BOS",
then the BOS is whatever the user decides it is, whenever the user
makes that decision. The BOS is the de-facto perimeter beyond which
the application does not know what's doing any addressing. It's
axiomatic that all applications are limited in this way. The BOS of an
HTML browser, for example, is the currently-displayed HTML document,
full stop. I say this because HTML browsers do not know what, in the
currently displayed document, is being addressed by links in other
documents.

There is a one-to-one correspondence between BOSs and hyperdocument
lookup tables (in HyTime parlance, such tables are called
"hyperdocument groves"). It is not a problem for a single resource to
participate in any number of hyperdocument lookup tables (i.e., to be
regarded as a member of any number of bounded object sets (BOSs)).
There is no reason (other than tradition) why a web crawler can't
produce lots of hyperdocument lookup tables as a side-effect of its
web crawling activities.

[Steve Newcomb:]
> You also seem to be suggesting that we standardize some algorithm for
> selecting from a topic map only those constructs that are relevant to
> some set of resources.
> I question the general utility/advisability of
> this idea. To make a selection from a topic map may "edit it to
> death" -- e.g. by invalidating the scopes that contain themes that are
> no longer present in the "selected" version.

[Peter Jones:]
> [PPJ] If the addthms are always required to be at the top of an XTM
> doc a suitable compromise can be reached(?).

Sorry, I don't see the connection between your point and mine. What
does addthms have to do with it? Themes can be added (and they *must*
be addable) from within other topic map documents. If we disallow
this, the merging of read-only topic maps becomes insupportable.
Specifically, it becomes impossible for enterprising persons to
provide topic map products that serve to merge (and thereby add value
to) the topic maps of other enterprising persons.

> Even if the selected portion(s) of the topic map do not require any
> other parts of the topic map in order to have integrity, the topic
> map author's conception of the structure of knowledge will still be
> seriously affected; it's just not the same topic map any more.

> [PPJ] Yes. It isn't. (Echoes of Roland Barthes on the "Death of the
> Author".) But does that completely kill its utility?

I'd say, "Yes, it completely kills the marginal utility of topic maps
over vanilla HTML documents." How not? I think you can already do
everything you seem to want to do with plain HTML, or with XML and
"simple" XLink, which is not materially different in its linking
functionality from HTML's <a href="..."> link.

[Steve Newcomb:]
> I'd be happier to leave this whole question (i.e., the question of
> how topic maps can be made from other topic maps, and of how topic
> maps should be presented in particular contexts) to applications.
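The point about merging read-only topic maps can be sketched with a
toy example: a merged view adds themes from a second map without
modifying either source. The dict-based representation below is
invented for this sketch; it is not XTM syntax or any real API:

```python
# Toy illustration of merging read-only topic maps: a merged view
# combines topics and themes from several maps while leaving every
# source map untouched. The representation is invented for the sketch.

def merge_views(*maps):
    """Produce a merged view; the source maps are left unmodified."""
    merged = {"topics": set(), "themes": set()}
    for tm in maps:
        merged["topics"] |= tm["topics"]
        merged["themes"] |= tm["themes"]
    return merged

base = {"topics": {"opera", "composer"}, "themes": {"italian"}}
addon = {"topics": {"opera"}, "themes": {"german"}}  # adds a theme
view = merge_views(base, addon)
```

Because the merge produces a new view rather than editing the sources,
one vendor's read-only map can be enriched by another's without either
party touching the other's document.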
[Peter Jones:]
> [PPJ] It might be smart to specify some sort of default association
> that TM processors must implement, something to the effect that, in
> the absence of any defined associations connecting up topics found
> in the BOS, these will be automatically attached to a
> 'DefaultAssoc_MemberOfThisDocForNow' type assoc.

Leaving aside the question of the purpose or advisability of providing
such a default association, how would a TM processor know whether or
not a topic link was addressed by any association links, without
actually processing all such links?

You seem to be saying that you would prefer that topic maps provide
their own hyperdocument lookup tables syntactically, internally, and
redundantly. There are serious problems with that idea, including:

* We have to wait until the XML Infoset committee completes its work,
  so that there is a formal expression of that model (perhaps as an
  ISO property set). In the absence of this work, there is no
  Recommended way to express addresses, because there's no
  Recommendation regarding exactly what constitutes an addressable
  node in XML.

* Constant effort will be required to maintain each topic map
  document, in order to keep up with changes in the mapped resources.

* We would have to discard the existing ISO 13250 syntax; there would
  be no point in using it, because it's mostly (and could be made
  entirely) redundant, given the information in our syntactic
  representation of the hyperdocument lookup table. (BTW, there is
  already an ISO standard DTD for representing such tables, among many
  others. It's called the "Canonical Grove Representation DTD".)

[Steve Newcomb:]
> Either the anchors are known or they are not known. That means that
> either you have processed the whole bounded object set, or you have
> not. You can't make this computation lazily. If you have not made
> the computation up front, you have no way of knowing, when you're
> looking at something, what may be linked to it.
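The objection above can be made concrete with a small sketch: to know
which topics are untouched by any association, a processor must first
resolve *every* association link. The data layout is invented for
illustration only:

```python
# Sketch of the point above: deciding which topics would need the
# proposed default association requires a full pass over all
# association links -- none can be skipped.

def unassociated_topics(topics, associations):
    """associations: each one lists the topic ids of its members."""
    referenced = set()
    for assoc in associations:          # no way to skip any of these
        referenced.update(assoc["members"])
    return topics - referenced          # only now is "unassociated" known

topics = {"t1", "t2", "t3"}
associations = [{"members": {"t1", "t2"}}]
leftover = unassociated_topics(topics, associations)
```

Skipping even one association risks wrongly attaching a topic to the
default association, which is exactly why the computation cannot be
done lazily.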
[Peter Jones:]
> [PPJ] I don't agree that on the WWWeb things like this cannot be
> done lazily. It would seem to me that in the arena of publicly
> available topic maps on the web it is more like a necessity that we
> be able to do this.

[Steve Newcomb:]
> Either you have processed the pre-existing TM, or you have not. It
> can't be dug out piece by piece, unless the overhead of digging out a
> piece is equal to the overhead of processing the entire TM.

> Within the topic map document itself, we can't know what
> associations a topic participates in without reading and resolving
> *all* of the association links.

> [PPJ] See comments about laziness above. I see no problem with
> iterations with a set crawl depth.

A crawl depth has nothing to do with the question of whether you must
process all the association links in a given topic map document. A
"set crawl depth" is an example of a way of specifying a BOS. In fact,
it's the simplest way of specifying a BOS using the HyTime syntax for
specifying BOSs; it's called "boslevel".

[Steve Newcomb:]
> Within the set of resources mapped by a topic map (the bounded
> object set (BOS) that includes those resources as well as the topic
> map document itself), we can't know which parts of which resources
> are regarded as occurrences without reading and resolving *all* of
> the topic links.

[Peter Jones:]
> [PPJ] If we are assuming that the BOS is something that is indicated
> in a root doc, and that the only access to the contributing docs is
> via that root doc, then I think we are making some grossly
> unrealistic assumptions about the way access to publicly accessible
> topic map docs can be controlled.

I've heard the "grossly unrealistic" charge many times; I
categorically deny it. True, it's grossly unrealistic for anyone who
believes that extended XLink is grossly unrealistic. Many web people
evidently believe that extended XLink is grossly unrealistic.
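A depth-limited BOS in the spirit of HyTime's "boslevel" can be
sketched as a bounded breadth-first walk from a root document: include
everything reachable within N hops of addressing. The link graph and
function names below are invented for illustration:

```python
# Sketch of a "boslevel"-style BOS: everything reachable from the root
# document within a fixed number of addressing hops.
from collections import deque

def bos_by_level(links, root, boslevel):
    """links: doc -> docs it addresses. Returns the BOS as a set."""
    bos, frontier = {root}, deque([(root, 0)])
    while frontier:
        doc, depth = frontier.popleft()
        if depth == boslevel:
            continue                       # perimeter reached
        for target in links.get(doc, ()):
            if target not in bos:
                bos.add(target)
                frontier.append((target, depth + 1))
    return bos

links = {"root.xtm": ["a.html"], "a.html": ["b.html"], "b.html": ["c.html"]}
bos = bos_by_level(links, "root.xtm", 2)
```

Note that the crawl depth only bounds *which documents are in the
BOS*; within each document in the BOS, every link must still be
resolved.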
Existing technology (e.g., X2X, GroveMinder) makes extended XLink
pretty gosh-darn realistic-looking.

> Think about the way existing search engines on the web just index
> the whole shebang and let you dive in at any doc that's indexed.

...and just think about all the irrelevant hits you get with today's
search engines. Infoglut is one of the most important problems that
the topic maps paradigm was designed to solve. I believe you're
proposing to unsolve that problem here, in the name of lazy
processing. I've always thought that computers were supposed to
improve the productivity of humans, not the reverse.

[Steve Newcomb:]
> I realize that some exceptionally simple topic map applications may
> only need to provide traversal service from the map to the
> occurrences, and not from the occurrences to the map. This is like
> the WWW model of <a href="..."> links, in which you can go to the
> other anchor, but you can't start from the other anchor. However,
> this simplifying assumption, if generally applied to topic maps, would
> utterly destroy the significance of the phrase "topic map"; it would
> be a misappropriation of the "map" metaphor. A topic map based on
> this simplifying assumption would be like a road map that wouldn't let
> one find and use any appropriate road near wherever one actually was,
> in order eventually to get to wherever one wanted to go; all roads
> would lead one in the wrong direction -- away from the topic links --
> and they could only be entered at the topic links. If one must first
> be at a topic link in order to get anywhere else, it becomes literally
> true that "one can't get there from here", no matter where "here" is,
> unless "here" happens to be some topic link within the topic map
> document. I therefore claim that "lazy" processing of links and
> anchors is incompatible with the whole idea of topic maps.
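The one-way-links point above can be shown in a toy contrast (invented
data, not any real XLink API): with an <a href>-style simple link you
can only traverse away from the source; it takes an inverse lookup
table to start from the other anchor:

```python
# Toy contrast between one-way (<a href>-style) traversal and the
# two-way traversal an inverse lookup table makes possible.
from collections import defaultdict

links = {"topic.xtm": ["occurrence.html"]}   # simple link: source -> target

def forward(doc):
    return links.get(doc, [])                # always possible

def backward(doc, inverse):
    return sorted(inverse.get(doc, ()))      # possible only with the table

inverse = defaultdict(set)
for src, targets in links.items():
    for target in targets:
        inverse[target].add(src)             # the "where referenced" table

reachable = forward("topic.xtm")
referrers = backward("occurrence.html", inverse)
```

Without `inverse`, the question "who addresses occurrence.html?" has
no answer at all; that is the "can't start from the other anchor"
limitation in miniature.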
[Peter Jones:]
> [PPJ] But you could employ something like what C++ compilers use when
> they give all those 'unresolved external symbol' errors from the
> lookup table (assuming you've got a resource missing). It primes the
> app to return to the location it got the TM segment from to go back
> and look for more if necessary.

You can't generate a report about missing inverse relationships (the
answers to the question, "Who addresses me?") unless you know about
them, and if you know about them, they're not missing. I repeat: lazy
processing of links and their anchors is incompatible with the whole
idea of topic maps. The topic maps paradigm absolutely *requires*
extended XLinks. Simple XLinks simply won't cut it.

[Steve Newcomb:]
> Yes,
> pre-processing of bounded object sets (BOSs) is expensive. Yes, the
> Topic Maps paradigm is not supportable using existing commonplace
> Web-centric applications and processing conventions.

[Peter Jones:]
> [PPJ] Hmm. I sense sponsorship ebbing away in a matter of femtoseconds.

Hmmm. I sense huge commercial opportunities for people who can offer
the next generation of extended-XLink-aware web technologies that can
support web crawlers that create and maintain hyperdocument lookup
tables for inverse relationships within arbitrary bounded object sets.
I sense huge commercial opportunities for online information services.
I sense huge commercial opportunities for all kinds of businesses,
small and large. I sense big changes coming. As usual, there will be
winners and losers.

Let's remember that sponsorship is not required for the technical work
to continue to completion. True, TopicMaps.Org's marketing budget is
not likely to be funded by people whose goals are limited to selling
current search technologies that cannot support the topic maps
paradigm. Speaking only for myself, I am completely comfortable with
that. If you're not comfortable with that, Peter, now would be a good
time to raise the issue.

-Steve

--
Steven R. Newcomb, President, TechnoTeacher, Inc.
srn@techno.com  http://www.techno.com  ftp.techno.com
voice: +1 972 359 8160
fax:   +1 972 359 0270
405 Flagler Court, Allen, Texas 75013-2821 USA