tm-pubsubj message

Subject: Subject identification and ontological commitment : a real-world example
From: "Bernard Vatant" <bernard.vatant@mondeca.com>
To: "tm-pubsubj" <tm-pubsubj@lists.oasis-open.org>
Date: Tue, 28 Oct 2003 16:01:30 +0100

Patrick, and all

> Since we are now procedurally valid, let's turn our attention to
> *substantive questions* and see if we can capitalize on all the good work
> that Mary reminded me has already been done for PubSubj.

I have underlined *substantive questions* in Patrick's post, and I fully
agree with that position. Below I try to focus on a question that I
consider to be one of the most substantive (if not the most substantive) at
this point, and which has *always*, during the past two years of "good
work", been either left on the back burner or swept under the carpet (pick
your metaphor). The various debates we had around this unstated question
got stuck because everyone had his/her own implicit a priori answer(s), as
usually happens until questions are made explicit.

This question was at the core of my former proposal to use OWL for PSIs.
But when I made that proposal, I certainly pushed the answer too quickly,
before clearly setting out the question, which at the time was certainly
not completely clear in my mind. Moreover, the proposal carried too much
political context to be popular. So let's forget about any language,
technical or process solution for the moment, and focus on the following
questions.

Q1: Is subject identification independent from ontological commitment?

I expand below on the two concepts employed here, and why I consider the
answer to this question to be "no".

Q2: If the answer to Q1 is "no", how can we articulate the two concepts in
our recommendations?

Any further technical recommendation for PSI structure, metadata,
publishing process, use ... should be based on explicit answers to Q1 and
Q2, and a consensus on those is IMO a prerequisite to any further
deliverables.

We have addressed in Del 1 the question of subject identifiers and subject
indicators, but we have not really addressed the question of *subject
identification*.
Subject identification is based on agreement to use some type of subject
identifiers, following the same set of rules, in some type of processing
context. For example, XTM use of subject identifiers is such a processing
context. Subject identification in XTM processing is linked to the use of
identifiers in some specific way, like under <subjectIndicatorRef>. If a
subject identifier (URI) is used under <occurrence>, it does not
necessarily support a process of subject identification (and merging). And
if the same subject identifier is used outside an XTM context, what
could/should the identification process and rules be? Do we let every other
user of PSIs set their own rules for subject identification outside TM? Or
do we mention/recommend processing contexts and rules (e.g. in XML, RDF,
OWL, UDDI, DC metadata ...)?
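
To make the point about processing context concrete, here is a minimal
sketch in Python. It is not an XTM processor: the Topic class, the
same_subject function and the example.org URI are all hypothetical, and
the only rule modelled is the one described above, namely that a URI
supports identification (and merging) when it is used as a subject
indicator, and not when the same URI merely appears as an occurrence.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Topic:
    """Hypothetical, simplified stand-in for an XTM topic."""
    name: str
    # URIs that would appear under <subjectIndicatorRef>
    subject_indicators: Set[str] = field(default_factory=set)
    # URIs that would appear under <occurrence>
    occurrences: Set[str] = field(default_factory=set)


def same_subject(a: Topic, b: Topic) -> bool:
    """Assumed rule: two topics identify the same subject only if they
    share at least one URI used as a subject indicator."""
    return bool(a.subject_indicators & b.subject_indicators)


# Hypothetical identifier, for illustration only.
PSI = "http://example.org/psi/some-subject"

t1 = Topic("Topic from map A", subject_indicators={PSI})
t2 = Topic("Topic from map B", subject_indicators={PSI})
t3 = Topic("Topic that only points at the URI", occurrences={PSI})

print(same_subject(t1, t2))  # shared subject indicator
print(same_subject(t1, t3))  # same URI, but used as an occurrence
```

The same URI appears in all three topics, yet only the first pair would be
candidates for merging under this rule. Outside such a context, nothing
says which of the two uses is intended.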

Let me take a real-world example, where universal identifiers (ISBN
numbers) are efficiently used for subject identification in a distributed
environment.

http://isbn.nu is a very cool site providing syndicated search on books
based on Author, Subject, Title or ISBN.
http://isbn.nu/about.html says much about our issues in a nutshell. "This
site is a proof of concept of several ideas about information management,
organization, and linkage. It's also an attempt to show how smarter systems
combined with cleaner URLs can create shortcuts around roadblocks."

Look at how it works. Type in the ISBN search field either "0-534-94965-7"
or "0534949657". The URL generated from that search is
http://isbn.nu/0534949657 - de facto an efficient subject identifier for
John Sowa's book "Knowledge Representation" in this context. Note the
clean syntax, no weird query string, as simple as can be. If you search by
author or subject or title, you will retrieve a list of books, each
identified by one of those ISBN-URL-PSIs.
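
The mapping from either typed form to the same URL can be sketched as
follows. This is only a guess at the kind of normalization involved, not a
description of how isbn.nu is actually implemented: the function names are
hypothetical, and the check-digit test is the standard ISBN-10 rule
(weighted sum divisible by 11), which the site may or may not apply.

```python
import re


def normalize_isbn10(raw: str) -> str:
    """Strip hyphens and spaces; reject anything that is not ISBN-10 shaped."""
    compact = raw.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", compact):
        raise ValueError(f"not an ISBN-10: {raw!r}")
    return compact


def isbn10_checksum_ok(isbn: str) -> bool:
    """Standard ISBN-10 rule: weights 10 down to 1, sum divisible by 11."""
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch))
        for i, ch in enumerate(isbn)
    )
    return total % 11 == 0


def isbn_to_uri(raw: str, base: str = "http://isbn.nu/") -> str:
    """Assumed one-to-one mapping from an ISBN to a URL of the form shown above."""
    isbn = normalize_isbn10(raw)
    if not isbn10_checksum_ok(isbn):
        raise ValueError(f"bad check digit: {raw!r}")
    return base + isbn


print(isbn_to_uri("0-534-94965-7"))
print(isbn_to_uri("0534949657"))
```

Both calls yield http://isbn.nu/0534949657, which is what makes the URL
usable as an identifier: two spellings of the ISBN, one identifier.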

What you get from that URL is a search result syndicated from various
booksellers, including current availability and prices, and links to
partner sites. The process obviously uses the ISBN identifier throughout
to query different databases in various ways, which suggests that all the
partners have agreed on the way to deal with ISBN, both as an internal
subject identifier and for syndication transactions. BTW, the results are
very impressive. One interesting thing is that one of the partners is
amazon.com. But if you search directly at amazon.com for "ISBN 0534949657"
or "ISBN 0-534-94965-7" you get total silence for the former and total
noise for the latter. So the same database, depending on the processing
context, can make sense or not of subject identifiers.

I find this example very interesting food for thought and a wonderful
illustration of what subject identifiers can achieve in a distributed
environment. What we can learn from it is why and how it works so well.
Some reasons can be listed.

1. The class of subjects which are uniquely identified in the process
(Book) is clearly defined and well known to all the actors in the
information system: syndication site, providers, end users.
2. All the actors are aware of ISBN as being an identifying attribute for
this class Book and use it that way.
3. All the actors make the same sense of attributes attached to (and in
some sense defining) instances of that class, either generic and permanent
ones (title, author, publisher, publication date) or local context-defined
ones (sales price, availability, time to ship).
4. The issue of identifiers being URIs or not, is tackled here in the
simplest way possible. Internally, the system certainly uses only the ISBN
itself, but the Web human interface uses a URI-PSI-fied form of ISBN, with
an obvious one-to-one correspondence.

It is clear that points 1, 2 and 3 above boil down to having all actors in
the process commit implicitly to the same ontology for the class Book - an
ontology which could easily be made explicit and formally expressed using
the attributes listed above. Some such explicit description, whatever its
formal expression, is certainly in place under the hood to allow
syndication of content between http://isbn.nu and providers.
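
As a rough illustration of what making that ontology explicit could look
like, here is a sketch in Python rather than in any ontology language. The
class and field names simply follow the attributes listed in points 1 to
3. The Offer class, the syndicate function, the provider names and the
prices are all invented for the example, and nothing here reflects the
actual data exchanged between isbn.nu and its providers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Book:
    """Generic, permanent attributes of the class Book."""
    isbn: str  # the identifying attribute (point 2)
    title: str
    author: str
    publisher: Optional[str] = None         # left unset in this example
    publication_date: Optional[str] = None  # left unset in this example


@dataclass(frozen=True)
class Offer:
    """Local, context-defined attributes supplied by one provider."""
    isbn: str  # refers to a Book through the shared identifier
    provider: str
    sales_price: float
    available: bool
    days_to_ship: int


def syndicate(book: Book, offers: List[Offer]) -> List[Offer]:
    """Collect the offers that refer to the same subject as `book`."""
    return [offer for offer in offers if offer.isbn == book.isbn]


book = Book(isbn="0534949657", title="Knowledge Representation",
            author="John Sowa")

# Invented offers from hypothetical providers.
offers = [
    Offer("0534949657", "provider-a", 10.0, True, 2),
    Offer("0534949657", "provider-b", 12.5, False, 7),
    Offer("0000000000", "provider-a", 5.0, True, 1),
]

matching = syndicate(book, offers)
print([offer.provider for offer in matching])
```

The join on isbn only gives sensible results because every provider means
the same thing by Book, by isbn and by the other fields. That shared
meaning is the ontological commitment; the identifier alone does not carry
it.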

So, coming back to Q1, this example shows that efficient subject
identification needs some ontological commitment from all the users of the
identifier in the same context. Coming back to Q2, it gives some hints of
what this ontological commitment consists of, and how it could be made
explicit by formal reference to a common ontology.

Hope that helps to clarify what I am getting at now.

Bernard
References:
- Positions, process and PubSubj
  - From: Patrick Durusau <Patrick.Durusau@sbl-site.org>