Subject: Re: [dita] Linking Related Issues To Considered in 1.1
From: Eliot Kimber <ekimber@innodata-isogen.com>
To: DITA TC list <dita@lists.oasis-open.org>
Date: Tue, 01 Jun 2004 15:49:33 -0500

JoAnn Hackos wrote:

> Eliot, I'm interesting in your 4th item but lack the technical 
> context to understand it. Could you explain what you're getting out 
> in non-technical terms? What kind of reference is needed? If the 
> other 3 thoughts have translation, that would also help for those of
>  us who understand the business requirements but can't necessarily 
> translate them into technical requirements.

I'll take my best shot. I can't guarantee I've made it any clearer.

I'll try to explain the issue in 4 and then try to translate the other
issues below.

We might call issue 4 the "link target rendition context ambiguity"
issue. That is, in some circumstances it can be ambiguous what the
*rendition* context of a particular link target is or should be. This is
entirely a side effect of going from a modular authoring data set (e.g.,
DITA topics related by maps and other links) to a monolithic delivery
form that is not modular, such as creating single PDF "books" from a set
of topics.

This is an issue that I'm starting to run into in project work I'm doing
that uses XInclude for doing use-by-reference. The problem has always
been inherent in any scenario where you have use by reference and 
linking where a given target element might occur in two different use 
contexts. DITA, because it explicitly enables use-by-reference and does 
linking, makes it likely that this problem will occur in some situations.

For example, imagine you have a topic that links to another topic,
call them Topic A and Topic B.

In the authoring repository, the link is authored as a reference to the
target *as authored*, that is, something like:

topic-a.xml:

<topic id="topic-a">
   <title>Topic A</title>
   <body>
     <p>See <xref href="topic-b.xml"/>.</p>
   </body>
</topic>

topic-b.xml:

<topic id="topic-b">
   <title>Topic B</title>
   ...
</topic>

At the authoring level there is no ambiguity--the reference is to a
single target element (the <topic> element within topic-b.xml) and its
location is unambiguous.

Now imagine that we have two "master" topics that in this case serve to
define "units of publication" that we intend to publish as single PDFs,
Master 1 and Master 2. Both Master 1 and Master 2 use topicref links to
include Topic B, e.g.:

master-1.xml:

<topic id="master-1">
   <title>Big Book One</title>
    ...
   <topicref id="b-ref-1" href="topic-b.xml"/>
    ...
</topic>

master-2.xml:

<topic id="master-2">
   <title>Big Book Two</title>
    ...
   <topicref id="b-ref-2" href="topic-b.xml"/>
    ...
</topic>

We also have Master 3 that includes Topic A:

master-3.xml:

<topic id="master-3">
   <title>Big Book Three</title>
    ...
   <topicref href="topic-a.xml"/>
    ...
</topic>

Now we have two rendition contexts for Topic B: Big Book One and Big
Book Two.

When we render master-3 we want the link from Topic A to Topic B to be
reflected in the renditions, but without more information it's ambiguous
whether the link should be to Topic B as included in Master 1, to Topic B
as included in Master 2, or to both.

Therefore, it's not sufficient to just point to Topic B itself; we
need to somehow indicate which use context for B we want to link to.

The only general solution to this is to indicate, on the link, which use
context to use. This can be done by including a pointer to the element
or elements that reference Topic B in addition to Topic B itself, e.g.:

<topic id="topic-a">
   <title>Topic A</title>
   <body>
     <p>See
<xref href="topic-b.xml" use-context="master-2.xml#b-ref-2"/>.</p>
   </body>
</topic>

Now the xref element has enough information to indicate which of the
many uses of B we want the link to resolve to.

That's the basic idea.

Some practical notes:

1. Having a hard-coded link like in the example is obviously counter to
re-use, so one would expect to use maps or some other form of stand-off
annotation for this. For example, I could replace the xref with a
map-based link that was in a map specific to the rendition and therefore
"knows" about master-1, master-2, and master-3. Or I could add a new
element to the map markup that serves just to link to embedded links to
establish their use context. For example, instead of adding the
use-context= attribute to topic-a, I could add this to the map
associated with Master-3:

<map title="Big Book 3">
   <topicref href="topic-a.xml">
     <use-context-spec
        href="topic-a.xml#xref-01"
        use-context="master-2.xml#b-ref-2"/>
   </topicref>
   ...
</map>

2. The use context pointer may need to be a list of use contexts. Since
one use-by-reference may itself be included in several contexts, it's not
necessarily sufficient to point to just the immediate pointer to a link
target. To continue the use-context-spec markup, you could do this with
nested elements, where nested elements establish the context for their
parent elements, e.g.:

<map title="Big Book 3">
   <topicref href="topic-a.xml">
     <use-context-spec
        href="topic-a.xml#xref-01"
        use-context="master-2.xml#b-ref-2">
       <use-context-spec
          href="map-2.xml#topic-ref-04"/>
     </use-context-spec>
   </topicref>
   ...
</map>

3. Note that this problem can occur even when there is only a single
rendition output because the same target might still be included
multiple times within a single set of topics. For example, there might
be a common subtask that is used by many higher-level tasks (in aircraft
maintenance docs at Boeing we had the "put away your tools and clean up
your work area" subtask that was included by almost every (if not every)
main task). Therefore, it is not sufficient, in general, to simply
require that all links be within the same "unit of publication".

4. Note that when the storage structure of the rendition matches the
topic storage structure such that every target exists exactly once in
the rendered output then the problem does not occur. This would be the
case for example when generating HTML pages from topics or some such.

5. For any given document set it may be possible to avoid this problem
either by limiting the scope of links or by imposing editorial
constraints or practices that tend to avoid the issue. However, I don't
see how the problem can be avoided in the general case, especially as
you move toward an environment where the scope of re-use is wide (with
respect to the amount of communication between the humans involved) and
where units of publication are generated as dynamically as possible, for
example by queries against topics based on descriptive metadata. In
these situations it will be difficult or impossible to prevent target
ambiguity editorially.

However, it will always be possible to *detect it* for a given set of
documents to be rendered as a unit, so it will also be possible to only
worry about solving the problem when it happens--it shouldn't, for
example, be necessary to specify a use context for every link in a map
or topics, but only for those where there is in fact an ambiguity. So
even though a markup system like DITA must provide a way to resolve link
target ambiguity, I do not think that authors will normally need to
worry about it as a matter of daily practice.
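
A minimal sketch of that detection step, in Python, assuming the three
master documents from the example are the set being rendered. The
in-memory strings, the function names, and the id "a-ref-1" are
illustrative assumptions, not part of DITA or of any existing tool:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stand-ins for the master documents in the example above.
# The id "a-ref-1" is invented here so every topicref can be pointed at.
MASTERS = {
    "master-1.xml": '<topic id="master-1"><topicref id="b-ref-1" href="topic-b.xml"/></topic>',
    "master-2.xml": '<topic id="master-2"><topicref id="b-ref-2" href="topic-b.xml"/></topic>',
    "master-3.xml": '<topic id="master-3"><topicref id="a-ref-1" href="topic-a.xml"/></topic>',
}

def use_contexts(masters):
    """Map each referenced target to the topicrefs that include it."""
    contexts = defaultdict(list)
    for filename, source in masters.items():
        root = ET.fromstring(source)
        for ref in root.iter("topicref"):
            # Record a pointer to the referencing element, not to the target.
            contexts[ref.get("href")].append(f"{filename}#{ref.get('id')}")
    return dict(contexts)

def ambiguous_targets(masters):
    """Keep only the targets that are included from more than one place."""
    return {target: refs
            for target, refs in use_contexts(masters).items()
            if len(refs) > 1}

ambiguous = ambiguous_targets(MASTERS)
for target, refs in ambiguous.items():
    print(target, "is used by", ", ".join(refs))
```

Only topic-b.xml is reported, so only links to Topic B would need a use
context; links to Topic A can be left alone.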

> 1. General practice for addressing.  The current DITA spec and/or 
> implementations appear to be using URLs with fragement identifiers 
> somewhat informally (i.e., not conforming, for example, to the 
> XPointer spec.  I think that DITA should limit itself to XPointer for
>  addressing.  Second, I'm now of the opinion that the XInclude style
>  of address that uses two attributes, one for the URL with no
> fragment ID and one for the XPointer, is the best design and should
> be emulated.

The issue here is that in an XML processing context there is only one
*standard* for representing addresses of components within documents,
namely XPointer. While it's easy to implement a less strict syntax for
doing URL-based addressing in a specific implementation context, such as
DITA *as used within IBM*, I don't think it's appropriate in a more
general standard. For example, the URL "foo.xml#/bar/baz" does not
correspond to any current OASIS, W3C, or ISO standard for addressing. As
humans we can see "#/bar/baz" and guess with some confidence that this
is an XPath but we have no *formal* basis on which to make that
assumption. If we were to hand that URL to a generalized Web client and
server pair there is no guarantee that we would get the result we want.

The XPointer spec provides a formal syntax for declaring the addressing
scheme you're using for fragment identifiers, e.g.:

foo.xml#xpointer(/bar/baz)

In this case we know without question that "/bar/baz" is in fact an
XPath addressing the "baz" children of the document element, provided
that element is named "bar".

The alternative would be for DITA to define its own syntax and
conventions for doing addressing, but that would essentially mean that
DITA documents could only be processed by DITA-aware Web clients.

This also implies another basic principle for a standard like DITA: it
can only define things in terms of the markup scheme it defines and not
in the context of any particular processing implementation or model.
That is, DITA can only talk about what a link means in terms of its
relationship to other DITA-defined constructs, not in terms of what
things might be generated from those DITA-defined constructs.

One potential issue with the IBM submission as submitted is that it
reflects a pragmatic system built around a particular implementation,
one that didn't require the same degree of abstraction or restraint that
a more general standard requires. This means that there will likely be
several places where markup that worked fine in the IBM implementation
(because it was implemented) is not sufficiently formal for a
specification that is defined without direct reference to any particular
implementation.

I'm also making a presumption that we are obligated to prefer W3C
standards where they are applicable. Thus my focus on XPointer and the
basic rules for URI resolution.

The deal with XInclude's two-attribute approach is that fragment
identifiers are inherently not interoperable in that it is always up to
the user agent to interpret fragment identifiers. By not using fragment
identifiers in the base URL you both emphasize that the URL is
addressing the containing target *document* and stress that addressing
the specific element is up to the user agent (because it's in a separate
attribute). By naming the fragment identifier attribute "xpointer" you
emphasize that the addressing syntax is XPointer and nothing else.

For practical processing, it makes it *much* easier to process in tools
like XSLT where otherwise you have to do non-trivial string processing
to pull apart the URL part and the fragment identifier part. I've
implemented both forms of address processing and having a separate
xpointer= attribute makes things so much easier. It's still a small
thing relative to the total size of any given system, but it helps.
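
To make the difference concrete, here is a rough sketch in Python rather
than XSLT (where the same split needs substring-before/substring-after
calls). The function names are hypothetical, and the xpointer() handling
covers only that one scheme, not shorthand pointers or multiple scheme
parts:

```python
from urllib.parse import urldefrag

XPOINTER_PREFIX = "xpointer("

def split_single_attribute(href):
    """Single-attribute form: the processor must pull the URL apart itself."""
    url, fragment = urldefrag(href)
    return url, fragment

def split_two_attributes(attributes):
    """Two-attribute form: both parts are already separate; nothing to parse."""
    return attributes["href"], attributes.get("xpointer", "")

def xpath_from_pointer(pointer):
    """Return the XPath inside an xpointer(...) scheme part, else None."""
    if pointer.startswith(XPOINTER_PREFIX) and pointer.endswith(")"):
        return pointer[len(XPOINTER_PREFIX):-1]
    return None

single = split_single_attribute("foo.xml#xpointer(/bar/baz)")
double = split_two_attributes({"href": "foo.xml",
                               "xpointer": "xpointer(/bar/baz)"})
print(single == double)               # both forms yield the same pair
print(xpath_from_pointer(single[1]))  # the XPath inside the scheme part
```

Both forms carry the same information; the two-attribute form simply
removes the parsing step and the guesswork about what the fragment is.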

> 2. Role names should be required to be namespaced so that 
> applications can formally define role names and clearly relate them 
> to the semantics of the role name. For simplicity and backward 
> compatibility we can stipulate that the existing DITA-defined role 
> names are in the DITA namespace.  The implication here is that DITA 
> processors will dereference namespace prefixes on role names in order
>  to construct fully-qualified role names before associating roles to
>  semantics.

For links the role= attribute plays essentially the same role that
element type names play in typical XML processing: it allows a mapping
from the name to some semantic or behavior so that authors can predict
with some accuracy what the processing result will be. One proof of this
is that in DITA, which allows specialization, any link with a role=
attribute could be rewritten as an element type where the role= value is
the element type, e.g.:

<navref href="foo.xml" role="parent"/>

becomes:

<parent href="foo.xml" type="navref/navref"/> <!-- more or less -->

Therefore there's no real distinction between an element type name and a
role name: they are both managed semantic labels that exist in some
controlled set of names in order to associate semantics with data.

As semantic labels, role= values need to be defined and managed just as
element type names do. Therefore it follows that role names should be
namespacable since that is our primary mechanism for associating names
within defined sets of names, such as schemas. One could even imagine an
extension to the XSD Schema markup that defined role types and constraints.
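
As an illustration of what "dereferencing namespace prefixes on role
names" could look like in a processor, here is a small Python sketch. The
namespace URIs, the "acme" prefix, the "supersedes" role, and the function
name are all invented for the example; nothing here is defined by DITA:

```python
# Hypothetical namespace URI standing in for "the DITA namespace".
DITA_ROLE_NS = "http://example.org/ns/dita-roles"

def qualify_role(role, prefixes, default_ns=DITA_ROLE_NS):
    """Expand a role= value into a (namespace, local name) pair.

    Unprefixed names are taken to be the existing DITA-defined roles,
    which is the backward-compatibility rule proposed above.
    """
    prefix, separator, local_name = role.partition(":")
    if not separator:
        return (default_ns, role)
    if prefix not in prefixes:
        raise KeyError(f"undeclared role prefix: {prefix}")
    return (prefixes[prefix], local_name)

# Prefix declarations as they might be gathered from the document.
declared = {"acme": "http://example.com/acme/roles"}

print(qualify_role("parent", declared))
print(qualify_role("acme:supersedes", declared))
```

Semantics would then be associated with the fully qualified pair rather
than with the bare string, so two applications can both define a
"supersedes" role without colliding.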

> 3. Make sure we've clearly defined the relationship between format=
> and dynamic determination of MIME types for target resources. Format=
> should be a hint that augments whatever might be learned from the
> MIME type of the target.

A basic principle of the Web is that resources tell the user agent what
their data type is and then the user agent does whatever it can to
accommodate that. This is why XML Notations are essentially pointless:
not only are external data entities fairly pointless, but predeclaring
the data type of some reference is counter to normal Web practice.

Therefore, any time you are using a URI to address resources you are
implicitly in an environment where you should be able to learn the MIME
type of the resource addressed.

That is, a user agent, as defined by the relevant IETF and W3C
specifications, is *any software* that resolves URIs by communicating
with a Web server. As part of that communication, the MIME type of the
resource is provided.

Therefore it is not an unreasonable expectation that DITA processors,
being Web user agents (because they must be able to resolve URIs), will
be able to get the MIME types of resources on demand.

However, as Michael pointed out, not all processing contexts are
necessarily literally Web user agents or may not make MIME type
information easily available. Therefore, for practical reasons, it may
be useful to provide hints about the data type of a given target.

In addition, a MIME type may not be precise or accurate enough to enable
the desired processing in some circumstances.

Therefore, it's probably the case that we will continue to need an
attribute like "format=" but we simply need to be clear in the spec how
it relates to standard MIME-based processing, in part emphasizing that
DITA processors are, by definition, Web user agents and should reflect
that in their implementations.

To that end I could probably contribute a simple Saxon extension that
provides XSLT-level access to the MIME type of any resource that can be
resolved at processing time (Java provides basic infrastructure for
getting or guessing MIME types of resources).
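
The same idea can be sketched in a few lines of Python instead of Java,
using the standard mimetypes module, which guesses from the file name
alone and does not contact a server. The function name and the rule that
format= is consulted only when no MIME type can be determined are
assumptions made for the sketch; the actual relationship is exactly what
the spec still needs to define:

```python
import mimetypes

def effective_type(href, format_hint=None):
    """Return (data type, where it came from) for a link target.

    Assumed precedence: a MIME type that can be determined wins, and the
    format= value is used only as a fallback hint.
    """
    mime_type, _encoding = mimetypes.guess_type(href)
    if mime_type is not None:
        return (mime_type, "mime type")
    if format_hint is not None:
        return (format_hint, "format hint")
    return (None, "unknown")

# The extension ".zzzunknown" is made up so that no MIME type is found.
print(effective_type("topic-b.xml", format_hint="dita"))
print(effective_type("notes.zzzunknown", format_hint="dita"))
print(effective_type("notes.zzzunknown"))
```

A processor that wants format= to refine an imprecise MIME type, rather
than merely stand in for a missing one, would return both values and let
the caller decide.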

Finally, I'd like to say that any production system that expects to do
sophisticated link-based rendition of non-trivial data sets has to be
pretty sophisticated and must provide a good deal of infrastructure to
enable this. Therefore, while you can get a lot of mileage out of just
files on the file system and XSLT transforms, for industrial-scale use
cases you need more, so we should, I think, expect that full-scale DITA
support systems will provide mechanisms to do things like give you the
MIME type of any target you might link to, simply because it's a
requirement of a complete processing support system to do so.

As I mentioned on the phone today, I'm working on an open-source,
*demonstration* content management system that will be capable of
providing these sorts of link support features at relatively low
implementation cost. This should allow us to provide a reference DITA
support system that shows how such an all-encompassing system can work.

Note that this system I'm building is purely for demonstration purposes
and is expressly not suitable for production use in that it will neither
perform nor scale as implemented (but could be re-implemented to both
scale and perform). I'm doing this to ensure that nobody gets the idea
that I, as a systems integrator, am trying to build a product or
compete with any current or potential content management partner. Far
from it. This project is simply a way to demonstrate various fundamental
problems in compound document and hyperdocument management and how those
problems might be solved in practice, partly in the hope that content
management vendors might incorporate the useful ideas into their
products, and partly so that my current and potential customers can
fully understand the implications of what they're asking for when they
say things like "we want to do information re-use". It will have the
most restrictive GPL license.  I've submitted a SourceForge project
request for this project but haven't gotten a response yet. Once I do I
will announce it here.

One key feature of my system is a general and extensible import
framework. With this framework I will be able to quickly implement a
DITA-aware importer that will be able to take any conforming DITA-based
document and import all of the components into the repository and then
provide generic, HTTP-based access to those components, including from
standard editors and XSLT engines. It will also provide basic
"where-used" information as a basic repository service. It should also
show how a more specialized DITA-specific support system could be
implemented as another, relatively small, layer on top of the core system.

Cheers,

E.
-- 
W. Eliot Kimber
Professional Services
Innodata Isogen
9030 Research Blvd, #410
Austin, TX 78758
(512) 372-8122

eliot@innodata-isogen.com
www.innodata-isogen.com