
Subject: New Proposal: Vocabulary for Capturing Publishing Process Details to Facilitate Cross-Publication XRefs
From: Eliot Kimber <ekimber@rsicms.com>
To: dita <dita@lists.oasis-open.org>
Date: Sat, 20 Oct 2012 09:37:10 -0500
In response to my original proposal 13041, facility for key-based cross
deliverable referencing, Michael asserted that you could manage
cross-deliverable links by using keydefs to define the as-delivered
locations of peer resources and that such an approach would not require any
new architectural changes or any change to existing DITA 1.2 processors.

I have worked through a scenario that uses Michael's approach and am
convinced that he is generally correct. I will post my exercise as a
separate message so as not to overburden this thread.

-----
Aside

The only place where Michael's approach falls down is in knowing, for a
given peer resource reference, what root map the referencing document
intended the peer to be published in terms of. This is because a peer
reference today can only be a direct URI reference to the resource:

<keydef keys="pubB-topic1"
  scope="peer"
  href="../../common/topics/topic-1.dita"
/>

The link establishes the peer relationship to the topic, but it doesn't
establish the publication the referencing map wants that peer to be
published in terms of. It is this aspect of the problem that my proposal
13041 addresses, and I will discuss the issue separately there. But this
issue does not otherwise affect the soundness and utility of Michael's
approach.

End Aside
---------

Given that sets of keydefs can be used to define peer reference as-published
locations, it follows that we should standardize, or at least define clear
conventions for, capturing the information needed to make these keydef sets
work for this processing use, just as we have with DITAVAL and SubjectScheme
for filtering.

I don't think it will be that hard to define an appropriate vocabulary and
we can start testing such a vocabulary with the Open Toolkit and, hopefully,
other DITA processors, as soon as we have something drafted.

Thus I would like to propose that we define for DITA 1.3 new vocabulary that
supports the use of keydef sets in processing that results in deliverables
with resolved peer cross references.

The rest of this message outlines the general requirements and a suggested
approach for such a vocabulary.

NOTE: This proposal requires that all peer resources be bound to keys so
that processors can then use key sets to interchange processing details
related to peer resource resolution. This requirement is not inherent in my
original 13041 proposal but I do not object to the requirement.

-----------
Terminology

First, some terminology to help keep the discussion clear (because it can
get a little twisty):

- "Publication" -- the thing to be delivered as represented by a root DITA
map.

- "Deliverable instance" -- The result of processing a Publication to
produce an output reflecting a unique set of input parameters including the
deliverable data type (HTML, PDF, EPUB, etc.), the filtering specs (DITAVAL
files), the delivered location (e.g., URL of where the deliverable will be
published), and any other process-specific parameters that would result in a
different deliverable (in particular, parameters that determine processor
behavior where the DITA spec allows different behaviors, such as filtering
before or after conref resolution).

- "Publication specification" -- The set of parameters used to produce a
deliverable instance.

- "as-referenced keydef set" -- A set of key definitions that reflect the
set of peer resources referenced by a given publication for a given
processing specification. For example, if Pub A references peer topic B1
then the as-referenced keydef set would include the keydef Pub A used to
point to topic B1 as a peer. These keydef sets are used in the processing of
the referenced peer publications so they know which of their resources they
need to generate as-published keydefs for.

- "as-published keydef set" -- A set of key definitions reflecting the key
names as used by a specific publication and the locations of the referenced
resources as published in a specific deliverable instance. These keydef sets
are used in the processing of the referencing publication to produce the
final deliverable with correct peer resource references.

----------------
Processing Model

The general process for producing a deliverable from a given publication
with resolved as-published peer references using keydef sets to
communicate among processors is as follows:

1. Process the publication ("Pub A") to produce its as-referenced keydef
sets for each of the peer publications it links to. Each as-referenced
keydef set reflects the publication specification used to produce it. This
is pass 1. [NOTE: DITA 1.2 provides no defined way to specify the root map
(publication) a given peer reference applies to, so without proposal 13041,
there would need to be some processor-specific way to specify for each peer
resource what publication it applies to. At a minimum you would need
metadata on each peer resource's keydef that specifies the publication.]

2. Process each of the referenced peer publications ("Pub B", etc.),
specifying the as-referenced keyset for each publication as a parameter to
the process, to produce the as-delivered key set for each publication. This
process can be repeated for each of the possible deliverables each peer
publication is or may be published to. This results in a set of keydef sets,
one for each publication/publication specification pair.

3. Process the publication from step 1 (Pub A), replacing the original
keydefs for the peer resources with the appropriate keydefs generated in
step 2, to produce the final deliverable for the publication. Note that this
implies the possibility of manual selection of specific keydefs for specific
peer resources, such as choosing the PDF version over the HTML version for a
specific resource. This provides complete control over which delivered
version of a given peer resource a given link resolves to.

This is necessarily a two-pass (or 1 1/2 pass) process because you can't
finish the processing of the publication in Step 1 until you've both
determined the peer resources it points to and the delivered locations of
those resources. [I say "1 1/2" pass because the initial processing in Step
1 need only be that necessary to determine the peer references, it doesn't
have to actually produce any other output.]
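
As a rough illustration of the pass-1 step only, here is a minimal Python
sketch that pulls the peer-scoped keydefs out of a map. The map content and
the function name are hypothetical. The sketch matches on the literal element
name "keydef" rather than on @class, does not follow submaps, and ignores
attribute cascading, all of which a real DITA processor would have to handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical root map for Pub A; only the peer-scoped keydef matters here.
PUB_A_MAP = """
<map>
  <title>Publication A</title>
  <keydef keys="pubB-topic1" scope="peer"
          href="../../pubB/topics/topic1.dita"/>
  <keydef keys="local-topic" href="topics/local.dita"/>
  <topicref href="topics/intro.dita"/>
</map>
"""


def collect_peer_keydefs(map_xml):
    """Return (keys, href) pairs for every keydef with scope="peer"."""
    root = ET.fromstring(map_xml)
    return [
        (kd.get("keys"), kd.get("href"))
        for kd in root.iter("keydef")
        if kd.get("scope") == "peer"
    ]


peer_keydefs = collect_peer_keydefs(PUB_A_MAP)
print(peer_keydefs)
```

The resulting list is the raw material for an as-referenced keydef set; no
other output needs to be produced in this pass.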

--------------------------
Publication Specifications

Given the above definitions, it should be clear that a given deliverable
instance is identified by its publication/publication specification pair,
meaning that, for a given processor, a given publication processed with a
given publication specification will always produce the same deliverable
instance.

It also means that two deliverable instances for a given publication are
distinguished by their publication specifications.

This is important because you need to have a well-defined and reliable way
to communicate *which* deliverable you want when configuring the
as-published result of a given peer cross reference.

By formally defining the notion of "publication specification" it follows
that publication specifications are objects, which means they have identity,
which means they must have identifiers, which means we can use their
identifiers to clearly and concisely talk about them. The only open question
is what form the identifier takes and what space of names it exists in--this
is likely to be processor specific.

In a keydef set that defines the as-published locations of topics from a
given publication, you can specify the publication specification ID to which
those locations apply. This allows observers to clearly distinguish one such
set of keys from other such sets of keys.

For example, say you want to process Pub A, which has references to topics
in peer publication Pub B. As input to the pass-2 processing for Pub A you
can specify the process specification for the Pub B deliverable you want
your peer links to resolve to. In the case where you don't need to
hand-select specific keydefs (e.g., you use the "like links to like"
business rule), then a processor can automatically select the appropriate
as-published keydef set from among those available for a given peer
publication. Or, if you do need to hand-select specific keydefs, you can use
the process specification ID to find the appropriate keydef set.
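
A minimal Python sketch of the lookup, assuming the available as-published
keydef sets are indexed by publication/specification pair. The index
structure, the file locations, the function name, and the "procspec-pdf-osx"
ID are all hypothetical. The form of a specification ID is still an open
question and may be processor specific.

```python
# Hypothetical index of available as-published keydef sets, keyed by
# (peer publication, publication specification ID).
AVAILABLE_SETS = {
    ("pubB.ditamap", "procspec-html-osx"): "keysets/pubB-html-osx.ditamap",
    ("pubB.ditamap", "procspec-pdf-osx"): "keysets/pubB-pdf-osx.ditamap",
}


def select_keydef_set(peer_pub, spec_id, available=AVAILABLE_SETS):
    """Return the keydef-set location for a publication/specification pair."""
    try:
        return available[(peer_pub, spec_id)]
    except KeyError:
        raise LookupError(
            f"no as-published keydef set for {peer_pub!r} with {spec_id!r}"
        ) from None


# "Like links to like": reuse the specification ID that Pub A itself is
# being processed with (assumes both publications share specification IDs).
pub_a_spec_id = "procspec-html-osx"
chosen = select_keydef_set("pubB.ditamap", pub_a_spec_id)
print(chosen)
```

Hand selection is the same lookup with an explicitly chosen specification ID
in place of the one inherited from Pub A.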

To support this process with standard markup, we need three things:

1. Markup for process specifications.
2. Markup for as-referenced keydef sets
3. Markup for as-delivered keydef sets

----------------------------
Process Specification Markup

Assuming we want to use DITA-based XML for defining process specifications,
there are the following possibilities:

1. Define a new topic type that captures the details. The topic itself
provides identity and its title can serve as a display label for the
specification, e.g. "PDF for Expert Users on OSX". The topic could be
specified as a parameter to processors or referenced as a resource-only
resource from maps (in the case where you have a map that is intended for
producing exactly one deliverable).

2. Define a new map type that captures the details as metadata within
<topicmeta>. As for the topic approach, the map title can provide a display
label for the specification. Also as for topics, the map could be referenced
as a resource-only resource from maps that are used for exactly one
deliverable.

3. Define a new topicref type that captures the details as metadata within
<topicmeta>. The navigation title for the topicref can provide a display
label for the specification. The topicref could optionally point to a topic
that serves as additional documentation for the process specification.

In thinking about it now, I think I like the topicref approach best, because
it provides a natural way to hold multiple process specifications in a
single XML document, through keys, provides a way to specify a unique name
for each specification separate from the storage location of the
specification, and allows for linking to additional documentation when
necessary. In the case where you want each process specification to be a
separate XML file, you just have a map with one topicref in it, which is
minimal extra overhead.

I think the map option (option (2) above) is a non-starter because there is
no way to hold multiple maps in a single DITA-conforming XML document.

I think the topic option (option (1) above) is less compelling because it
would completely separate the process specification from maps but all of
this markup processing is otherwise entirely in the map domain.

So I think using topicrefs makes the most sense. With topicrefs you can
easily have a single map document that collects multiple process
specifications together. By requiring keynames on the process specification
topicrefs you provide a natural DITA-defined identifier. In the case where
you want to manage individual process specifications as standalone
documents, the overhead is simply the <map> wrapper element, e.g.:

<map>
  <process-specification>
     ...
  </process-specification>
</map>

The map can provide a title to give a display label to the process
specification set, which is handy.

I'm not going to try to define the details of the markup for process
specifications here--that would be an exercise for the stage 2 proposal. I
think the general requirements are clear, as outlined above.

-------------------------
As-Referenced Keydef Sets

An as-referenced keydef set would be a map that contains key definitions for
each peer resource referenced by a given publication in the context of a
given processing specification.

Thus, in addition to simply holding the keydefs, it must capture the
following information:

- The root map that the keydefs came from
- The processing specification used to produce the keydefs

My initial proposal would be to define a new map type,
"as-referenced-keydef-set", with one new topicref type, <publication-map>,
and one new <data> type, <processing-specification-id>:

<as-referenced-keydef-set>
  <title>As-Referenced Keydefs for Publication PubA.ditamap</title>
  <as-referenced-keydef-set-metadata>
    <processing-specification-id>procspec-one</processing-specification-id>
  </as-referenced-keydef-set-metadata>
  <publication-map href="../../pubA.ditamap" format="ditamap"/>
  <keydefs>
   <keydef keys="pubB-topic1"
      href="../../pubB/topics/topic1.dita"
      scope="peer"
   />
  </keydefs>
</as-referenced-keydef-set>

Where the value of the <processing-specification-id> element is whatever we
decide process specification IDs are (which may be processor specific).
Alternatively, it could be a direct reference to the process specification
using normal DITA addressing (e.g., a pointer to a topicref within a map
document).

Note that this keydef set is not intended to be included in any root map--it
is a standalone data set used as input to the processing of the referenced
resource in the context of its publication root map (remembering that we
currently have no defined way to know, for a given peer resource, what root
publication map it is used in the context of). The use of map markup here is
fundamentally just a convenience, but as the ultimate result of all this
processing will be a new set of keydefs, it makes sense to use keydef markup
for this intermediate data set as well--it keeps things clear to authors and
enables use of existing map and key processing infrastructure.
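
To show how mechanically such a document could be generated, here is a
minimal Python sketch that builds an as-referenced keydef set from a list of
(keys, href) pairs. The element names are the ones proposed in this section
and are not part of any existing DITA vocabulary. The function name is
hypothetical, and no DOCTYPE or @class attributes are emitted.

```python
import xml.etree.ElementTree as ET


def build_as_referenced_set(pub_map_href, spec_id, peer_keydefs):
    """Build an as-referenced keydef set element from (keys, href) pairs."""
    root = ET.Element("as-referenced-keydef-set")
    title = ET.SubElement(root, "title")
    title.text = f"As-Referenced Keydefs for Publication {pub_map_href}"
    meta = ET.SubElement(root, "as-referenced-keydef-set-metadata")
    ET.SubElement(meta, "processing-specification-id").text = spec_id
    ET.SubElement(root, "publication-map", href=pub_map_href, format="ditamap")
    keydefs = ET.SubElement(root, "keydefs")
    for keys, href in peer_keydefs:
        ET.SubElement(keydefs, "keydef", keys=keys, href=href, scope="peer")
    return root


doc = build_as_referenced_set(
    "../../pubA.ditamap",
    "procspec-one",
    [("pubB-topic1", "../../pubB/topics/topic1.dita")],
)
print(ET.tostring(doc, encoding="unicode"))
```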

------------------------
As-Delivered Keydef Sets

The as-delivered keydef set is an otherwise normal map containing keydefs
intended to be included in the map for a given publication. At the map
level, the only additional details it needs to include are the peer
publication map and processing specification (that is, deliverable instance)
it reflects.

For each topicref, it needs to include the navigation title for the target
resource and the title of the publication. This information then enables
generation of cross-publication xrefs in the output without additional
processing, e.g., "See Topic 1 in Publication B".

As for as-referenced keydef sets, my initial proposal is a new map type,
"as-delivered-keydef-set", with the same <publication-map> and
<processing-specification-id> elements. Its content would be normal keydefs
with the addition of a new <data> specialization, <pubtitle>, that captures
the title of the publication the peer resource is in, e.g.:

<as-delivered-keydef-set>
  <title>As-Delivered Keydefs for Publication PubB.ditamap, HTML for
OSX</title>
  <as-referenced-keydef-set-metadata>
    <processing-specification-id>
     procspec-html-osx
    </processing-specification-id>
  </as-referenced-keydef-set-metadata>
  <publication-map href="../../pubB.ditamap" format="ditamap"/>
  <keydefs>
   <keydef keys="pubB-topic1"
      href="../../pubB/topics/topic1.html"
      scope="peer"
   >
    <topicmeta>
      <navtitle>Topic 1</navtitle>
      <metadata>
        <pubtitle>Publication B</pubtitle>
      </metadata>
    </topicmeta>
   </keydef>
  </keydefs>
</as-delivered-keydef-set>

Note that this document is very similar to the as-referenced keydef set, but
reflects the peer publication, Pub B, not the referencing publication.

When included in Pub A's root map before any other keydefs, the keydefs in
this map will take precedence and will therefore determine the address to
use in the published deliverable for Pub A.
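
The precedence behavior can be illustrated with a small Python sketch. It
flattens the key space to a single ordered list and lets the first definition
of each key win. Real DITA 1.2 precedence also takes the depth of the map
hierarchy into account, so this is a simplification. The function name and
sample data are hypothetical.

```python
def effective_keys(keydefs):
    """Resolve (key, href) pairs so the first definition of each key wins."""
    resolved = {}
    for key, href in keydefs:
        # setdefault keeps the existing binding, so later duplicates are ignored.
        resolved.setdefault(key, href)
    return resolved


pub_a_keydefs = [
    # As-delivered keydef set, included first in Pub A's root map.
    ("pubB-topic1", "../../pubB/topics/topic1.html"),
    # Original peer keydef authored in Pub A.
    ("pubB-topic1", "../../pubB/topics/topic1.dita"),
]

resolved = effective_keys(pub_a_keydefs)
print(resolved["pubB-topic1"])
```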

Note that the value of the @href attribute points to the resource *as
delivered*, meaning that the only change to it made by the delivery
processor might be to adjust the relative pathing, but to otherwise leave it
alone (and if it is an absolute URI, always leave it alone).
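
A minimal Python sketch of that rule, assuming relative hrefs are resolved
against the directory holding the keydef set and re-expressed relative to the
output directory of the referencing deliverable. The function name and
directory layout are hypothetical, and fragment identifiers are not treated
specially.

```python
import posixpath
from urllib.parse import urlsplit


def adjust_href(href, keyset_dir, output_dir):
    """Rebase a relative href from keyset_dir onto output_dir.

    Absolute URIs (anything with a scheme) and server-absolute paths are
    returned unchanged.
    """
    if urlsplit(href).scheme or href.startswith("/"):
        return href
    target = posixpath.normpath(posixpath.join(keyset_dir, href))
    return posixpath.relpath(target, start=output_dir)


rebased = adjust_href(
    "../../pubB/topics/topic1.html", "out/keysets/pubB", "out/pubA"
)
untouched = adjust_href(
    "http://example.com/pubB/topic1.html", "out/keysets/pubB", "out/pubA"
)
print(rebased)
print(untouched)
```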

-------
Summary

With some relatively simple conventions for generating and manipulating
keydefs and capturing definitions of processing specifications, we can
enable reliable and practical generation of peer-to-peer references in
publications as delivered in a way that does not require any magic or
processor-specific stuff or changes to current key-based processing (meaning
the mechanism can work with existing DITA 1.2 processors). The approach
supports both completely manual manipulation of the keydef sets as well as
enabling automatic manipulation of them. It does not require any
architectural change, only the addition of new vocabulary based on existing
types.



--
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/