OASIS Mailing List Archives: dita message

Subject: DITA 2.0 Stage 1 proposal - A storage mechanism for resolved reuse content

Hi all,

Please find below a DITA 2.0 stage 1 proposal: a storage mechanism for resolved reuse content.

I think the content model would be relatively straightforward.

Any difficulty would be with the DITA Open Toolkit (DITA-OT) and other DITA processors. They might need quite a bit of work!

Long term, it should make life easier for CMS, editor and translation tool vendors.

It could revolutionise DITA reuse: authors would be able to physically see it in action, in front of them, in the editor.

It should make reused content statistics more meaningful and manageable. In turn, this could justify ROI, and the move to DITA.

I acknowledge that this is probably quite ambitious.

Many thanks,


One of the core aspects of DITA is its multiple reuse mechanisms. They are a major reason why many companies chose to use DITA.

Fundamental to these reuse mechanisms is that they require resolution: not just by the processor to create an output format, but at every stage of content creation: authoring, reviewing and translation. This requires the tool to resolve the reuse mechanisms on the fly, as the topic opens.

1. This has the potential for delays, especially for large, global CMS systems with server replication. Given comparable file systems, it takes longer to open a DITA topic, despite its typically small file size, than practically any other file type.

2. On-the-fly resolution happens many more times than is necessary. It is normal practice to set up the reused content for a new product at the start of a project, and it changes little during the project. A project might require a new warning or caution, but that does not take long. So authoring time spent on reused content is, say, less than 10% of the whole content creation effort for a project. That is an entirely arbitrary figure, but you get the idea.

3. Some tools cannot resolve DITA reuse mechanisms on the fly, or perhaps the map is missing. The author, editor or translator then sees an ugly 'lump' of meaningless, raw XML content. What are they supposed to do with that? It might be a product name, but how are they to know?

4. A topic has been reused many times. A new engineer joins the team, and is not aware of how the documentation is built, and its reliance upon reuse. A simple job, one that helps him learn the product, is for him to review the documentation. He's keen, and makes a number of suggestions. Some of these suggestions impact the reuse viability of a particular topic. How is an author supposed to spot this?

5. A competent, DITA-savvy CMS probably offers a 'dependencies' feature. Even so, trying to follow dependencies first up one map tree and then down another can feel futile, like 'chasing Alice down the rabbit hole'.

6. Every time a document is built, the resolution of reused content takes place. The results of that resolution are thrown away.
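The 'lump' problem in point 3 can be illustrated with ordinary DITA markup. The element names and IDs below are just typical examples, not taken from any particular project:

```xml
<!-- What the author wrote: a product name pulled in by keyref,
     and a warning pulled in by conref. -->
<p>To start the <ph keyref="product-name"/>, press the power button.</p>
<note type="warning" conref="warnings.dita#warnings/hot-surface"/>

<!-- A tool that cannot resolve the reuse (or has no map to resolve
     against) shows exactly the raw markup above: an empty <ph> where
     the product name should be, and an empty <note>. The author,
     editor or translator has no way to know what belongs there. -->
```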

Storage Mechanism

So, what if DITA 2.0 could contain a means to capture and store the resolved reused content?

It would mean the end to on-the-fly resolution, and the end to additional file opening delays. It would be as easy, and as quick, to open a DITA topic as any other file, given comparable file systems. Tools would need to be far less DITA-savvy because they would not need to perform on-the-fly resolution. So, an end to meaningless, raw XML content.

A storage mechanism would be able to capture the resolved reused content for every reused instance of that topic. An author would be able to open the topic, without the need to define a resolution map, and they would be able to choose the resolved reused content for any of the multiple instances of that topic. Say, by choosing the title of an output build map from a drop-down list. They would be able to immediately see any potential problems related to review comments. They would be able to see how a reused feature list, say, changes and grows for each individual product, and instance of the topic. 

It would also be possible to navigate directly from a reused content source to every topic that reuses the content, and vice versa, from a reuse target back to the content source. This could be very useful for managing warnings and cautions: to reduce duplication, and when product development mitigates the hazards behind warnings. Medical companies work hard to mitigate hazards and reduce the number of warnings required.


I don't know the correct terminology, but it would use a top-level 'area' or 'block' in every topic model. That is, a new block that is parallel with <title>, <prolog>, <body>, and <related-links>.
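As a rough sketch, the new block might look like this. The element names (<resolved-reuse>, <instance>, <resolution>) and map names are purely hypothetical, invented here for illustration; nothing like them exists in any DITA specification:

```xml
<topic id="start-product">
  <title>Starting the product</title>
  <prolog><!-- ... --></prolog>
  <body>
    <p>To start the <ph keyref="product-name"/>, press the power button.</p>
  </body>
  <related-links><!-- ... --></related-links>
  <!-- Hypothetical new top-level block, written only by the processor,
       never edited by an author. One <instance> per reused instance
       of the topic, keyed by the build map that produced it. -->
  <resolved-reuse>
    <instance map="product-a.ditamap">
      <resolution keyref="product-name">Acme Widget A100</resolution>
    </instance>
    <instance map="product-b.ditamap">
      <resolution keyref="product-name">Acme Widget B200</resolution>
    </instance>
  </resolved-reuse>
</topic>
```

An editor could then offer the author a drop-down of build maps (Product A, Product B) and swap in the matching resolved text.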

An important aspect is that content would be placed in this block only by the processor. It must not be editable by an author. If an author knows that there are recent changes to a reused content source, they could initiate one or more 'resolution builds' to rebuild the resolved reused content.

Ideally, an editor or CMS would store a list of files to watch, and automatically initiate the relevant resolution builds to refresh the resolved reused content. A CMS might also do overnight resolution builds to keep content fresh.

A resolution build would not necessarily produce an actual output document, just refresh the reused content resolution.

In practice, the exact implementation would be chosen by the editor or CMS, and probably in conjunction with user preferences. Some options might be:

1. Put the resolved reused content back into the original topic file. That might tempt the author to edit it, rather than go to the content source. The CMS or editor would need to ensure that no topic files are open during the build.

2. A second parallel or 'ghost' file for each topic. This file would probably have the same title as the main topic, but would only have the resolved reused content block. No <prolog>, <body>, or <related-links>. The content in this block would include every reused instance for the topic.

3. Essentially the same as 2, but with a parallel or ghost file for every reused instance of a topic, rather than all reused instances in one file. This would have the advantage that it would be very easy to add up the reused instances and produce reuse statistics.
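Option 2 might look something like the sketch below. Again, all element and file names are hypothetical, invented for illustration:

```xml
<!-- Hypothetical 'ghost' file, stored alongside start-product.dita.
     Same title as the main topic; no <prolog>, <body> or
     <related-links>; only the resolved reuse block, covering every
     reused instance of the topic. -->
<topic id="start-product-resolved">
  <title>Starting the product</title>
  <resolved-reuse>
    <instance map="product-a.ditamap">
      <resolution keyref="product-name">Acme Widget A100</resolution>
    </instance>
    <instance map="product-b.ditamap">
      <resolution keyref="product-name">Acme Widget B200</resolution>
    </instance>
  </resolved-reuse>
</topic>
```

Under option 3, each <instance> above would instead live in its own ghost file, so counting files counts reused instances directly.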

When an editor or CMS opens a topic, it would need to read the main topic and one of the reused content resolutions. The author might be able to set a preference for a particular build map that the editor uses each time it opens a topic. It would need to be able to link the content source to the content target, and replace one with the other.

It should be possible to send a single topic for translation, without accompanying maps. In practice, an XLIFF or other translation tool would either merge the main and ghost files, or include the two or more files in the manifest. The translation tool would do the same as the editor, and also mark the reused content as non-translatable, but visible to the translator. For cost purposes, the tool would count only the actual words in the main topic, not XML 'lumps', and totally ignore the resolved reused content.
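One way a translation tool might present the merged content is with DITA's standard @translate attribute, which does exist today; the merge step itself is the hypothetical part:

```xml
<!-- Resolved reuse swapped in and marked non-translatable: visible to
     the translator for context, excluded from the word count. -->
<p>To start the <ph translate="no">Acme Widget A100</ph>,
   press the power button.</p>
```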

This might lead to a new two-stage translation workflow:

1. Translate reused content sources, and review in conjunction with the English, or first language, main content. Look for any gender and noun declension issues related to the reused content. Get approval for reused content translations.

2. Use the translated reused content source, and translate the main content.

Content Model

It would be necessary to figure out all the permutations and combinations for maps, topics, keys, scoped keys, filters and branch filters. This ought to be a closed set of combinations. For instance, a topic might be introduced by a <keydef> with @href in one map, and a <topicref> with @keys in another map. It might mean a trip into 'worst practice', e.g. 'spaghetti reuse'.

A reduced form of this content model could be used for maps.

The content model would be a hierarchy of elements:

1. The title of, and link to, a build or root map.

2a. A hierarchy of any intermediate maps: title and link.

2b. Any additional map that defines keys or filters relevant to the topic: title and link.

3. Any map that defines scoped keys or branch filters: title and link.

4a. For any reused content source topic, the title and link for any target topic.

4b. For the target topic of any reused content, the title, content and link to the source topic.
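The hierarchy above might be sketched as nested elements, roughly as follows. Every element name here is illustrative only, invented to mirror the numbered points:

```xml
<resolved-reuse>
  <!-- 1: build or root map -->
  <build-map href="product-a.ditamap" navtitle="Product A User Guide">
    <!-- 2a: hierarchy of intermediate maps -->
    <intermediate-map href="tasks.ditamap" navtitle="Tasks"/>
    <!-- 2b: maps defining keys or filters relevant to the topic -->
    <key-map href="keys.ditamap" navtitle="Product keys"/>
    <!-- 3: maps defining scoped keys or branch filters -->
    <scope-map href="branch.ditamap" navtitle="Branch filters"/>
    <!-- 4a: this topic as a reuse source: links to target topics -->
    <source-of href="maintain-product.dita" navtitle="Maintenance"/>
    <!-- 4b: this topic as a reuse target: title, content and link
         back to the source topic -->
    <target-of href="warnings.dita#warnings/hot-surface"
               navtitle="Warnings">
      <resolved-content>Hot surface. Do not touch.</resolved-content>
    </target-of>
  </build-map>
</resolved-reuse>
```

A reduced form, as noted above, could serve for maps themselves.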
