Subject: Re: [dita] Stage 2 proposal: chunking redesign

From: "Robert D Anderson" <robander@us.ibm.com>
To: Chris Nitchie <chris.nitchie@oberontech.com>
Date: Mon, 2 Apr 2018 10:43:54 -0500

I think this is largely correct.

For the (probably) easier question, about how nested topicref elements are reflected in the hierarchy - I struggled with this in the proposal, and really only picked my version because overall I was trying to simplify and that seemed the easiest to explain / understand. But I felt an equal pull towards Chris's suggestion, so I'm happy to go with that. I think we clearly need to state what the expected result is, but I intentionally used the SHOULD term so that there was an outlet if somebody had a solid reason to end up with a different hierarchy.

With Chris's solution, we also need to address how to handle documents that use the <dita> container; for example, if composite.dita has a root <dita> element with 5 child topics (A, B, C, D, and E), and any/all of those have their own children, where would "difficultChild.dita" end up?
<topicref href="">
  <topicref href=""/>
</topicref>

My assumption would be: as a direct child of the final "root" topic in the composite file -- assuming E is the last child topic of <dita>, then difficultChild would end up as the last child topic of E.

For the issue of keys ... yeah. Part of me wants to say "If you need to independently address specific instances of a nested topic that's going to get chunked, that is incompatible with relationship tables." I don't really think that's allowed though.

I do worry that any rules for automating keys are a big potential point of confusion, and that scares me. Immediately coming to mind:
* What is the precedence of a key that is automatically constructed, but then becomes a duplicate of one already in the map?
* If keys are automated based on IDs of the actual nested topics, then you cannot actually know your "key space" until after you've read the topics, which seems wrong

But like Chris, I don't have any good answer for this.

Regards,
Robert D. Anderson
DITA-OT lead and Co-editor DITA 1.3 specification,
Digital Services Group

E-mail: robander@us.ibm.com

11501 Burnet Rd, Austin, TX 78758-3400, USA

From: Chris Nitchie <chris.nitchie@oberontech.com>
To: Robert D Anderson <robander@us.ibm.com>, DITA Technical Committee <dita@lists.oasis-open.org>
Date: 04/02/2018 08:53 AM
Subject: Re: [dita] Stage 2 proposal: chunking redesign
Sent by: <dita@lists.oasis-open.org>

I think ignoring chunking inside reltables is fine, and that content should be identified by specific topic fragment identifiers, as I think you're describing. But I think there's something of a rabbit hole here I'm going to try to peek into.

<beleaguered-sigh>Keys.</beleaguered-sigh>

As I think we’ve discussed on the TC before, identifying entries in a relationship table by @href is somewhat problematic, as a given topic document may exist in multiple locations of the resolved map tree, with different scoped key bindings and parent/child/sibling relationships. As such, probably the best way to define relationship participants in a reltable is via key, thus referencing a specific instance of the desired topic.

In the case of chunk="split", the split-out chunks won't have keys assigned to them. This is problematic for reltables but also for other garden-variety keyref-based linking. Unfortunately, I don't really see a great way to assign them. My best suggestion would involve computing key names using the root ID of the topics in the referenced document combined with either the @chunk-bearing topicref's key(s) (if any) or, less desirably, something in the <topicmeta> of that topicref.
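
A minimal sketch of that key-naming idea, only to make the collision risk concrete. The function name computed_keys, the "_" separator, and the "conflict" / "new" labels are all assumptions made for illustration; nothing in the proposal defines them.

```python
def computed_keys(topicref_keys, topic_ids, existing_keys, sep="_"):
    """Hypothetical scheme: join each key on the @chunk-bearing topicref
    with the ID of each topic split out of the referenced document."""
    result = {}
    for key in topicref_keys:
        for topic_id in topic_ids:
            candidate = f"{key}{sep}{topic_id}"
            # A computed key that matches an authored key is flagged, not bound;
            # which one should win is exactly the open precedence question.
            result[candidate] = "conflict" if candidate in existing_keys else "new"
    return result
```

Note that the topic IDs have to be read from the referenced document before any of these names exist, so the key space cannot be known from the map alone.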

On a separate note, this proposal states the following:

In all cases, when a DITA document is split into multiple documents, the hierarchy of the topics in that document must be preserved in the resulting <topicref> hierarchy that references each generated document. Any nested <topicref> elements within the original <topicref> SHOULD be treated as if they are nested within the final <topicref> from the chunk result.

I think this is saying that if a map references a compound topic hierarchy such that the final topic is nested several levels deep, any nested topicrefs should be placed beneath that deeply-nested child. I think I’d rather they be placed as siblings of the first depth level, which is to say, as the last immediate child of the chunking topicref.

Example:

Map:
<topicref href="" chunk="split">
  <topicref href=""/>
</topicref>

Compound.dita:
<topic id="c1">
  <title>Topic 1</title>
  <topic id="c1.1">
    <title>Topic 1.1</title>
  </topic>
  <topic id="c1.2">
    <title>Topic 1.2</title>
    <topic id="c1.2.1">
      <title>Topic 1.2.1</title>
    </topic>
  </topic>
</topic>

The resulting topicref hierarchy, according to this proposal, would be thus:

<topicref href="">
  <topicref href=""/>
  <topicref href="">
    <topicref href="">
      <topicref href=""/>
    </topicref>
  </topicref>
</topicref>

Whereas I’m arguing for:

<topicref href="">
  <topicref href=""/>
  <topicref href="">
    <topicref href=""/>
  </topicref>
  <topicref href=""/>
</topicref>
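
The two placements can be compared with a small sketch. It mirrors the topic nesting of Compound.dita as a tree and then attaches the map's nested topicref under one of two policies. The policy names "final" and "chunk-root", the helper names, and the "nested-child" label are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

COMPOUND = """
<topic id="c1"><title>Topic 1</title>
  <topic id="c1.1"><title>Topic 1.1</title></topic>
  <topic id="c1.2"><title>Topic 1.2</title>
    <topic id="c1.2.1"><title>Topic 1.2.1</title></topic>
  </topic>
</topic>
"""

def split_tree(topic):
    """Mirror the topic nesting as nested [id, children] nodes."""
    return [topic.get("id"), [split_tree(t) for t in topic.findall("topic")]]

def last_descendant(node):
    """Follow the last child at each level to reach the final topic in document order."""
    while node[1]:
        node = node[1][-1]
    return node

def place_nested(root, nested_ids, policy):
    """Attach the map's nested topicrefs under the chosen policy."""
    # "final": beneath the last split-out topic (the proposal's wording).
    # "chunk-root": as the last immediate children of the chunking topicref.
    target = last_descendant(root) if policy == "final" else root
    target[1].extend([nid, []] for nid in nested_ids)
    return root

def outline(node, depth=0):
    """Render the tree as indented lines for comparison."""
    lines = ["  " * depth + node[0]]
    for child in node[1]:
        lines.extend(outline(child, depth + 1))
    return lines

for policy in ("final", "chunk-root"):
    tree = split_tree(ET.fromstring(COMPOUND))
    print(policy, outline(place_nested(tree, ["nested-child"], policy)))
```

Under "final" the nested reference lands four levels deep beneath c1.2.1; under "chunk-root" it stays a sibling of c1.1 and c1.2.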

Chris

From: <dita@lists.oasis-open.org> on behalf of Robert D Anderson <robander@us.ibm.com>
Date: Monday, March 26, 2018 at 10:27 AM
To: DITA Technical Committee <dita@lists.oasis-open.org>
Subject: [dita] Stage 2 proposal: chunking redesign

Hi all,

This one changed a bit more than expected from my original vision at stage 1, thanks to feedback from the initial TC discussion. Based on feedback from Stan's review of my first stage 2 draft, I've also changed the proposed chunk token for combining documents from my original idea ("merge") to one more closely aligned with the idea of combining documents (that is, "combine").

I expect that this one will probably result in a fair bit of discussion and possibly more changes. Looking forward to the discussion...

DITA 2.0 proposed feature #105: Redesign chunking

Simplify how the @chunk attribute is defined to 1) make it easier to use, and 2) make implementation easier and more reliable.

Date and version information
Include the following information:
Date that this feature proposal was completed

14 March 2018

Champion of the proposal

Robert D Anderson

Links to any previous versions of the proposal

Stage 1 proposal 28 Feb 2018:

https://lists.oasis-open.org/archives/dita/201802/msg00106.html

Links to minutes where this proposal was discussed at stage 1 and moved to stage 2

https://www.oasis-open.org/committees/download.php/62726/minutes20180313.txt, with Eliot and Stan as reviewers

Links to e-mail discussion that resulted in new versions of the proposal

xxx

Link to the GitHub issue

https://github.com/oasis-tcs/dita/issues/105

Original requirement or use case
Redesign the chunk attribute for the following reasons / benefits:

Make it easier to use (rename useful tokens to intuitive values)
Make it easier to implement (discard operations that are not useful or are edge cases)

Use cases

Make it easier to use: the current tokens are not obvious. To someone who is not already very familiar with the values, the most common values (chunk="to-content" to combine content and chunk="by-topic" to split content) are not intuitive and require frequent use of DITA reference documentation. It's also not clear from the value-names whether those values would apply to the referenced document, child documents in a map hierarchy, or both. Replacing these two tokens
Make it easier to use and implement: the current attribute tries to do too many things at one time, resulting in complex and difficult-to-implement attribute values. The attribute values in DITA 1.3 attempt to do three things at once (select how much of a single document should be published, decide how to combine multiple documents, and decide how to render those combinations). To use these effectively an author must use multiple tokens in the single attribute. All 7 original tokens for these functions are non-obvious, making it very difficult to know which tokens, and which combinations of them, are necessary. Most of the additional behaviors are rarely (if ever) used, resulting in little or no benefit from the additional values.
Make the spec (and implementations) easier to maintain: the original 7 values were defined in a topic on chunking in DITA 1.1. The values were not rigorously explained, and there was no explanation of how they interacted. Later versions of the specification attempted to clarify some of the missing information, but this topic has been very difficult to work with given the need to support all interpretations of what came in with the first version. In addition, implementations have often been unclear about how to handle combinations; using DITA-OT as an example, the need to handle many (often nonsensical) possible combinations has resulted in very complex, error-prone, and hard-to-maintain code.

New terminology

N/A

Proposed solution

The overall goal with this solution is to preserve (mostly intact) the two most useful existing cases for chunking.

Important:

The chunking function, as with features like @conref, is a DITA-defined operation related to processing DITA documents. As such, the specification can only declare the before and after state of all DITA documents that implement the feature, in the context of processing the documents for some other purpose. For example, a DITA document many.dita might be chunked into many topic documents during rendering, but (again like @conref) the before/after state still deals with the DITA content. Any examples that make use of published HTML file names are purely for illustration / ease of understanding.

Because the chunking operation is defined in terms of processing, the values below are not meant as tool operations on the source, such as "refactor my source to reflect these new chunks". The result of evaluating @chunk is no longer a source file, and does not need to exist as an actual file (it may be an object in memory somewhere).

This entire function is intended for situations where splitting or combining content is relevant, and where authors need control over how that happens. In nearly all cases, chunking will be irrelevant for monolithic publishing formats like PDF or EPUB. Conversely, published HTML is often multi-file and so typically makes use of chunking. However, neither of these is always the case: local style may dictate that PDFs are split at some level, or that HTML is always generated as a single file. As such, we need to be careful that the specification allows @chunk to be ignored when needed. This also means that the specification itself cannot know in advance when this is the case or for what formats this is the case; the best we can do is give examples of common cases.

These are the two operations people already think of or look for when they ask about chunking: the ability to publish many documents as if they were one, and the ability to publish one document as if it was many. To that end, the proposed solution is:

Remove all of the current @chunk token values (one value, to-navigation, is already deprecated).

Define one new value, combine, to handle the most common scenario, combining multiple DITA documents from a map into one while preserving the overall hierarchy of the map.

When specified on a map, it means that all documents referenced by the map should be combined into one DITA document.
When specified on a branch of a map, it means that all documents referenced within that branch should be combined into one. This is true regardless of whether the element that specifies @chunk refers to a topic or specifies a heading. In cases such as <topicgroup> where a grouping element specifies chunk="combine", the result is likely to be a single DITA document with a <dita> root element containing peer topics.
When chunk="combine" is specified on a reference to a map, it indicates that all documents within the scope of the referenced map should be combined into one DITA document.
Once chunk="combine" is specified on a map, branch, or map reference, all documents in scope are combined into a single resource. Any additional @chunk attributes on elements within the hierarchy are ignored. (This is based on a response to my original stage 1 proposal.)

Define one new value, split, to handle the second most common scenario of splitting one DITA document into many.

When specified on a <topicref>, it indicates that all topics within the referenced document should be split into multiple documents. For example, in a context where each individual DITA document is published as a single HTML file, specifying chunk="split" on a reference to a document that contains five topics will result in 5 documents + 5 output files.
When specified on an element such as <topicgroup> that does not refer to a topic or have content that is treated as a topic, the value has no meaning.
In each of the above cases (chunk="split" specified on a <topicref> rather than on a map), there is no cascading for the @chunk value; if contained topic references do not specify any @chunk attributes, they will use whatever default chunking style is in operation for the rest of the map.
When specified on the root map, it indicates that chunk="split" is the default operation for all documents in the map, outside the context of relationship tables. The split value is used until / unless a "combine" value is encountered, in which case "combine" takes over.
When specified on a submap, it indicates that chunk="split" is the default operation for all documents within the scope of that map, outside the context of relationship tables, until / unless a "combine" value is encountered, in which case "combine" takes over.
I would like feedback on the relationship table exception in the previous two items. I don't think it makes sense for documents to be split inside of a relationship table; doing so would result in far more links than expected. I think that if you do want links between specific topics, those can and should be specified in the source relationship table. Basically, what I'm going for here is "do what people would expect", while also trying not to overcomplicate things.
In all cases, when a DITA document is split into multiple documents, the hierarchy of the topics in that document must be preserved in the resulting <topicref> hierarchy that references each generated document. Any nested <topicref> elements within the original <topicref> SHOULD be treated as if they are nested within the final <topicref> from the chunk result.
In all cases, when a DITA document is split into multiple DITA documents, file names are up to the implementation. (We could suggest that file names be taken from topic IDs, but implementations must be free to choose naming schemes that make sense in their context, and would regardless have to handle conflicting IDs.)
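
As one illustration of the naming point above, here is a sketch of an ID-based scheme that also copes with conflicting IDs. The function name result_names, the numeric "-2" suffix, and the ".dita" extension are assumptions; an implementation remains free to pick any scheme.

```python
def result_names(topic_ids, taken=(), ext=".dita"):
    """Hypothetical naming: use the topic ID, adding a numeric suffix on conflict."""
    used = set(taken)  # names already claimed by other result documents
    names = []
    for tid in topic_ids:
        candidate = f"{tid}{ext}"
        n = 2
        while candidate in used:
            candidate = f"{tid}-{n}{ext}"
            n += 1
        used.add(candidate)
        names.append(candidate)
    return names
```

For example, splitting topics with IDs a, b, and a again, when b.dita already exists elsewhere in the result, would give a.dita, b-2.dita, and a-2.dita under this scheme.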

When links exist to a topic that is chunked, applications will need to handle the link so that it resolves to the new combined or split context. If a chunking operation results in multiple instances of a result topic (either chunked separately, or some chunked and some not), applications may determine which result topic to target with the link.

This attribute should still be defined as CDATA, which would allow applications to define additional tokens, although I expect those will be rare. One potential advantage to this approach is that DITA 1.x tokens would still remain valid according to the parser (but ignored by 2.0 processors). I propose that we avoid some of the DITA 1.x confusion by stating that the attribute can only contain a single token (note this would mean some potential DITA 1.x values are no longer valid).

All remaining behaviors associated with DITA 1.x chunking are no longer supported by this attribute. The original tokens declared several unrelated behaviors using a single attribute. I suggest that if any of those other behaviors are still required, alternate attributes be defined to handle them. I do not intend to define those attributes as part of this proposal. That work should only be done if somebody has a strong need for the attributes.

Benefits
Who will benefit from this feature?

Authors wishing to combine or split documents

Those trying to implement chunking in a processor

Maintainers of the DITA specification and of DITA tools who can now provide a clear explanation of the function

What is the expected benefit?

Chunking is easier to use

Chunking is easier to implement

Improved documentation (in the spec and elsewhere)

DITA is simplified by making the feature more intuitive and by removing features that are not used and that make the simple case difficult

How many people probably will make use of this feature?

Many, based on my own experience and based on the number of open defect reports against DITA-OT chunking

How much of a positive impact is expected for the users who will make use of the feature?

Significant improvement over the current feature

Technical requirements
Renaming or refactoring elements and attributes

Renaming or refactoring an attribute

The current attribute values are not defined in the grammar file, so the grammar definition does not change.

In the specification, the current definition for all 6 valid values and 1 deprecated value will be removed. They will be replaced with the two values "combine" and "split".

This applies to all uses of the @chunk attribute; no elements will get the attribute that did not have it before, and no elements that had the attribute will have it removed.

Processing impact

The chunk="combine" attribute value will be equivalent to the current chunk="to-content" value when no other chunk tokens are present, so implementations will need to adjust for the new value.
The chunk="split" attribute value will be equivalent to the current chunk="by-topic" value when no other chunk tokens are present, so implementations will need to adjust for the new value.
Removing other tokens, removing the ability to combine unrelated tokens, and defining these new values in a way that is clear & simple will all allow applications to remove the many conditions required to support old values.
The feature may have an impact on how result documents are named, but that should generally be left up to processors.

Overall usability

The chunking feature today is hard to use and hard to implement. This should address both concerns, resulting in a much more usable experience in all aspects of DITA chunking.

Backwards compatibility
Was this change previously announced in an earlier version of DITA?

No, although I have personally described this in public venues as one feature sure to be redesigned in DITA 2.0.

Changing the meaning of an element or attribute in a way that would disallow existing usage?

Yes; for the most common uses (possibly for the only real-world uses), the migration path is clear.

Migration plan
Documents

The easiest path is likely to use search/replace across DITA maps, and update chunking tokens to use the new value.
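
A rough sketch of such a search/replace, assuming the maps are available as text. It rewrites only @chunk values that hold a single token, since the equivalence with the new values is stated only for the case where no other chunk tokens are present. The names TOKEN_MAP and migrate_chunk are made up for this example, and being regex-based it also touches matching text inside comments or CDATA.

```python
import re

# Old single-token values and their DITA 2.0 equivalents per this proposal.
TOKEN_MAP = {"to-content": "combine", "by-topic": "split"}

# Matches chunk="..." or chunk='...' and captures the quote and the value.
CHUNK_ATTR = re.compile(r'''(\bchunk\s*=\s*)(["'])([^"']*)\2''')

def migrate_chunk(text):
    """Replace single-token @chunk values; leave multi-token values for review."""
    def swap(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        tokens = value.split()
        if len(tokens) == 1 and tokens[0] in TOKEN_MAP:
            return f"{prefix}{quote}{TOKEN_MAP[tokens[0]]}{quote}"
        return match.group(0)  # unchanged: needs a human decision
    return CHUNK_ATTR.sub(swap, text)
```

Values that combine several DITA 1.x tokens are left untouched on purpose, because the proposal does not define a mechanical mapping for them.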

Processors

Processors will need to be updated to handle the new tokens. As a first approach they could simply treat the new values the same way as older equivalent values, but I would expect that over time many tools will want to replace older chunking processes with new ones.

Might any existing specialization or constraint modules need to be migrated?

Unlikely, although possible in theory. The only case where this could happen is if a module was designed to explicitly enumerate values for @chunk. In that case, the same modules would need to allow for the new tokens.

Costs
Maintainers of the grammar files

N/A

Editors of the DITA specification

How many new topics will be required? 0
How many existing topics will need to be edited? Two topics will need significant editing (Using the @chunk attribute and chunking examples). We may wish to revise the current definition of @chunk in the attribute details topic, although the current definition actually works better for the new design than for the old.
Will the feature require substantial changes to the information architecture of the DITA specification? No.
If there is new terminology, is it likely to conflict with any usage of those terms in the existing specification? N/A

Vendors of tools

Tools that implement chunking can take a quick approach (interpret the new values exactly the same as old ones), which should have minimal cost. Alternatively, they may wish to rewrite the chunking process, which will have a larger cost (hard to specify exactly due to widely different tool scenarios).

DITA community-at-large

This should not add to the perception that DITA is too complex (it should do the opposite).
It should be simple for end users to understand - that is the primary goal. This should be much simpler than the current design.
Backwards compatibility: any documents that use chunking today will require migration, but with low cost (search-and-replace which can be done before or after the switch to a fully DITA 2.0 environment).

Producing migration instructions or tools

Migration instructions (as part of a larger migration document) will be minimal, likely only a few paragraphs with small code examples.
No independent publication for this feature migration will be needed.
If other tools exist to migrate DITA documents, this would be an easy addition to those tools, but absent that tool this would be more easily done with search-and-replace routines.

Examples
Figure 1. Creating a single monolithic result document from a root map
<map chunk="combine">
<title>Previously this would have been chunk="to-content"</title>
<topicref href=""> <topicref href=""> ...
</map>

Figure 2. Creating multiple result documents from a single document

In the case where hello.dita contains 5 topics (either nested or peers within a <dita> element), the following markup would result in hello.dita being split into 5 individual documents. How the documents are handled at that point is up to the processor (in HTML5 output where one input file generally = one output file, this would turn hello.dita into five output files, presumably named after topic IDs within the original document). Note that the chunk="split" value has no impact on the nested reference notchunked.dita; in the resulting hierarchy, the reference to notchunked.dita should end up nested within the final topic split from hello.dita.

<map>
<title>Previously this would have used chunk="by-topic"</title>
<topicref href="" chunk="split">
<topicref href=""> </topicref>
<topicref href=""> ...
</map>

Figure 3. Creating multiple result documents from every source DITA document

In the case where hello.dita and world.dita each contain 5 topics (either nested or peers within a <dita> element), the following markup would result in the two original documents being split into 10 individual documents, with the same handling caveats as above.

<map chunk="split">
<title>Previously this would have used chunk="by-topic"</title>
<topicref href=""> <topicref href=""> </topicref>
</map>

Figure 4. Explicit example of split topic with resulting hierarchy

Assume the very simple map below with a single topic simple.dita, and the contents of simple.dita are also shown.

<map>
<title>Very simple "split" example</title>
<topicref href=""/>
</map>

simple.dita:
<topic id="a">
  <title>Root topic</title>
  <body>...</body>
  <topic id="b">
    <title>Sub-topic</title>
    <body>...</body>
    <topic id="c">
      <title>sub-sub-topic</title>
      <body>...</body>
    </topic>
  </topic>
  <topic id="jumpup">
    <title>another sub-topic</title>
    <body>...</body>
  </topic>
</topic>
The document simple.dita contains four topics; the chunking operation split effectively results in the following map, with each document containing only one topic. For this sample the file names are taken from the topic IDs for clarity but this is not required.
<map>
  <title>Very simple "split" example</title>
  <topicref href="">
    <topicref href="">
      <topicref href=""/>
    </topicref>
    <topicref href=""/>
  </topicref>
</map>
Figure 5. "split" when used on a grouping element
Assume the following map, where chunk="split" is used on grouping elements:
<map>
  <title>Groups are split</title>
  <topicgroup chunk="split">
    <topicref href=""/>
    <topicref href=""/>
  </topicgroup>
  <topichead chunk="split">
    <topicmeta><navtitle>Heading for a branch</navtitle></topicmeta>
    <topicref href=""/>
    <topicref href=""/>
  </topichead>
</map>

In the case of the <topicgroup> element, the @chunk value is ignored; it does not cascade, and there is no referenced topic, so it has no effect.
In the case of the <topichead> element, in many applications, the title is equivalent to a single title-only topic. In this case the @chunk value also has no effect; if the <topichead> is treated as a title-only topic, it cannot be split further, and if it is ignored for the current processing context, it is treated no differently than <topicgroup>.

Figure 6. "combine" when used on a grouping element
Assume the following map, where chunk="combine" is used on grouping elements:
<map>
  <title>Groups are combined</title>
  <topicgroup chunk="combine">
    <topicref href=""/>
    <topicref href=""/>
  </topicgroup>
  <topichead chunk="combine">
    <topicmeta><navtitle>Heading for a branch</navtitle></topicmeta>
    <topicref href=""/>
    <topicref href=""/>
  </topichead>
</map>

In the case of the <topicgroup> element, the @chunk value results in a single DITA document that includes the contents of both ingroup1.dita and ingroup2.dita. (A literal, DITA-valid file representation of the resulting content would presumably include each of those as peers within a <dita> container.)
In the case of the <topichead> element, the @chunk value also results in a single DITA document. Again, in many applications, the title is equivalent to a single title-only topic. In that case, a file representation of the resulting content would include the contents of inhead1.dita and inhead2.dita as children of the topic with "Heading for a branch" as the title. If <topichead> is ignored for the current processing context, the result would be the same as with <topicgroup>, where the contents of inhead1.dita and inhead2.dita become peers within a <dita> element.

Figure 7. Edge case: "split" becomes "combine"

Assume the following map, where chunk="split" on the root element means that all topics within this map structure are split by default, but a branch within the map sets chunk="combine":

<map chunk="split">
<title>Split most, but not one branch</title>
<topicref href=""> <topicref href="" chunk="combine">...</topicref>
<topicref href=""></topicref>
Assume as well that no other @chunk attributes are specified in this map. The following is true:

The document splitme.dita and all documents within that branch will be split apart if they contain more than one topic

Because of the chunk="combine" setting, the second branch with exception.dita at the root will result in a single result document

The document splitmetoo.dita and all documents within that branch will be split apart if they contain more than one topic

Figure 8. Edge case: ignoring "split" values within a combined branch
Assume the following map, where a branch is combined, but a nested <topicref> specifies "split":
<map>
<title>Ignoring split value</title>
<topicref href="" chunk="combine">
<topicref href=""> <topicref href="" chunk="split"/>
<topicref href=""> <topicref>
...
</map>
In this case:

The branch beginning with bigBranch.dita results in a single, combined document

In the combined document, the contents of iamhappy.dita, iamconfused.dita, and happyagain.dita are all peers within the final topic of bigBranch.dita

The chunk="split" value within the branch is ignored
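
The precedence rules in the two edge cases can be summarised in a short sketch that walks a map and reports the effective operation for each reference. It only follows nested <topicref> elements (grouping elements, submaps, and relationship tables are left out), and the function name effective_chunk and the href values in the usage example are placeholders, not names from the proposal.

```python
import xml.etree.ElementTree as ET

def effective_chunk(map_xml):
    """Return (href, effective operation) for each topicref, in document order."""
    root = ET.fromstring(map_xml)
    # chunk="split" on the root map sets the default for the whole map.
    default = "split" if root.get("chunk") == "split" else None
    results = []

    def walk(elem, in_combine):
        for ref in elem.findall("topicref"):
            own = ref.get("chunk")
            if in_combine:
                effective = "combine"   # nested @chunk values are ignored
            elif own in ("combine", "split"):
                effective = own         # "split" here does not cascade
            else:
                effective = default
            results.append((ref.get("href"), effective))
            walk(ref, effective == "combine")

    walk(root, root.get("chunk") == "combine")
    return results

example = """
<map chunk="split">
  <topicref href="one.dita"><topicref href="two.dita"/></topicref>
  <topicref href="three.dita" chunk="combine">
    <topicref href="four.dita" chunk="split"/>
  </topicref>
</map>
"""
print(effective_chunk(example))
```

In the usage example, one.dita and two.dita pick up the map-level "split" default, while three.dita and everything beneath it report "combine" even though four.dita asks for "split".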

