assembly issues

Here is my original email about assembly with comments from Norm, Jirka, and Larry interleaved. Please review this before our meeting on Wednesday 11 April. I hope to resolve these issues so we can start the Errata process for DocBook 5.1.

Bob

On 3/15/2018 9:55 AM, Robert Stayton wrote:

I'm initiating this discussion among DocBook TC members off list because I need the memory banks of Jirka and Norm as well.

I'm revising the documentation for assembly and trying to figure out if we also need some Errata corrections. The element I'm really stuck on is the transform element, which I have never personally used and now that I'm looking at it, I don't seem to understand it. Let me be clear that I'm not blaming Norm for the problems I see, as I was also responsible for reviewing this doc.

For background, here is the link to the assembly schema in RNC:

https://docbook.org/xml/5.1/rng/

Here is a link to the transform element reference page in TDG:

https://tdg.docbook.org/tdg/5.1/transform.html

Here is a link to section 3.5 "Transformations" of TDG:

https://tdg.docbook.org/tdg/5.1/ch06.html#transformations

Here is the example used in Section 3.5:
<transforms>
  <transform name="dita2docbook" grammar="text/xsl" href=""/>
  <transform name="tutorial" grammar="text/xsl" href=""/>
  <transform name="art2pi" grammar="text/xsl" href=""/>
  <transform name="office" grammar="application/xproc+xml" href=""/>
  <transform name="office" grammar="text/xsl" href=""/>
</transforms>

---------------------------------------------------------------------------

        Larry: The examples above are simply incorrect.  The grammar
        attribute is used there as a MIME type, but it is supposed to map
        to a vocabulary set (that is, a markup dialect such as DocBook or
        DITA or something similar).  The examples below show correct usage.

---------------------------------------------------------------------------

And here is the example in the transform element man page:
<transforms>
  <transform grammar="dita" fileref="dita2docbook.xsl"/>
  <transform name="tutorial" fileref="docbook2tutorial.xsl"/>
</transforms>

---------------------------------------------------------------------------

        Norm:

Those look like weird values for “@grammar” to me. I think there
should be a content type attribute for the media type, distinct from
the grammar.

In some cases, the content type is sufficient to identify the type of
resource, but for XML it often isn’t.

My understanding of @grammar is that it was for identifying the
“flavor” of markup: docbook, dita, html, etc.
---------------------------------------------------------------------------

Here are my questions that I hope to get help with.

1. What is the purpose of transform?

The man page says "for converting from a non-DocBook schema", which I presume means converting to DocBook. The man page Description says "A transform specifies a mechanism for converting one format into another during the assembly process." That implies the output of the transform is not necessarily DocBook. I'm not clear what the use case would be for generating non-DocBook output with a transform. Can anyone clarify?

---------------------------------------------------------------------------

Norm:

Assemblies mix together some fairly deep processing expectations with
an attempt at declarative markup. I’m not wholly satisfied with the
way it came out, but it’s what we all agreed to.

My understanding is that the transform element is a declaration: this
thing can translate from that format. It’s possibly an oversight that
we don’t have a way of saying what it’s translated into.

Jirka: I don't see why we would want to support any kind of output other than
DocBook.
---------------------------------------------------------------------------

Section 3.5 says "An assembly can identify a collection of transformations that can be used during the assembly process. A transformation can be associated with a resource (for example, to translate from some other format into DocBook), or with a module (to address requirements beyond the limited transformation capabilities of the assembly)." I interpret this to mean in both cases that a transform operates on a resource (a module points to a resource), and outputs DocBook, right?

---------------------------------------------------------------------------

Norm:

Given that we say they can be chained together, it’s possible that we
intentionally didn’t say anything about the output formats for what
might be intermediate transformations.

But the “last one” must be expected to generate DocBook, I think.

Larry:  Outputs are not necessarily DocBook.  Transforms are intended to serve two purposes:

1.      Normalize different markups to DocBook for combination into the resulting structures.

2.      Provide additional output capabilities beyond the standard DocBook outputs.  In the example above, the transform with name="tutorial" would produce the output to feed a tutorial delivery engine from the DocBook resulting from the assembly process.
---------------------------------------------------------------------------

2. What is the purpose and usage of the @grammar attribute on transform?

The man page says it "Identifies the markup grammar of a resource". I don't think this is referring to a <resource> element, because it is not associated with a <resource> element, but to the thing

---------------------------------------------------------------------------

Norm:

Both the resource and resources elements can have a grammar attribute.
So you could say:

  <resource grammar="dita" href=""/>

The expectation being that if you refer to that in a module, and if
there’s a transform for the dita grammar, it will be transformed
automatically when it’s included.

Larry: The grammar attribute appears on two classes of elements, sources and transform references.  They are used for matching.  If a source of content is marked with grammar="dita" and a transform is marked with grammar="dita", the transform will be used to convert the DITA-based resource to DocBook before the assembly of the content.  The grammar="text/xsl" value is just incorrect and a red herring.
---------------------------------------------------------------------------

specified in the href attribute of the transform element. In the first example above: grammar="text/xsl", suggesting that the @href is an XSL stylesheet and an XSL processor should be used. That example looks like a MIME type, but @grammar can contain any text, so its meaning must be user defined, correct?

---------------------------------------------------------------------------

Norm:

Right. My comment above about content type is exactly this.
---------------------------------------------------------------------------

So I presume grammar should be used to select the processor that can handle the thing specified by the href attribute, correct?

---------------------------------------------------------------------------

Norm: I think that's a bug.

Jirka:

That seems strange and doesn't align with the fact that grammar and name
are exclusive. Perhaps an additional attribute is missing, something
like type, that would indicate the language in which the transformation is written.
---------------------------------------------------------------------------

But in the second example from the man page it says grammar="dita", which seems to suggest that it refers to the schema of an input resource, in this case a dita document. That's why I'm not understanding the purpose of @grammar.

---------------------------------------------------------------------------

Jirka:

I think that grammar should hold the name of the vocabulary which can be
transformed to DocBook by the transformation. So grammar should
contain values like "dita", "tei", etc.
---------------------------------------------------------------------------

Also, @grammar is allowed on <resource>, where the man page says it "Identifies the markup grammar of a resource". That would seem to align with the grammar="dita" example from the transform man page. If a transform element had a matching grammar="dita" then that would be a selection process for applying that transform.

---------------------------------------------------------------------------

Norm:

Exactly. My best guess is that the use of media types in grammar was
not carefully considered.
---------------------------------------------------------------------------
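The @grammar matching described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a processor that picks the first transform whose @grammar equals the resource's @grammar. The function name, the file names in @href, and the namespace-free markup are all hypothetical and are not taken from the schema or from any implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical assembly fragment. The href values are placeholders and the
# DocBook namespace is omitted to keep the sketch short.
ASSEMBLY = """
<assembly>
  <resources>
    <resource xml:id="overview" grammar="dita" href="overview.dita"/>
    <resource xml:id="intro" href="intro.xml"/>
  </resources>
  <transforms>
    <transform grammar="dita" href="dita2docbook.xsl"/>
    <transform name="tutorial" href="docbook2tutorial.xsl"/>
  </transforms>
</assembly>
"""

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def transform_for_resource(root, resource_id):
    """Return the href of the transform whose @grammar matches the
    resource's @grammar, or None if no transform applies."""
    for resource in root.iter("resource"):
        if resource.get(XML_ID) != resource_id:
            continue
        grammar = resource.get("grammar")
        if grammar is None:
            # Assumption: no @grammar means the resource is already DocBook.
            return None
        for transform in root.iter("transform"):
            if transform.get("grammar") == grammar:
                return transform.get("href")
        return None
    raise KeyError(resource_id)


root = ET.fromstring(ASSEMBLY)
print(transform_for_resource(root, "overview"))  # dita2docbook.xsl
print(transform_for_resource(root, "intro"))     # None
```

Note that the named "tutorial" transform is never considered here, which is consistent with the view that @name and @grammar address different selection mechanisms.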

3. What is the purpose and usage of the @href attribute on transform?

First, the man page uses @fileref in the example, so that should be changed to @href. The man page says the href "Identifies the location of the data by URI". Here data is the stylesheet or script that is applied by the processor to the resource element to convert it to DocBook, correct?

---------------------------------------------------------------------------

Jirka: That's my understanding as well. Perhaps the description of the attribute was
copied and pasted from some other context and it's not completely clear in
this new context.
---------------------------------------------------------------------------

And @grammar tells it how to apply that stylesheet or script.

---------------------------------------------------------------------------

Norm:

I think href points to the transformation script (XSLT stylesheet,
XProc pipeline, whatever the processing environment supports). Grammar
identifies the kind of vocabulary that can be transformed with it. And
the element is currently missing a content-type attribute to identify
what kind of transformation technology it is.

Jirka: 
As I have written above, a different (and currently missing) attribute
should be used for this, IMHO.

Larry: The @href is a link to the actual transform.  I think it was originally valid to provide @href or @fileref and the example was developed when either was valid.  I actually prefer using fileref for local and href for HTTP exposed resources because I think it makes it easier to understand things, but whichever the grammar allows works.  The @grammar is used as a match to @grammar attributes on resources to say “use this transform to normalize this content to DocBook.”
---------------------------------------------------------------------------

4. What is the purpose and usage of the @name attribute on transform?

This is an NMTOKEN, so I presume it is used as an identifier of the transform, so that it can be referenced from other parts of the assembly. That makes sense, except for the fact that the schema rules of transform make @name and @grammar mutually exclusive. If @name is not used in a transform, how is it to be selected? If @grammar is not specified, then how does it know how to process the @href script? Is this mutual exclusion a mistake in the schema?

---------------------------------------------------------------------------

Norm:

We’re definitely off in “guessing” territory now. But my guess is that
@name is meant to identify a resource element. It probably isn’t
called linkend, and isn’t of type IDREF(S) because the way assemblies
can be composed it would often be the case that the IDs wouldn’t be
available until assembly time.

So you can point to the resource by href or you can point to a
resource. If you point to a resource, there’s an @grammar there and
not allowing one on the transform means you can’t mismatch them.

But I’m not very confident of my guess. For one thing, the example in
3.5 has transforms with both @name and @grammar which isn’t allowed.
Maybe those names should be xml:ids.

Larry: The reason that @name and @grammar are exclusive is that the two have different purposes.  I am not sure they have to be exclusive, I just didn’t see them being used at the same time.  The @grammar is used to match incoming content for normalization to DocBook and the @name is used to specify transforms OTHER than the standard DocBook rendering (mostly for output operations beyond the normal DocBook render types).  Making them non-exclusive is OK if there is another use model for the transforms.  I suspect it should actually be an IDREF, but I seem to remember there was some reason we made it a TOKEN instead, perhaps to support multipart assemblies.
---------------------------------------------------------------------------

5. Does a transform have a "type"?

The second paragraph of Section 3.5 says "If there are several ways to provide a transformation, they may all be listed provided that they have different types. In the example above, it may be that the XProc transformation from office documents to DocBook is superior to the XSLT-only transformation, but the XSLT-only transformation is better than nothing. If no type is specified, the default is implementation dependent." There is no @type attribute, so I'm not sure what this is referring to. Maybe @grammar?

---------------------------------------------------------------------------

Norm:

That’s where the content-type should be.

Jirka: That's. I think @type should be used there instead of  @grammar and then
it starts making sense.

Larry: This is something I don’t remember much about, but I think that the model is more one of fallback than one of type.  In that case, I would presume that two transforms with the same grammar would be selected based on order; that is, the XProc transform for Office to DocBook would be listed first, then the XSLT-only transformation, so that if XProc is available, it will be used, otherwise the XSLT transform will be used.  Otherwise, a priority indicator of some sort would have to be added.  Both of the transforms would have the same @grammar attribute (@grammar="office").  This is my understanding.
---------------------------------------------------------------------------
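The order-based fallback described above could look roughly like the following Python sketch. It assumes that document order is the only priority signal and that the processor can somehow tell whether it is able to run a given transform. The `can_run` callback, the `.xpl`/`.xsl` file names, and the extension test are hypothetical stand-ins for that capability check, since the schema currently has no attribute that identifies the transformation technology.

```python
import xml.etree.ElementTree as ET

# Hypothetical transform list: XProc first, XSLT-only second. The file names
# are placeholders and the DocBook namespace is omitted.
TRANSFORMS = """
<transforms>
  <transform grammar="office" href="office2docbook.xpl"/>
  <transform grammar="office" href="office2docbook.xsl"/>
  <transform grammar="dita" href="dita2docbook.xsl"/>
</transforms>
"""


def pick_transform(transforms, grammar, can_run):
    """Return the href of the first transform (in document order) that
    matches `grammar` and that the processor reports it can run."""
    for transform in transforms.iter("transform"):
        href = transform.get("href")
        if transform.get("grammar") == grammar and can_run(href):
            return href
    return None


def xslt_only(href):
    # Stand-in capability check for a processor without XProc support.
    return href.endswith(".xsl")


def xslt_and_xproc(href):
    # Stand-in capability check for a processor with XSLT and XProc support.
    return href.endswith((".xsl", ".xpl"))


transforms = ET.fromstring(TRANSFORMS)
print(pick_transform(transforms, "office", xslt_and_xproc))  # office2docbook.xpl
print(pick_transform(transforms, "office", xslt_only))       # office2docbook.xsl
```

Guessing the technology from a file extension is exactly the kind of workaround that a content-type or @type attribute on transform would make unnecessary.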

6. If the same @name value is used in more than one transform, what is the expected behavior?

The above paragraph suggests that one would be chosen based on some suitability criteria. The other possibility is that they would be applied in sequence (see below for more on multiple transforms). Or that it would be application defined?

---------------------------------------------------------------------------

Norm:

If @name is a pointer, it would just refer to the same resource.

Jirka: 
If we introduce @type in the above sense, then the assembly processor should
use the transform element with a supported @type and the given name.

Larry: I believe the @name attribute should likely be unique rather than repeated.  Otherwise it would be selected the same way described above for duplicate @grammar values.  I would suspect this would be less used than the duplicate @grammar.
---------------------------------------------------------------------------

7. How is a transform specified to be applied?

The first example in Section 3.5 specifies it on a resource element as:
<resource xml:id="overview" href=""
          transform="dita2docbook"/>
This seems to make sense in that it references the transform by its name attribute, but resource does not permit a @transform attribute. It does have a @grammar attribute, but that does not refer to an NMTOKEN. I believe resource should support @transform, not @grammar, so it can point to a <transform> element. Would resource still need an @grammar attribute?

---------------------------------------------------------------------------

Norm:

I think this should be

  <resource xml:id="overview" href=""
            grammar="dita"/>

The processor should find an appropriate transformation from the list
of transforms.

Jirka: I don't know if I recall it correctly, but it could be that if

- resource has @grammar then appropriate transform is looked up based on
@grammar on transform

- resource has @transformation then appropriate transform is looked up
based on @name on transform

Larry: I believe the example is in error.  It should be @grammar="dita".  That would convert the DITA content to DocBook before assembly.
---------------------------------------------------------------------------

The second example in Section 3.5 specifies it on a module element as:
<module resourceref="overview">
  <transform name="art2pi"/>
  <output type="book" renderas="partintro"/>
</module>
This suggests using a <transform> element as a child of module to refer to another transform element by name. But module does not have a <transform> child, nor does it support a @transform attribute. The <output> element has an @transform attribute, so I think this example should be changed to <output transform="art2pi"/>.

---------------------------------------------------------------------------

Norm:

Ah, this seems to just be broken. 

Larry: I believe this example is an error (and was based on an earlier, more complex schema).
I agree with that change.
---------------------------------------------------------------------------

Section 3.5 also says after the above example "In this case, two transformations will occur." The two transformations it mentions are the one in the resource and then the one in the module.

---------------------------------------------------------------------------

Norm:

I think the intent is that the “overview” module (presumably an
article) is to be transformed into a book by the transform (despite
its misleading name) and subsequently the standard transformation will
turn the book into a partintro.

Jirka: Perhaps yes and such functionality is there to support more complex
scenarios where @renderas is not sufficient.
---------------------------------------------------------------------------
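One reading of the "two transformations" passage, combined with the correction proposed earlier (@transform on <output> rather than a <transform> child of <module>), can be sketched in Python as follows. This illustrates only that reading: first the @grammar-matched transform normalizes the resource, then the transform named on <output> is applied. Norm's interpretation above differs. The function name, file names, and namespace-free markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical assembly fragment. The href values are placeholders and the
# DocBook namespace is omitted to keep the sketch short.
FRAGMENT = """
<assembly>
  <resources>
    <resource xml:id="overview" grammar="dita" href="overview.dita"/>
  </resources>
  <structure>
    <module resourceref="overview">
      <output transform="art2pi" renderas="partintro"/>
    </module>
  </structure>
  <transforms>
    <transform grammar="dita" href="dita2docbook.xsl"/>
    <transform name="art2pi" href="art2pi.xsl"/>
  </transforms>
</assembly>
"""

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def transform_chain(root, resourceref):
    """List the transform hrefs applied to one resource, in order:
    the @grammar match first, then any transform named on <output>."""
    chain = []
    resource = next(r for r in root.iter("resource")
                    if r.get(XML_ID) == resourceref)
    grammar = resource.get("grammar")
    if grammar is not None:
        for transform in root.iter("transform"):
            if transform.get("grammar") == grammar:
                chain.append(transform.get("href"))
                break
    # Index the named transforms so <output transform="..."> can be resolved.
    named = {t.get("name"): t.get("href")
             for t in root.iter("transform") if t.get("name")}
    for module in root.iter("module"):
        if module.get("resourceref") == resourceref:
            for output in module.findall("output"):
                name = output.get("transform")
                if name:
                    chain.append(named[name])
    return chain


root = ET.fromstring(FRAGMENT)
print(transform_chain(root, "overview"))  # ['dita2docbook.xsl', 'art2pi.xsl']
```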

Then the paragraph says "This can be generalized to an arbitrary number by listing more than one transform in the module. The transforms are applied in the order specified." The generalization of adding multiple <transform> children to <module> is not correct, because that element is not permitted in the schema. In fact, I don't see any way to specify a sequence of transforms beyond one @transform on resource and one on output.

---------------------------------------------------------------------------

Norm:

I expect that we did allow transform as a child of module at one
point. Either we decided this complicated chaining of transformations
was more than we needed, or we removed it by mistake.

Larry:  I agree that the description is incorrect.  There might be multiple output elements with different processes applied for the different output destinations.
---------------------------------------------------------------------------

8. Does assembly support a selection or fallback mechanism for applying transforms?

Repeating here the second paragraph of Section 3.5 which says "If there are several ways to provide a transformation, they may all be listed provided that they have different types. In the example above, it may be that the XProc transformation from office documents to DocBook is superior to the XSLT-only transformation, but the XSLT-only transformation is better than nothing. If no type is specified, the default is implementation dependent."

That suggests a selection mechanism. Is this application defined?

---------------------------------------------------------------------------

Norm:

Application-dependent, I think. You can list several, and the
processor is expected to know which one is “best”. That seems a
bit…speculative to me.

Larry: I am not really clear on this.  I don’t think fallback other than to the normal DocBook transforms for output is necessarily required, but fallback is something I have not really dealt with – if the system fails to correctly specify things, they typically fail and we fix them in our production environments (which are not currently using assemblies). 
---------------------------------------------------------------------------

Suggested schema changes

Based on my previous notes and questions, I suggest the following changes to the assembly schema:

1. Allow both @grammar and @name to appear on <transform>.

---------------------------------------------------------------------------

Larry: I don’t see the need for this since they are actually selection mechanisms for different classes of transforms. However, this could be discussed.

---------------------------------------------------------------------------

2. Require @name on <transform> as its address for selection.

---------------------------------------------------------------------------

Larry: I believe this is not appropriate since the @grammar match is used for some selections. I think that either @grammar or @name should be required.

---------------------------------------------------------------------------

3. Allow @transform to appear in <resource> to specify its transform.

---------------------------------------------------------------------------

Larry: I don’t think this is necessary since @grammar does this matching.

---------------------------------------------------------------------------

4. Remove @grammar from <output>, so @transform is used to select a transform by name.

---------------------------------------------------------------------------

Larry: I think this is reasonable (don’t remember why it was there originally, but it was a long time ago).

Norm:

Assuming we all agree on the proposal, after we’ve explored our
respective understandings of the situation, I’ll undertake to fix the
documentation as appropriate.

Jirka: 
That's one possibility. A second one is to introduce @type on transform and
keep two ways of addressing a transformation -- by the name of the transform
or by the name of the vocabulary (@grammar) that it supports. But it
probably doesn't make sense to support two different ways of referencing
transforms, so your proposal is more sound. But I think that in this case
@grammar should be renamed to @type.

I know that the people behind XMLmind XML Editor are probably among the few
who have implemented assemblies, so it could make sense to discuss the proposal
with them once it's coherent.
---------------------------------------------------------------------------

Suggested changes to the documentation

1. Describe that the intended use of transforms is to convert non-DocBook resources into DocBook so they can be merged in an assembly. Any other use of transforms is application defined.

---------------------------------------------------------------------------

Larry: The transform element is also used to define specialized transformations for output types that go beyond those supported directly by the DocBook transformations.

---------------------------------------------------------------------------

2. Tell people to apply a transform by using @transform to refer to a @name value in a <transform>. This applies to both <resource> and <output> in a module.

---------------------------------------------------------------------------

Larry: Disagree. The @grammar match should be used for binding transforms to <resource> elements.

---------------------------------------------------------------------------

3. Define @grammar in a <transform> to be the grammar of its @href so the correct processor can be selected.

---------------------------------------------------------------------------

Larry: Agree, but see below.

---------------------------------------------------------------------------

3b. Define @grammar in <resource> to define the schema of the resource, not as a means to select a transform.

---------------------------------------------------------------------------

Larry: Disagree. It is also the mechanism for matching the required transform for converting the source schema to normalized DocBook.

---------------------------------------------------------------------------

4. If more than one transform has the same name, then the behavior is application defined. It could perform a selection based on some criteria or it could be sequential. Or it could be profiled to select one at runtime.

---------------------------------------------------------------------------

Larry: I am ambivalent about this. This is probably the correct behavior, but I have few enough transforms in any environment I work with that it would likely be a single transform of any given type that would be provided. However, for portability I could see this being a reasonable description.

---------------------------------------------------------------------------

5. A sequence of transforms on a resource can be specified by adding @transform to <resource>, and another @transform to an <output> element in a <module> that refers to that resource. If they are the same name value, then the transformation is done only once. If the names are different, first the resource transform is applied, then the output transform. If a module has multiple output elements each with @transform, then those are applied in sequence based on document order.

---------------------------------------------------------------------------

Larry: I would not see this as being a common occurrence, since I envision transforms bound to resource elements being used for normalization to DocBook and transforms being bound to output elements being used for accomplishing specific classes of complex outputs (like producing content to feed a tutorial delivery engine). This is one reason I think the schema does not allow @name and @grammar on the same transform, since the two attributes are for different mappings.

---------------------------------------------------------------------------

I am open to further suggestions based on the discussions I hope to generate.

Bob

docbook-tc message