Hi all,

Apologies for the delay in completing my action item from the meeting on the 14th: “Joe will write up notes on RDFA from meeting”. Thanks to Nancy for the detailed and speedy work on the minutes from that call.

I wanted to get my facts right, and I had another suggestion to add, so I did some further research and experimentation to pull the notes together. I initially wrote a much longer piece with a lot on metadata basics, terminology, useful links, and some illustrations. I think that is too much to expect the TC to wade through, but I will make it available on my blog so that anyone, including non-TC members, can perhaps understand metadata in DITA a little better or add their comments.

Anyway, here’s what I have — I hope it’s useful and helps to spark some further discussion and wider participation concerning this important area. (I could also send a Word or PDF version if that would be helpful — I’m not sure whether the list accepts attachments.)

Kind regards,

Joe

Notes on support for working alongside external metadata in DITA

These notes are about improving interoperability between DITA and external tools, standards, and vocabularies for working with metadata. The intention is not to add support for any one vocabulary, as such things constantly change and that is not DITA’s domain. Rather the intention is to better support the frameworks behind such vocabularies.

The National Information Standards Organization (NISO) lists some “Notable Metadata Languages: Examples in Broad Use” (http://www.niso.org/apps/group_public/download.php/17446/Understanding%20Metadata.pdf)

Most of these languages are based on the RDF framework, or at least have a published version that uses it. In this way they support a Linked Data approach, where metadata containers and data are unambiguously identified by URIs, and can be linked effectively across different systems and organization- or industry-specific usages, without the need for proprietary mapping mechanisms.

Also, the Simple Knowledge Organization System (SKOS), a W3C standard for taxonomy and thesaurus data, uses the Linked Data approach. SKOS is very common, and in some cases mandatory, in the European public sector, as well as being widely used elsewhere. In SKOS, each “concept” (all the readable labels, synonyms, and translations that represent the same idea or term in a taxonomy) is identified by a URI.

Regardless of RDF and Linked Data, it is a best practice to use unique identifiers for controlled vocabulary values — identifiers that are distinct from the readable label/s for those values. This enables robust integration between systems such as CMSs with faceted search features, CCMSs, taxonomy management tools, and other data sources. It means that whenever a label for a value is changed or its relationships restructured, none of the tools that use the value have to change the content to which it is applied.
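To make the point about identifiers versus labels concrete, here is a minimal sketch. The taxonomy URIs, labels, and topic file names are all made up for illustration; the only thing it is meant to show is that renaming a label leaves the tagged content untouched when the content stores the identifier.

```python
# Hypothetical taxonomy: stable identifiers mapped to their readable labels.
taxonomy = {
    "http://example.com/taxonomy/123": "Tractors",
    "http://example.com/taxonomy/456": "Harvesters",
}

# Hypothetical content store: topics carry identifiers only, never labels.
tagged_topics = {
    "topic-a.dita": ["http://example.com/taxonomy/123"],
    "topic-b.dita": [
        "http://example.com/taxonomy/123",
        "http://example.com/taxonomy/456",
    ],
}


def labels_for(topic, taxonomy, tagged_topics):
    """Resolve the identifiers applied to a topic to their current labels."""
    return [taxonomy[uri] for uri in tagged_topics[topic]]


before = labels_for("topic-a.dita", taxonomy, tagged_topics)

# The taxonomy owner renames the concept; no topic has to be edited.
taxonomy["http://example.com/taxonomy/123"] = "Agricultural tractors"
after = labels_for("topic-a.dita", taxonomy, tagged_topics)
```

Had the topics stored the label "Tractors" directly, every tagged topic would have needed an edit after the rename.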

One more usage point that is important to note: it is becoming more common and more useful to assign metadata to granular chunks of content within DITA topics, for example:

as a fragment of information to supply an answer in a conversational interface.
as a component of a web page marked up with the Schema.org vocabulary for more informative and relevant search engine results.

These small chunks of content may be described as “microcontent”, although that term is also used in other senses.

Two ways that external metadata is used with DITA content

There are two prototypical use cases for external metadata with DITA:

Using values from an external taxonomy or controlled list. These values could be applied as descriptive metadata on any DITA object or element: a map, a topic, a block element, or an inline element. As noted above, the best practice is to use a unique identifier rather than a readable label to refer to the value. These days, that identifier is often in URI format.
Expressing DITA elements as metadata fields from an external schema or ontology. Often, external schemas are mapped to metadata such as the prolog elements or even semantic body-content elements such as the steps in a task. These mappings are implicit — buried in implementation mechanisms such as the DITA-OT (for example the mappings to Dublin Core elements). It is not easy for external systems to understand and make use of these mappings. In a Linked Data context, this means that DITA content is not understood as truly semantic.

There are challenges in using DITA structures with both of these use cases. The sections below discuss the requirements and possible options.

Working with RDF-based taxonomies

Some common general requirements for working with taxonomies in DITA are:

They should be applicable to any DITA object or element, not just whole topics or maps. (See the “microcontent” usage examples above.)
The values should travel with the objects that they are applied to. This rules out the use of Classification Maps, for which each topic may need identical metadata applied for each map that it is used in. (By far the most common applications of metadata are those where the descriptions remain constant for an object regardless of the context that it is used in. Clearly, if it is used in a map that has its own context such as the intended market or audience, the metadata should be applied on the whole-map level and understood to apply to the referenced topics by inference.)
The effective values applied to content should be the unique identifiers (in RDF-based taxonomies, the URIs), but it should be feasible for authoring tools to allow authors to work with the readable labels for those identifiers.

In addition, there is an expectation or at least a hope that content that is authored using one approach (for example Subject Scheme-controlled attributes) should be easily exchangeable with environments that use another, for example direct integration of a taxonomy management tool with an authoring tool.

General approaches: conreffed prolog/topicmeta elements, or Subject Scheme-controlled attributes

It is possible to use the conref mechanism to control values for metadata in topic prolog elements and map topicmeta elements. The generally good support for conref in authoring tools means that this can provide quite a usable experience for authors. However, the limitations are that:

It is not feasible to apply this metadata within other block and inline elements. It only really works at the whole-topic or whole-map level.
It is difficult to apply unique IDs while still allowing authors to work with friendly labels. There is a pragmatic workaround, created by François Violette from Talend, to replace friendly labels with URIs at publish time using the conref push mechanism. However, this is not really idiomatic DITA use and will be opaque to many external applications.

A more current and idiomatic approach, and certainly one that DITA users are guided to by the standard, is to use Subject Scheme to control attribute values that represent taxonomical data. The relevant attribute is bound to a hierarchy of controlled values via the binding mechanism in the <enumerationdef> element. Authoring tools can then use this information to provide a UI for authors to apply the appropriate attribute value by picking the associated label. (At least, they can, although it seems that only Oxygen XML has implemented this so far, and for that the best support is specifically for profiling attributes.)

However, this approach presents problems when working with SKOS or other RDF-based taxonomies. RDF uses URI-format identifiers, typically URLs although they can be URNs. URLs all use the slash character (/), and URNs are allowed to use it. This character is prohibited in DITA key values because it is already used as the separator between key names and element IDs (for example in conkeyref). Allowing it in key values, although the most direct solution, would require profound changes to the key architecture; changes that would likely impact all vendors and all users of keys.
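The clash between URIs and key names can be illustrated with a small check. This is only a sketch: the set of prohibited characters is recalled from the DITA 1.3 key-name rules and should be verified against the spec text, and the function name is hypothetical. The slash is the character that matters for the argument above.

```python
# Characters prohibited in DITA key names, as recalled from the DITA 1.3
# specification (an assumption to verify); whitespace is handled separately.
PROHIBITED_KEY_CHARS = set("{}[]/#?")


def is_valid_key_name(name):
    """Return True if the string could be used as a single DITA key name."""
    if not name:
        return False
    return not any(ch in PROHIBITED_KEY_CHARS or ch.isspace() for ch in name)


plain_key_ok = is_valid_key_name("123")
url_ok = is_valid_key_name("http://example.com/Taxonomy/123")
urn_ok = is_valid_key_name("urn:example:taxonomy:123")
```

Under these assumptions a plain key and a slash-free URN pass, while any URL fails on its slashes, which is why URL-style identifiers cannot simply be dropped into @keys.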

A workaround is to at least associate the URI with the key by putting it in an @href attribute on the <subjectdef>, like this:

<subjectdef keys="123" format="rdf:resource" scope="external"
            href="">
  <topicmeta>
    <navtitle>Tractors</navtitle>
  </topicmeta>
</subjectdef>

The href is supposed to include more information about the resource. This usage is semantically appropriate, because the best practice for Linked Data URIs is to make them dereferenceable — to put some helpful information at the target of the URL. (Via content negotiation, this can be both machine-readable RDF and human-readable HTML.)

However, this presents a problem of interoperability with environments where the authoring tool is integrated directly with a taxonomy management tool, for example in semi-automated classification scenarios. These environments will typically embed the URI in content rather than using a Subject Scheme map to look up keys. They have no easy way to map the keys to the associated URIs and vice versa.

There is a further issue in that there is no obvious, unspecialized attribute to hold the metadata values, whether URIs or not. @outputclass is too general, publishing-focused and arguably already overused. @props is specifically for specializing new conditional profiling attributes. @base is also intended for specialization only. What seems to be missing is some kind of a generic “taxonomy” or “metadata” attribute that could be used by:

the many DITA users who feel that specialization is too risky / expensive or whose tools don’t adequately support it.
tool vendors who want to demonstrate support for DITA taxonomy mechanisms and who do not want to put users off by assuming the use of a specialization.

In addition, content interchange, particularly in Linked Data contexts, would be easier if there were a core attribute of this nature.

In the context of discussing these limitations in a DITA TC call on 2017-11-14, Eliot Kimber suggested two architectural modifications with low impact to the spec. The following subsections discuss these modifications.

Adding enumeration-value element to Subject Scheme

To store the effective unique ID value alongside the key and readable value, an <enumeration-value> element could be added within <subjectdef>. The value would then be made available to insert on bound attributes, in the same way that the key value is currently. In the Subject Scheme map, the structure would look like this:

<subjectdef keys="123">
  <enumeration-value>http://example.com/Taxonomy/123</enumeration-value>
  <topicmeta>
    <navtitle>Tractors</navtitle>
  </topicmeta>
</subjectdef>

This would solve the problem of effectively using URIs, at the cost of a little added complexity and with the proviso that users would need to come up with their own ways of generating unique key values either automatically or by hand. It doesn’t seem to be within the remit of the DITA spec to define any particular system for this.
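As a rough illustration of how a tool might consume this structure, the sketch below reads a Subject Scheme fragment and builds lookups between labels, keys, and the proposed <enumeration-value>. The element is only a proposal, the second subject and the wrapping <subjectScheme> root are made up for the example, and the function name is hypothetical; none of this reflects an existing implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical Subject Scheme fragment using the proposed <enumeration-value>.
SUBJECT_SCHEME = """
<subjectScheme>
  <subjectdef keys="123">
    <enumeration-value>http://example.com/Taxonomy/123</enumeration-value>
    <topicmeta><navtitle>Tractors</navtitle></topicmeta>
  </subjectdef>
  <subjectdef keys="456">
    <enumeration-value>http://example.com/Taxonomy/456</enumeration-value>
    <topicmeta><navtitle>Harvesters</navtitle></topicmeta>
  </subjectdef>
</subjectScheme>
"""


def load_subjects(xml_text):
    """Collect key, readable label, and effective value for each subjectdef."""
    root = ET.fromstring(xml_text.strip())
    subjects = []
    for subjectdef in root.iter("subjectdef"):
        subjects.append({
            "key": subjectdef.get("keys"),
            "label": subjectdef.findtext("topicmeta/navtitle"),
            "value": subjectdef.findtext("enumeration-value"),
        })
    return subjects


subjects = load_subjects(SUBJECT_SCHEME)

# An authoring tool could show labels but insert the URI on the attribute.
value_by_label = {s["label"]: s["value"] for s in subjects}

# A tool that receives URI-tagged content could map back to the DITA key.
key_by_value = {s["value"]: s["key"] for s in subjects}
```

The second lookup is the piece that the @href workaround makes hard to agree on: with an explicit element, both directions of the key/URI mapping come from one documented place.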

Adding a new global metadata attribute that accepts URIs and the grouping mechanism

As noted above, there is no obvious attribute for storing taxonomy values. Eliot Kimber suggested adding such an attribute; one that could take multiple values and also worked with the grouping syntax found on the current @props attribute (allowing for pseudo-specialization in tools that support controlled values using this syntax). Usages could look like this:

values="http://example.com/taxonomy/123"

or, using the grouping syntax, this:

values="market(emea) product(acme-tractor)"

Two questions (apart from the obvious ones of whether this makes sense, is a worthwhile endeavor and is the best way to handle things):

Should such an attribute allow non-URI values, such as other formats of unique ID? I would imagine so but I’m not sure whether that was Eliot’s intention and what others think.
What should the attribute be called? “metadata”? “taxonomy”? “values”? Something else?
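To show that one attribute could carry both styles of value, here is a minimal parsing sketch. It assumes the attribute is called "values" (just one of the candidate names), that tokens are whitespace-separated, and that grouped values follow the @props-style pattern group(value value). The function name and the choice to file ungrouped values under None are illustrative only.

```python
import re

# One token is either "group(value value ...)" or a bare, ungrouped value.
TOKEN = re.compile(
    r"(?P<group>[^\s()]+)\((?P<values>[^()]*)\)|(?P<bare>[^\s()]+)"
)


def parse_values(attr):
    """Split an attribute string into {group name: [values]}.

    Ungrouped values (for example plain URIs) are collected under None.
    """
    parsed = {}
    for match in TOKEN.finditer(attr):
        if match.group("group") is not None:
            group = match.group("group")
            values = match.group("values").split()
        else:
            group = None
            values = [match.group("bare")]
        parsed.setdefault(group, []).extend(values)
    return parsed


grouped = parse_values("market(emea) product(acme-tractor)")
ungrouped = parse_values("http://example.com/taxonomy/123")
```

With these assumptions a URI parses cleanly as a bare value because it contains no parentheses. A URI that did contain parentheses would need an escaping rule, which is one more detail a real proposal would have to settle.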

Understanding DITA elements in relation to external metadata schemas

DITA architects often seek to associate DITA’s range of semantically-named elements with the fields and container types of external metadata schemas. Where suitably named or placed elements are not available, they may specialize their own.

Two contemporary examples are the Intelligent Information Request and Delivery Standard (iiRDS) and Schema.org. Both are vocabularies based on RDF Schema. (In this usage, “vocabularies” means lightweight ontologies: classes and properties that are associated with content. It is usually only the term “controlled vocabulary” that signifies taxonomy.)

iiRDS is really a specification for metadata that accompanies content objects rather than being embedded in them. However, some of the metadata it specifies has parallels in DITA elements. It is quite feasible to transform some DITA elements to their corresponding iiRDS structures during publishing, resulting in the specified content + metadata package. This is exactly what members of Parson (Marion Knebel and Mark Schubert amongst others) and Empolis (Martin Kreutzer) did in preparation for presentations at the Tcworld 2017 conference.

In searching for semantically appropriate mappings, they decided to specialize in order to make a clear distinction between some of the elements to map to iiRDS (such as <iirds-product> and <information-subject>) and their more ambiguous out-of-the-box DITA equivalents (<product> and maybe <category>). Nevertheless, these specialized elements still cannot be understood in a Linked Data setting until they are transformed into iiRDS metadata using the customized DITA-OT plugin that the team developed. This is not to downplay the valuable work that the team has done, but rather to point out that there is no mechanism in DITA to associate elements with their equivalents in external schemas in an unambiguous, machine-readable way.

Schema.org is an extremely popular vocabulary for marking up web content. It is popular because major search engine providers such as Google, Yahoo, and Bing back the initiative and use the resulting markup to understand better the semantic content of pages (and hence rank them better for relevant queries). They also use it in richer search results, for example to include recipe steps, timings, and images directly in the results page for recipe-related queries.

Schema.org has equivalent entities for much of the metadata that is also found in standards such as FOAF and Dublin Core. However, it also has entities and properties that correspond more to some of DITA’s semantic inline elements such as <step> and <cmd>. An obvious use case for DITA in a web publishing environment (particularly a commercial one) is to map these semantic elements to their Schema.org markup equivalents. There is no point in defining these things twice. However, again there is no unambiguous way to do so: we may develop mappings as part of DITA-OT or other publishing transformations, but there is no presentation-neutral way to simply state that an element is the equivalent of another in an external schema. Nor is there a way to share such mappings easily or use them to allow Linked Data tools to parse the content.

Once again, it is not that DITA should add new core elements or even proactively specify semantic equivalents for major schemas. But it could be useful to have a way to allow users to specify these equivalents.

Two ways could be:

using an existing add-on syntax (RDFa) to specify relationships
specifying relationships indirectly, not embedded in each instance of each relevant element

The final two subsections discuss these options.

RDFa (Lite)

RDFa is a way to embed RDF data in normal HTML or X(HT)ML content, using a number of attributes. For example, the equivalent of a DITA <cmd> element could be expressed in RDFa syntax using the Schema.org vocabulary in this way:

<p vocab="http://schema.org/" typeof="HowToDirection">Start the tractor.</p>

Of course, adding equivalent attributes to each usage of a DITA <cmd> element would be rather redundant, since that element already has the sense of a command or direction:

<cmd rdfa-vocab="http://schema.org/" rdfa-typeof="HowToDirection">Start the tractor.</cmd>

RDFa was designed primarily to mark up the less semantic elements in HTML, so some redundancy is to be expected. In addition, since it was expected that the markup might be hand-coded, there are a number of mechanisms (such as @prefix and the inheritance rules around @resource and @vocab) to reduce verbosity. This they do, but they also add complexity, for authors and also for implementors.

RDFa usage seems to be declining: in the end there was never much hand-coding of markup done, nor were there sophisticated authoring tools to add it automatically. The complexity of parsing RDFa, amongst other factors, led to the development of a JSON-based serialization of RDF, JSON-LD. JSON-LD has proved more popular with developers, although it is less useful for our purposes since it is not embedded in the relevant elements.

The DocBook standard has included RDFa Lite in the latest version. Although this is simpler than full RDFa, it is still complex to parse and, given the redundancy issues as well, adoption might be slow if it were added to the core DITA standard.

Semantic mapping idea

One possibility to allow a DITA-sanctioned, open way of understanding elements as their external metadata equivalents (whatever those may be) is to specify a mapping mechanism. Perhaps in the same way as Subject Scheme maps are now linked from root maps, and maybe even as an extension to Subject Scheme, a DITA file could pair the names of DITA elements, specialized or not, with data on their equivalents.

This would minimize redundancy, as each mapping would only be stated once. It should have a minimal impact on editing tools and author experience, since regular topics and maps would remain unchanged.

To provide for complete metadata statements in the context of standards such as Schema.org, each element would need to be mappable to multiple external entity types. If it were RDF-specific, the entity types would be rdf:type (for classes) and rdf:property (for properties of those classes). In addition, for each entity type there should be a way of referencing the actual entity concerned. For example, the full mapping from <cmd> to Schema.org might look something like this:

<semantic-equivalents>
  <semantic-equivalent>
    <dita-element>cmd</dita-element>
    <semantic-relationship>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</semantic-relationship>
    <semantic-object>http://schema.org/HowToDirection</semantic-object>
  </semantic-equivalent>
  <semantic-equivalent>
    <dita-element>cmd</dita-element>
    <semantic-relationship>http://www.w3.org/1999/02/22-rdf-syntax-ns#property</semantic-relationship>
    <semantic-object>http://schema.org/itemListElement</semantic-object>
  </semantic-equivalent>
</semantic-equivalents>

(The mapping of the same element to both an rdf:type and an rdf:property is because of the way that Schema.org relates entities to their containers.)

Clearly, there could be many ways in which to encode such a mapping, and this is just one idea.
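To give a feel for how lightweight a consumer of such a mapping could be, the sketch below loads a mapping file of the shape suggested above and looks up the equivalents for an element name. The element names are the ones floated in these notes, not anything defined by DITA, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping file, following the structure suggested in these notes.
MAPPING = """
<semantic-equivalents>
  <semantic-equivalent>
    <dita-element>cmd</dita-element>
    <semantic-relationship>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</semantic-relationship>
    <semantic-object>http://schema.org/HowToDirection</semantic-object>
  </semantic-equivalent>
  <semantic-equivalent>
    <dita-element>cmd</dita-element>
    <semantic-relationship>http://www.w3.org/1999/02/22-rdf-syntax-ns#property</semantic-relationship>
    <semantic-object>http://schema.org/itemListElement</semantic-object>
  </semantic-equivalent>
</semantic-equivalents>
"""


def load_equivalents(xml_text):
    """Return {DITA element name: [(relationship URI, object URI), ...]}."""
    mapping = {}
    root = ET.fromstring(xml_text.strip())
    for entry in root.iter("semantic-equivalent"):
        element = entry.findtext("dita-element")
        pair = (
            entry.findtext("semantic-relationship"),
            entry.findtext("semantic-object"),
        )
        mapping.setdefault(element, []).append(pair)
    return mapping


equivalents = load_equivalents(MAPPING)

# A publishing transform or Linked Data tool could now ask what <cmd> means.
cmd_equivalents = equivalents.get("cmd", [])
```

Because the mapping lives in one file, the same lookup could serve a DITA-OT plugin, a CMS, or an external RDF tool without any of them touching the topics themselves.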

Two questions would be:

In most cases, it might be sufficient that the text content of elements is the object of the metadata statements (the triples) that relate to it. However, in the case of an <image> or similar media container, it would be very useful if the value of the @href were used instead. How easy and practicable would it be to make this a default behavior? Could all elements with an @href be understood to use the value of that attribute as the object?
This system provides no way to associate a URI with a particular element and its content. Sometimes such identifiers are used by graph databases such as MarkLogic or other more conventional CMSs. It is not necessary for general RDF use, however — it is perfectly acceptable to have the content that is the subject as a blank node.

Note that there is no suggestion to model the whole of an external schema in DITA or for enforcing its constraints. That goes beyond what the DITA standard should define. However, it might be helpful to use this kind of simple mapping system to make DITA content more interoperable and, as the Linked Data community would see it, more truly semantic.
