Subject: Notes on working with external metadata in DITA
Apologies for the delay in completing my action item from the meeting on the 14th: “Joe will write up notes on RDFA from meeting”. Thanks to Nancy for the detailed and speedy work on the minutes from that call.
I wanted to get my facts right, and I had another suggestion to add, so I did some further research and experimentation to pull the notes together. I initially wrote a much longer piece with a lot on metadata basics, terminology, useful links, and some illustrations. I think that is too much to expect the TC to wade through, but I will make it available on my blog so that anyone, including non-TC members, can perhaps understand metadata in DITA a little better or add their comments.
Anyway, here’s what I have — I hope it’s useful and helps to spark some further discussion and wider participation concerning this important area. (I could also send a Word or PDF version if that would be helpful — I’m not sure whether the list accepts attachments.)
Notes on support for working alongside external metadata in DITA
These notes are about improving interoperability between DITA and external tools, standards, and vocabularies for working with metadata. The intention is not to add support for any one vocabulary, as such things constantly change and that is not DITA’s domain. Rather the intention is to better support the frameworks behind such vocabularies.
The National Information Standards Organization (NISO) lists some “Notable Metadata Languages: Examples in Broad Use” (http://www.niso.org/apps/group_public/download.php/17446/Understanding%20Metadata.pdf)
Most of these languages are based on the RDF framework, or at least have a published version that uses it. In this way they support a Linked Data approach, where metadata containers and data are unambiguously identified by URIs, and can be linked effectively across different systems and organization- or industry-specific usages, without the need for proprietary mapping mechanisms.
Also, the Simple Knowledge Organization System (SKOS), a W3C standard for taxonomy and thesaurus data, uses the Linked Data approach. SKOS is very common, and in some cases mandatory, in the European public sector, as well as being widely used elsewhere. In SKOS, each “concept” (all readable labels, synonyms, and translations that represent the same idea or term in a taxonomy) is identified by a URI.
Regardless of RDF and Linked Data, it is a best practice to use unique identifiers for controlled vocabulary values — identifiers that are distinct from the readable label/s for those values. This enables robust integration between systems such as CMSs with faceted search features, CCMSs, taxonomy management tools, and other data sources. It means that whenever a label for a value is changed or its relationships restructured, none of the tools that use the value have to change the content to which it is applied.
One more usage point that is important to note: it is becoming more common and more useful to assign metadata to granular chunks of content within DITA topics, for example:
These small chunks of content may be described as “microcontent”, although that term is also used in other senses.
Two ways that external metadata is used with DITA content
There are two prototypical use cases for external metadata with DITA:
There are challenges in using DITA structures with both of these use cases. The sections below discuss the requirements and possible options.
Working with RDF-based taxonomies
Some common general requirements for working with taxonomies in DITA are:
In addition, there is an expectation or at least a hope that content that is authored using one approach (for example Subject Scheme-controlled attributes) should be easily exchangeable with environments that use another, for example direct integration of a taxonomy management tool with an authoring tool.
General approaches: conreffed prolog/topicmeta elements, or Subject Scheme-controlled attributes
It is possible to use the conref mechanism to control values for metadata in topic prolog elements and map topicmeta elements. The generally good support for conref in authoring tools means that this can provide quite a usable experience for authors. However, the limitations are that:
A more current and idiomatic approach, and certainly one that DITA users are guided to by the standard, is to use Subject Scheme to define the keys that represent taxonomical data. The relevant attribute is bound to a hierarchy of controlled values via the binding mechanism in the <enumerationdef> element. Authoring tools can then use this information to provide a UI for authors to apply the appropriate attribute value by picking the associated label. (At least, they can, although it seems that only Oxygen XML has implemented this so far, and for that the best support is specifically for profiling attributes.)
However, this approach presents problems when working with SKOS or other RDF-based taxonomies. RDF uses URI-format identifiers, typically URLs although they can be URNs. URLs all use the slash character (/), and URNs may use it. This character is prohibited in DITA key values because it is already used as the separator between key names and element IDs (for example in conkeyref). Allowing it in key values, although the most direct solution, would require profound changes to the key architecture; changes that would likely impact all vendors and all users of keys.
A workaround is to at least associate the URI with the key by putting it in an @href attribute on the <subjectdef>, like this:
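As an illustrative sketch only (the key names, the @platform binding, and the example.com URIs are assumptions, not taken from any real taxonomy), a Subject Scheme map that uses this workaround might look something like the following:

```xml
<subjectScheme>
  <!-- Hypothetical taxonomy branch; keys stay slash-free -->
  <subjectdef keys="os">
    <topicmeta><navtitle>Operating system</navtitle></topicmeta>
    <!-- The URI of the external concept is carried on @href -->
    <subjectdef keys="linux"
                href="http://example.com/taxonomy/os/linux"
                format="html" scope="external">
      <topicmeta><navtitle>Linux</navtitle></topicmeta>
    </subjectdef>
  </subjectdef>
  <!-- Bind the controlled values to an attribute -->
  <enumerationdef>
    <attributedef name="platform"/>
    <subjectdef keyref="os"/>
  </enumerationdef>
</subjectScheme>
```

Content would then carry only the key, for example platform="linux", and a tool would have to open the Subject Scheme map to recover the URI.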
The href is supposed to include more information about the resource. This usage is semantically appropriate, because the best practice for Linked Data URIs is to make them dereferenceable, that is, to put some helpful information at the target of the URL. (Via content negotiation, this can be both machine-readable RDF and human-readable HTML.)
However, this presents a problem of interoperability with environments where the authoring tool is integrated directly with a taxonomy management tool, for example in semi-automated classification scenarios. These environments will typically embed the URI in content rather than using a Subject Scheme map to look up keys. They have no easy way to map the keys to the associated URIs and vice versa.
There is a further issue in that there is no obvious, unspecialized attribute to hold the metadata values, whether URIs or not. @outputclass is too general, publishing-focused and arguably already overused. @props is specifically for specializing new conditional profiling attributes. @base is also intended for specialization only. What seems to be missing is some kind of a generic “taxonomy” or “metadata” attribute that could be used by:
In addition, content interchange, particularly in Linked Data contexts, would be easier if there were a core attribute of this nature.
In the context of discussing these limitations in a DITA TC call on 2017-11-14, Eliot Kimber suggested two architectural modifications with low impact to the spec. The following subsections discuss these modifications.
Adding enumeration-value element to Subject Scheme
To store the effective unique ID value alongside the key and readable value, an <enumeration-value> element could be added within <subjectdef>. The value would then be made available to insert on bound attributes, in the same way that the key value is currently. In the Subject Scheme map, the structure would look like this:
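A minimal sketch of the idea follows. Note that <enumeration-value> is only a proposed element and is not part of any DITA release; its exact placement inside <subjectdef>, the key name, and the example.com URI are all assumptions made for illustration:

```xml
<subjectdef keys="linux">
  <topicmeta><navtitle>Linux</navtitle></topicmeta>
  <!-- Proposed (hypothetical) element: the value to insert on bound
       attributes instead of the key; here a full URI with slashes -->
  <enumeration-value>http://example.com/taxonomy/os/linux</enumeration-value>
</subjectdef>
```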
This would solve the problem of effectively using URIs, at the cost of a little added complexity and with the proviso that users would need to come up with their own ways of generating unique key values either automatically or by hand. It doesn’t seem to be within the remit of the DITA spec to define any particular system for this.
Adding a new global metadata attribute that accepts URIs and the grouping mechanism
As noted above, there is no obvious attribute for storing taxonomy values. Eliot Kimber suggested adding such an attribute; one that could take multiple values and also worked with the grouping syntax found on the current @props attribute (allowing for pseudo-specialization in tools that support controlled values using this syntax). Usages could look like this:
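For illustration only, assuming the new attribute were named @metadata (a hypothetical name; none has been agreed) and using made-up example.com URIs, a usage with multiple space-separated values might be:

```xml
<!-- @metadata is a hypothetical attribute, not valid in current DITA -->
<step metadata="http://example.com/taxonomy/os/linux http://example.com/taxonomy/task/install">
  <cmd>Install the package.</cmd>
</step>
```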
or, using the grouping syntax, this:
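A sketch of a grouped usage, modelled on the group-name(values) form that @props already accepts, could look like this. The attribute name @metadata, the group names, and the URIs are again assumptions:

```xml
<!-- Hypothetical attribute; group names act as pseudo-specialized facets -->
<step metadata="platform(http://example.com/taxonomy/os/linux)
                subject(http://example.com/taxonomy/task/install
                        http://example.com/taxonomy/task/configure)">
  <cmd>Install and configure the package.</cmd>
</step>
```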
Two questions (apart from the obvious ones of whether this makes sense, is a worthwhile endeavor and is the best way to handle things):
Understanding DITA elements in relation to external metadata schemas
DITA architects often seek to associate DITA’s range of semantically-named elements with the fields and container types of external metadata schemas. Where suitably named or placed elements are not available, they may specialize their own.
Two contemporary examples are the Intelligent Information Request and Delivery Standard (iiRDS) and Schema.org. Both are vocabularies based on RDF Schema. (In this usage, “vocabularies” means lightweight ontologies: classes and properties that are associated with content. It is usually only the term “controlled vocabulary” that signifies taxonomy.)
iiRDS is really a specification for metadata that accompanies content objects rather than being embedded in them. However, some of the metadata it specifies has parallels in DITA elements. It is quite feasible to transform some DITA elements to their corresponding iiRDS structures during publishing, resulting in the specified content-plus-metadata package. This is exactly what members of Parson (Marion Knebel and Mark Schubert amongst others) and Empolis (Martin Kreutzer) did in preparation for presentations at the tcworld 2017 conference.
In searching for semantically appropriate mappings, they decided to specialize in order to make a clear distinction between some of the elements to map to iiRDS (such as <iirds-product> and <information-subject>) and their more ambiguous out-of-the-box DITA equivalents (<product> and maybe <category>). Nevertheless, these specialized elements still cannot be understood in a Linked Data setting until they are transformed into iiRDS metadata using the customized DITA-OT plugin that the team developed. This is not to downplay the valuable work that the team has done, but rather to point out that there is no mechanism in DITA to associate elements with their equivalents in external schemas in an unambiguous, machine-readable way.
Schema.org is an extremely popular vocabulary for marking up web content. It is popular because major search engine providers such as Google, Yahoo, and Bing back the initiative and use the resulting markup to understand better the semantic content of pages (and hence rank them better for relevant queries). They also use it in richer search results, for example to include recipe steps, timings, and images directly in the results page for recipe-related queries.
Schema.org has equivalent entities for much of the metadata that is also found in standards such as FOAF and Dublin Core. However, it also has entities and properties that correspond more to some of DITA’s semantic inline elements such as <step> and <cmd>. An obvious use case for DITA in a web publishing environment (particularly a commercial one) is to map these semantic elements to their Schema.org markup equivalents. There is no point in defining these things twice. However, again there is no unambiguous way to do so: we may develop mappings as part of DITA-OT or other publishing transformations, but there is no presentation-neutral way to simply state that an element is the equivalent of another in an external schema. Nor is there a way to share such mappings easily or use them to allow Linked Data tools to parse the content.
Once again, it is not that DITA should add new core elements or even proactively specify semantic equivalents for major schemas. But it could be useful to have a way to allow users to specify these equivalents.
Two ways could be:
The final two subsections discuss these options.
RDFa is a way to embed RDF data in normal HTML or X(HT)ML content, using a number of attributes. For example, the equivalent of a DITA <cmd> element could be expressed in RDFa syntax using the Schema.org vocabulary in this way:
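As a rough sketch (the choice of the Schema.org HowTo, HowToStep, and HowToDirection types and the sample text are my assumptions about a reasonable equivalent, not an established mapping), the XHTML output for a single step might carry RDFa like this:

```xml
<ol vocab="http://schema.org/" typeof="HowTo">
  <!-- Roughly equivalent to a DITA <step> -->
  <li property="step" typeof="HowToStep">
    <!-- Roughly equivalent to a DITA <cmd>: both a property of its
         container and a typed entity in its own right -->
    <span property="itemListElement" typeof="HowToDirection">
      <span property="text">Press the power button.</span>
    </span>
  </li>
</ol>
```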
Of course, adding equivalent attributes to each usage of a DITA <cmd> element would be rather redundant, since that element already has the sense of a command or direction:
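To make the redundancy concrete, here is a hypothetical fragment that is not valid DITA today (the RDFa attributes on <cmd> are assumed purely for illustration). Every single <cmd> in every task would need to repeat the same pair of attributes:

```xml
<step>
  <!-- @property and @typeof are NOT DITA attributes; shown only to
       illustrate what per-element RDFa would require -->
  <cmd property="itemListElement" typeof="HowToDirection">Press the power button.</cmd>
</step>
```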
RDFa was designed primarily to mark up the less semantic elements in HTML, so some redundancy is to be expected. In addition, since it was expected that the markup might be hand-coded, there are a number of mechanisms (such as @prefix and the inheritance rules around @resource and @vocab) to reduce verbosity. This they do, but they also add complexity, for authors and also for implementors.
RDFa usage seems to be declining: in the end there was never much hand-coding of markup done, nor were there sophisticated authoring tools to add it automatically. The complexity of parsing RDFa, amongst other factors, led to the development of a JSON-based serialization of RDF, JSON-LD. JSON-LD has proved more popular with developers, although it is less useful for our purposes since it is not embedded in the relevant elements.
The DocBook standard has included RDFa Lite in the latest version. Although this is simpler than full RDFa, it is still complex to parse and, given the redundancy issues as well, adoption might be slow if it were added to the core DITA standard.
Semantic mapping idea
One possibility to allow a DITA-sanctioned, open way of understanding elements as their external metadata equivalents (whatever those may be) is to specify a mapping mechanism. Perhaps in the same way as Subject Scheme maps are now linked from root maps, and maybe even as an extension to Subject Scheme, a DITA file could pair the names of DITA elements, specialized or not, with data on their equivalents.
This would minimize redundancy, as each mapping would only be stated once. It should have a minimal impact on editing tools and author experience, since regular topics and maps would remain unchanged.
To provide for complete metadata statements in the context of standards such as Schema.org, each element would need to be mappable to multiple external entity types. If it were RDF-specific, the entity types would be rdf:type (for classes) and rdf:property (for properties of those classes). In addition, for each entity type there should be a way of referencing the actual entity concerned. For example, the full mapping from <cmd> to Schema.org might look something like this:
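One possible encoding is sketched below. Every element and attribute name in it (<semanticmap>, <elementmapping>, <equivalent>, @kind) is invented for illustration and exists in no DITA specification, and the Schema.org targets are my assumption of plausible equivalents:

```xml
<!-- Hypothetical mapping file, linked from a root map in the same
     way that a Subject Scheme map is -->
<semanticmap>
  <elementmapping element="cmd">
    <!-- The class that a <cmd> is an instance of -->
    <equivalent kind="rdf:type"
                href="http://schema.org/HowToDirection"/>
    <!-- The property that relates a <cmd> to its containing step -->
    <equivalent kind="rdf:property"
                href="http://schema.org/itemListElement"/>
  </elementmapping>
</semanticmap>
```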
(The mapping of the same element to both an rdf:type and an rdf:property is because of the way that Schema.org relates entities to their containers.)
Clearly, there could be many ways in which to encode such a mapping, and this is just one idea.
Two questions would be:
Note that there is no suggestion to model the whole of an external schema in DITA or for enforcing its constraints. That goes beyond what the DITA standard should define. However, it might be helpful to use this kind of simple mapping system to make DITA content more interoperable and, as the Linked Data community would see it, more truly semantic.