
Subject: Brainstorming metadata

From: David Hollis <dhollis@aandoconsultancy.ltd.uk>
To: DITA TC <dita@lists.oasis-open.org>
Date: Fri, 24 Feb 2017 09:26:50 +0000

Apropos recent threads related to bookmap design and metadata standards, I thought I'd brainstorm some thoughts re. metadata. Metadata discussions appear in multiple contexts.

Fundamentally for DITA, there are two types of metadata:

1. Metadata related to content creation

2. Metadata related to content dissemination

Creation metadata examples:

Related to workflows
Related to products and versions
Related to personnel
Dates are important

Workflow metadata examples:

For terms: proposal, acceptance/rejection, reasons
For topics: creation, edit, review, revise
For translation: required, sent, returned, reviewed

Dissemination metadata examples:

Search engine optimisation
Visible tags for blog and wiki content management
Hidden tags for feature searches
Nothing related to personnel
Dates possibly less important

These are not complete lists, but are hopefully useful as starting points.

A processor should strip creation metadata and pass dissemination metadata through to the output. For privacy reasons, disseminated content should not include anything that could identify personnel. The DITA standard cannot impose this, but it should make the intention abundantly clear.
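As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes, purely for the sake of the example, that the prolog elements author and critdates count as creation metadata and that metadata/keywords counts as dissemination metadata. The sample topic, the CREATION_ELEMENTS set and the function name are all hypothetical, and no real DITA processor is claimed to work this way.

```python
import xml.etree.ElementTree as ET

# Assumption: these prolog children are classed as creation metadata.
CREATION_ELEMENTS = {"author", "critdates"}

# Hypothetical sample topic.
TOPIC = """<topic id="t1">
  <title>Fitting the widget</title>
  <prolog>
    <author>J. Smith</author>
    <critdates><created date="2017-02-24"/></critdates>
    <metadata><keywords><keyword>widget</keyword></keywords></metadata>
  </prolog>
  <body><p>Fit the widget.</p></body>
</topic>"""


def strip_creation_metadata(root):
    """Remove creation metadata from the prolog in place.

    Returns the names of the elements that were removed.
    """
    removed = []
    prolog = root.find("prolog")
    if prolog is None:
        return removed
    # Iterate over a copy, because we remove children while looping.
    for child in list(prolog):
        if child.tag in CREATION_ELEMENTS:
            prolog.remove(child)
            removed.append(child.tag)
    return removed


root = ET.fromstring(TOPIC)
removed = strip_creation_metadata(root)
output = ET.tostring(root, encoding="unicode")
```

After this runs, the author and dates are gone from the output while the keyword survives. Which elements belong in which class is exactly the kind of decision the standard could make clear.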

For metadata to be of any use, it has to be easily searchable. Ideally, all metadata would be held as XML element content, not as attribute values. If metadata were held in attributes, finding it would require an attribute search. Is it reasonable to expect an author to know, or remember, to use an attribute search? Does a blog, wiki or external search engine operate at the attribute level? I don't actually know the answer, but I rather doubt it. Admittedly, a processor might convert an attribute value into actual content. Still, to keep the model simple, it would be easiest to have all metadata at the XML content level.
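To make the difference concrete, here is a small Python sketch that compares a plain text search with an attribute search over the same fragment. The fragment and the two function names are hypothetical; the keyword and othermeta elements are used only as familiar examples of content-level and attribute-level metadata.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one value held as content, one held in an attribute.
PROLOG = """<prolog>
  <metadata>
    <keywords><keyword>widgetpro2</keyword></keywords>
    <othermeta name="colour" content="blk"/>
  </metadata>
</prolog>"""


def content_search(root, term):
    """Match only against element text, as a simple full-text search would."""
    return any(term in text for text in root.itertext())


def attribute_search(root, term):
    """Match only against attribute values."""
    return any(
        term in value
        for element in root.iter()
        for value in element.attrib.values()
    )


root = ET.fromstring(PROLOG)
found_keyword = content_search(root, "widgetpro2")      # held as content
found_colour = content_search(root, "blk")              # held in an attribute
found_colour_in_attrs = attribute_search(root, "blk")
```

The content search finds the keyword but misses the colour code, which only the attribute search can see.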

For the same reasons of searchability, creation metadata ideally should not use DITA reuse mechanisms. They simply get in the way. An element that references content via an attribute typically has no content of its own, and so a simple content level search will not find it. Instead of DITA reuse mechanisms, templates and CMS saved searches are very useful.

However, dissemination metadata can use any DITA reuse mechanism because the processor will resolve them, and they will not be present in the output.
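The effect of resolution on searchability can be sketched in Python as follows. This is a deliberately simplified stand-in for key resolution, not how any particular DITA processor implements it: the KEY_DEFINITIONS table, the sample fragment and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical key definitions, standing in for a map's key space.
KEY_DEFINITIONS = {"product-name": "Widget-Pro"}

SOURCE = (
    "<keywords>"
    '<keyword keyref="product-name"/>'
    "<keyword>widgetpro</keyword>"
    "</keywords>"
)


def resolve_keyrefs(root, keys):
    """Replace empty keyref elements with the referenced text, in place."""
    for element in root.iter():
        key = element.attrib.get("keyref")
        if key in keys and not (element.text or "").strip():
            element.text = keys[key]
            del element.attrib["keyref"]
    return root


def searchable_text(root):
    """Return the text a content-level search would see."""
    return " ".join(t.strip() for t in root.itertext() if t.strip())


root = ET.fromstring(SOURCE)
before = searchable_text(root)   # the referencing element contributes nothing
resolve_keyrefs(root, KEY_DEFINITIONS)
after = searchable_text(root)    # the resolved text is now ordinary content
```

Before resolution, only the literal keyword is visible to a content search. After resolution, the referenced text is visible too, which is why reuse is harmless in disseminated output but a hindrance in source files.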

To guarantee search results, metadata tags might use code words or 'non terms': terms that are deliberately misspelt or mis-punctuated. By definition, these terms should appear only in metadata, not in body content. Tagging a topic's metadata with a code word or non term, and then searching for that term, guarantees that the search results will include all tagged topics. There should be no confusion with other topics, because other topics should not have the non term in their body content. Likewise, the search does not rely on body content that might or might not include a particular term.
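A toy Python sketch of that guarantee follows. The three topics, the product name "Widget-Pro" and the non term "widgetpro" are all invented for the example, and the search is a naive substring match over keywords plus body text, standing in for whatever a CMS actually does.

```python
# Hypothetical topics: two are tagged with the non term, one is not.
topics = {
    "install": {"keywords": ["widgetpro"], "body": "How to install the Widget-Pro."},
    "colours": {"keywords": ["widgetpro"], "body": "Available finishes for this model."},
    "history": {"keywords": [], "body": "The Widget-Pro replaced an earlier model."},
}


def full_text_search(term):
    """Naive search over keywords and body text together."""
    term = term.lower()
    hits = []
    for name, topic in topics.items():
        haystack = " ".join(topic["keywords"] + [topic["body"]]).lower()
        if term in haystack:
            hits.append(name)
    return sorted(hits)


tagged = full_text_search("widgetpro")      # the non term
mentioned = full_text_search("Widget-Pro")  # the real product name
```

Searching for the non term returns exactly the two tagged topics. Searching for the real name misses the tagged "colours" topic, whose body never mentions the product, and picks up the untagged "history" topic, which does.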

Examples of code words or non terms:

Modified product names that might remove a hyphen, for example
Concatenation of family and product names, or product name and revision
Colours for a product feature: blk, blu, brn, wh, rd, gn, or, plp, ylw
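These patterns are simple enough for software to generate, which helps keep non terms consistent. The Python sketch below is one possible shape for such a helper. The product and family names are invented, and the colour table is only my reading of some of the abbreviations above, so treat every name in it as an assumption.

```python
def drop_hyphens(name):
    """Turn a product name into a non term by removing hyphens."""
    return name.replace("-", "").lower()


def concatenate(*parts):
    """Join family, product or revision names into a single non term."""
    return "".join(p.replace(" ", "").replace("-", "") for p in parts).lower()


# Assumed expansions of some of the colour codes listed above.
COLOUR_CODES = {
    "black": "blk",
    "blue": "blu",
    "brown": "brn",
    "white": "wh",
    "red": "rd",
    "green": "gn",
    "yellow": "ylw",
}

product_tag = drop_hyphens("Widget-Pro")            # hypothetical product name
family_tag = concatenate("Acme", "Widget-Pro")      # hypothetical family + product
revision_tag = concatenate("Widget-Pro", "R2")      # hypothetical product + revision
colour_tag = COLOUR_CODES["black"]
```

A template or CMS could call helpers like these when a topic is created, so that the non term lands in the metadata as real text.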

Non terms are recognisable and understandable, but are still 'wrong'. This could lead to a dictionary or glossary of metadata non terms, or at least a list of acceptable non terms for use as metadata tags. Note: it has to be left to the author, with software assistance, to place non terms in creation metadata as real text, and not rely on referenced reuse. Templates and CMS saved searches are very useful in this regard, and reduce reliance on authors who are bound to get it wrong on occasion.

Ideally, the DITA metadata model would have very generic roots to specialise into actual requirements. For example, the Dublin Core standard would be a dissemination metadata specialisation.

A generic metadata model might include the following:

Single term
Multiple delimited terms
Phrases
Sentences and paragraphs
Dates
One of many choices
Many from many choices
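To show how small such a generic model could be, here is a Python sketch with one type per kind of value. It is only an illustration of the idea: the class names are mine, phrases, sentences and paragraphs are collapsed into one free-text type for brevity, and nothing here corresponds to actual DITA elements. The translation states come from the workflow examples earlier in this message.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SingleTerm:
    value: str

    def __post_init__(self):
        if not self.value or any(c.isspace() for c in self.value):
            raise ValueError("a single term must be one non-empty token")


@dataclass(frozen=True)
class DelimitedTerms:
    values: Tuple[str, ...]

    @classmethod
    def parse(cls, text, delimiter=","):
        """Split delimited text into terms, dropping empty entries."""
        return cls(tuple(t.strip() for t in text.split(delimiter) if t.strip()))


@dataclass(frozen=True)
class FreeText:
    """Phrases, sentences and paragraphs."""
    value: str


@dataclass(frozen=True)
class DateValue:
    value: date


@dataclass(frozen=True)
class OneOf:
    choices: FrozenSet[str]
    selected: str

    def __post_init__(self):
        if self.selected not in self.choices:
            raise ValueError(f"{self.selected!r} is not an allowed choice")


@dataclass(frozen=True)
class ManyOf:
    choices: FrozenSet[str]
    selected: FrozenSet[str]

    def __post_init__(self):
        if not self.selected <= self.choices:
            raise ValueError("selection contains values that are not allowed")


# A 'specialisation' of OneOf for the translation workflow states.
TRANSLATION_STATES = frozenset({"required", "sent", "returned", "reviewed"})


def translation_status(selected):
    return OneOf(TRANSLATION_STATES, selected)


colours = DelimitedTerms.parse("blk, blu ,brn")
status = translation_status("sent")
```

A company-specific requirement then becomes a constrained instance of one of these generic types, as translation_status is of OneOf.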

It should be possible to specialise such a model into any specific metadata requirement. A company ought to generate its own metadata specialisations to meet its specific needs, and the DITA standard should somehow encourage this, with assistance from tools.

I am not convinced that the DITA standard should attempt to meet every single requirement for metadata. I think that is impracticable. However, I think it should highlight the differences between creation metadata and dissemination metadata.

The '6 million dollar question' then becomes: which metadata specialisations should be included in the out-of-the-box (OOB) DITA standard? Whilst useful in their own right, they should also serve as examples of what is achievable.

For instance, if the OOB DITA standard were to include a specific metadata model, then an author would probably try to fill as many of the fields as possible. This would be a complete waste of time if the disseminated content does not actually use that metadata model.

I hope this is helpful. Caveat: I don't have a particular axe to grind. This is simply brainstorming.

David Hollis