structure, semantics, and simplified topic types

dita message

        > ..., the
        > semantics of the existing concept, task, and reference topic
        > types can be applied more or less easily to business content.

Conversely, when the only tools you have are these three base topic types, everything looks like a concept, task, or reference. (Paraphrasing the saying usually attributed to Abraham Maslow, "When the only tool you have is a hammer, everything looks like a nail.")

Take a step back for a moment. Language is structured. If it were not, it could not "convey" meaning. A natural-language information parser (http://mlp-xml.sourceforge.net/) can process that inherent structure. [More common NLP systems only parse syntax and typically map compliant parse trees to RDBMS queries, and the results of the latter are rather confusedly thought of as the semantics of the former. They of course have their own 'semantic interpretation' in turn....]

Form and information are two aspects of the same thing. That is to say, constraints on combinability correlate with what makes sense to say about what. Language users bring their own associations to that information/structure. The ramifications of human associative memory, then, are another and much more unruly aspect of meaning. 'Meaning' in this vaguer sense piggy-backs on linguistic information.

Our XML tagging captures a small part of this. [Note that it can capture more; the MLP system cited above uses XML for data manipulation.] Our DITA markup, and XML/SGML generally, makes some limited aspects of linguistic information (the structure and semantics that are immanent in language) accessible to computers. Insofar as the tags have mnemonic names, they also make that structure more accessible and usable for humans, as a matter of human associative memory. The semantics that we capture with DITA element declarations is at a quite general level, and is distinct from the semantics of those mnemonic labels understood as English vocabulary.

To be more concrete, <task> (or <taskbody>) contains <steps> comprising a series of <step> elements. These structural constraints redundantly (but machine-readably) indicate structure that is already present in the untagged natural language content, structure which is one contributor to the overall semantic import of that content. That's the first sort of semantics. The element names "task", "step", etc. are meaningful to human users of DITA and provide guidance to them as they create content, tag existing content, and make decisions about reusing that content, processing it, etc. That's the second sort of semantics. DITA elements encode both of these kinds of semantics. Values of attributes like @platform encode that second kind of semantics only.
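To make the two kinds a little more tangible, here is a minimal Python sketch, assuming a made-up sample task (the content, the id, and the function names are mine, not anything from the DITA toolkit). The first function relies only on the structural constraint: it walks <steps> and rejects anything that is not a <step>, which is a very loose imitation of what the DTD enforces. The second simply reads the @platform labels, which say nothing about structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content; element names follow the DITA task model.
SAMPLE = """
<task id="replace-filter">
  <title>Replacing the filter</title>
  <taskbody>
    <steps>
      <step platform="linux"><cmd>Open the housing.</cmd></step>
      <step><cmd>Remove the old filter.</cmd></step>
      <step><cmd>Insert the new filter.</cmd></step>
    </steps>
  </taskbody>
</task>
"""


def step_commands(task_xml):
    """Return the <cmd> text of each <step>, in document order.

    Raises ValueError if <steps> is missing or contains anything other
    than <step>. This is only a rough stand-in for real DTD validation.
    """
    root = ET.fromstring(task_xml)
    steps = root.find("./taskbody/steps")
    if steps is None:
        raise ValueError("no <steps> inside <taskbody>")
    commands = []
    for child in steps:
        if child.tag != "step":
            raise ValueError(f"unexpected <{child.tag}> inside <steps>")
        commands.append(child.findtext("cmd", default="").strip())
    return commands


def platforms(task_xml):
    """Return the @platform value (or None) of every <step>."""
    root = ET.fromstring(task_xml)
    return [step.get("platform") for step in root.iter("step")]


if __name__ == "__main__":
    print(step_commands(SAMPLE))
    print(platforms(SAMPLE))
```

The sequence that step_commands() recovers is already there in the prose of the steps; the markup only makes it machine-readable. The @platform values, by contrast, are labels that a person decided to attach.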

We recognize the two kinds of semantics in the distinction between structural specializations and domain specializations. The second sort of semantics is especially relevant for usability by content creators.
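As a rough illustration of how that distinction surfaces in the markup itself: as I understand the DITA architectural specification, every element carries a defaulted @class attribute whose value begins with "-" when the element is declared in a structural module and with "+" when it is declared in a domain module. A small Python sketch follows; the function names are mine, and the sample values should be checked against the DTDs you actually use.

```python
def specialization_kind(class_value):
    """Classify a DITA @class value as 'structural' or 'domain'.

    Assumes the convention that structural values start with '-' and
    domain values start with '+'.
    """
    value = class_value.strip()
    if value.startswith("-"):
        return "structural"
    if value.startswith("+"):
        return "domain"
    raise ValueError(f"not a recognizable @class value: {class_value!r}")


def ancestry(class_value):
    """Return the element names in the specialization chain, base first."""
    tokens = class_value.split()[1:]  # drop the leading '-' or '+'
    return [token.split("/", 1)[1] for token in tokens]


if __name__ == "__main__":
    for value in ("- topic/ol task/steps ", "+ topic/ph hi-d/b "):
        print(specialization_kind(value), ancestry(value))
```

Note that the chain for <steps> runs back to <ol>: the task markup is a more specific label on the same ordered-list structure.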

        > From: Rob Hanna
        >
        > Topic alone contains no semantic markup at all.

Figuring out what this might mean is what started me on this discussion. I read this as a claim that <topic> has no markup of the first kind, that the elements and attributes in <topic> are used to encode only semantics of the second "metadata" kind. In this view, some structure is recognized to correlate with the information in language (by labeling it explicitly), and other labeled structure is not. For example, the "related-in-sequence" semantics of <ol> or the "related" semantics of <ul> is of too general a kind to have been recognized as an aspect of the meaning in any natural language content so tagged.

        On 8/24/09 11:44 AM, Rob Hanna wrote:
        > Eliot Kimber wrote:
        >> In traditional publishing content, such as trade books or novels or
        >> magazines, the distinction between "concept" and other stuff is not
        >> one that is generally recognized or useful.
        >
        > Traditional publishing content is not topic-based nor is it
        > semantically-structured.

All natural language content is semantically structured. And it talks about one "thing" at a time. Each such "thing" is the topic of the discourse at that point. So how can it not be organized into topics? The questions are: what are those topics (and where are the boundaries between them), how do their creators and users categorize them into topic types, what are their structures (internal components and constraints on the combination of same), and what labels for these components are mnemonically useful for the creators and managers of such content? And another question: are these topical units of content subject to reuse? Back to that in a moment.

Tim, you might say that all these topics can be appropriately tagged as <concept>, <task>, and <reference>, though I think that in the following you're not talking about the same sort of content as Eliot has been working with:

        > From: Tim Grantham:
        > Speaking from [my] own
        > experience authoring hundreds of documents of many different
        > types, including mainstream business document types, I have
        > yet to find one that could not be modelled semantically, at
        > high level, as one or more concept, task, or reference
        > topics.

In addition, the aggregation of topics into larger constructions is going to be more complex and subtle in things like trade books and novels. Bookmap has provision for front matter and back matter, but <part> doesn't get at the internal organizational structure that publications often have.

Back to the question of whether these topical units of content in publications are subject to reuse. Reuse is an issue in many classes of business documents. The answer so far is that although users of this content have not seen reuse as a desideratum, they are surprised and gratified (read: motivated to adopt DITA) when the nature and deployment of their content are actually investigated. Redundancy is reinterpreted as content sharing. Revision, localization and translation, and other kinds of variation have much the same complexion as reuse. Just look at http://safaribooksonline.com (formerly SafariU.com) for radical reuse of publication content.

But I'm jumping ahead here. Let's wait for the BusDoc team's proposal. I just wanted to urge a little more care in our conceptualization of structure and semantics.