Subject: structure, semantics, and simplified topic types
- From: "Bruce Nevin (bnevin)" <bnevin@cisco.com>
- To: "dita" <dita@lists.oasis-open.org>
- Date: Tue, 25 Aug 2009 17:26:59 -0400
This is probably premature given our necessary focus on
getting 1.2 out, so I'm fine with deferring any discussion
now.
> From: Tim Grantham
>
> ..., the semantics of the existing concept, task, and reference topic
> types can be applied more or less easily to business content.
Conversely, when the only tools you have are these three base topic types, then
everything looks like a concept, task, or reference. (Paraphrasing Abe Maslow's
"When the only tool you have is a hammer, everything looks like a nail.")
There's some obscurity and confusion in our talk about structure and semantics.
Take a step back for a moment. Language is structured. If it were not, it could
not "convey" meaning. A natural-language information parser
(http://mlp-xml.sourceforge.net/) can process that inherent structure. [More
common NLP systems only parse syntax and typically map compliant parse trees
to RDBMS queries, and the results of the latter are rather confusedly thought of
as the semantics of the former. They of course have their own 'semantic
interpretation' in turn....]
Form and information
are two aspects of the same thing. That is to say, constraints on combinability
correlate with what makes sense to say about what. Language users bring their own associations to
that information/structure. The ramifications of human associative memory, then, are another and much more unruly
aspect of meaning. 'Meaning' in this vaguer
sense piggy-backs on linguistic information.
Our XML tagging captures a small part of this. [Note that it can capture more;
the MLP system cited above uses XML for data manipulation.] Our DITA markup, and
XML/SGML generally, makes some limited aspects of linguistic information (the
structure and semantics that is immanent in language) accessible to computers.
Insofar as the tags have mnemonic names, they also make that structure more
accessible and usable for humans, as a matter of human associative memory. The
semantics that we capture with DITA element declarations is at a quite general
level, and distinct from the semantics of those mnemonic labels, understood as
English vocabulary.
To be more concrete, <task> (or <taskbody>) contains <steps> comprising a series
of <step> elements. These structural constraints redundantly (but
machine-readably) indicate structure that is already present in the untagged
natural language content, structure which is one contributor to the overall
semantic import of that content. That's the first sort of semantics. The element
names "task", "step", etc. are meaningful to human users of DITA and provide
guidance to them as they create content, tag existing content, and make
decisions about reusing that content, processing it, etc. That's the second sort
of semantics. DITA elements encode both of these kinds of semantics. Values of
attributes like @platform encode that second kind of semantics only.
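
As a rough illustration of the two kinds of semantics described above, here is a
minimal Python sketch. The sample task fragment (its id, titles, and step text)
and the function names step_commands and steps_for_platform are invented for
this example. The containment check is a deliberately simplified stand-in for
the real DITA content model, not a substitute for DTD or schema validation. The
first function relies only on structure (containment and order); the second
relies only on the label carried by @platform.

```python
import xml.etree.ElementTree as ET

# Hypothetical task fragment, invented for illustration only.
SAMPLE_TASK = """\
<task id="replace-filter">
  <title>Replace the filter</title>
  <taskbody>
    <steps>
      <step platform="linux"><cmd>Stop the service.</cmd></step>
      <step><cmd>Swap the filter.</cmd></step>
    </steps>
  </taskbody>
</task>
"""


def step_commands(task_xml):
    """Return the text of each <cmd> in document order.

    Raises ValueError if <steps> is missing or holds anything other
    than <step>. This is a simplified assumption about the content
    model, used here only to show a machine-checkable constraint.
    """
    root = ET.fromstring(task_xml)
    steps = root.find("./taskbody/steps")
    if steps is None:
        raise ValueError("no <steps> inside <taskbody>")
    commands = []
    for child in steps:
        if child.tag != "step":
            raise ValueError(f"unexpected <{child.tag}> inside <steps>")
        commands.append(child.findtext("cmd", default=""))
    return commands


def steps_for_platform(task_xml, platform):
    """Keep steps that are unmarked or whose @platform lists `platform`."""
    root = ET.fromstring(task_xml)
    kept = []
    for step in root.iter("step"):
        values = step.get("platform", "").split()
        if not values or platform in values:
            kept.append(step.findtext("cmd", default=""))
    return kept


if __name__ == "__main__":
    print(step_commands(SAMPLE_TASK))
    print(steps_for_platform(SAMPLE_TASK, "windows"))
```

Nothing in the sequence check depends on the elements being called "steps" and
"step"; renaming them would leave the structural constraint intact, while the
@platform filter works only because a person chose a value that means something
to other people.
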
We recognize the two kinds of semantics in the distinction between structural
specializations and domain specializations. The second sort of semantics is
especially relevant for usability by content creators.
> From: Rob Hanna
>
> Topic alone contains no semantic markup at all.
Figuring out what this might mean is what started me on this discussion. I read
this as a claim that <topic> has no markup of the first kind, that the elements
and attributes in <topic> are used to encode only semantics of the second
"metadata" kind. In this view, some structure is recognized to correlate with
the information in language (by labeling it explicitly), and other labeled
structure is not. For example, the "related-in-sequence" semantics of <ol> or
the "related" semantics of <ul> is of too general a kind to have been recognized
as an aspect of the meaning in any natural language content so tagged.
On 8/24/09 11:44 AM, Rob Hanna wrote:
> Eliot Kimber wrote:
>> In traditional publishing content, such as trade books or novels or
>> magazines, the distinction between "concept" and other stuff is not
>> one that is generally recognized or useful.
>
> Traditional publishing content is not topic-based nor is it
> semantically-structured.
All natural language content is semantically structured. And it talks about one
"thing" at a time. Each such "thing" is the topic of the discourse at that
point. So how can it not be organized into topics? The questions are: what are
those topics (and the boundaries between them), how do their creators and users
categorize them into topic types, what are their structures (internal components
and constraints on the combination of same), and what labels for these
components are mnemonically useful for the creators and managers of such
content? Another question is whether these topical units of content are subject
to reuse. Back to that in a moment.
Tim, you might say
that all these topics can be appropriately tagged as <concept>,
<task>, and <reference>, though I think that in the following you're
not talking about the same sort of content as Eliot has been working
with:
> From: Tim Grantham:
> Speaking from [my] own
> experience authoring hundreds of documents of many different
> types, including mainstream business document types, I have
> yet to find one that could not be modelled semantically, at
> high level, as one or more concept, task, or reference
> topics.
In addition, the
aggregation of topics into larger constructions is going to be more complex and
subtle in things like trade books and novels. Bookmap has provision for front
matter and back matter, but <part> doesn't get at the internal
organizational structure that publications often have.
Back to the question whether these topical units of content in publications are
subject to reuse. Reuse is an issue in many classes of business documents. The
answer so far is that although users of this content have not seen reuse as a
desideratum, they are surprised and gratified (read: motivated to adopt DITA)
when the nature and deployment of their content are actually investigated.
Redundancy is reinterpreted as content sharing. Revision, localization &
translation, and other variation have much the same complexion as reuse. Just
look at http://safaribooksonline.com (formerly SafariU.com) for radical reuse of
publication content.
But I'm jumping ahead here. Let's wait for the BusDoc
team's proposal. I just wanted to urge a little more care in our
conceptualization of structure and semantics.
/Bruce