Linguistics and DITA Business Docs

You have looked into some of the literature of narrative analysis and discourse analysis, and have been unable to find anything that directly applies to the "business documents" that you have examined.

You are looking at the literature of typography and book design to identify accepted concepts and terminology.

Both of these literature surveys seem to me to have the implicit purpose of enlisting some recognized authority to underwrite any proposal that we make for DITA-based markup of the structure of "narrative" as found in the business documents of interest.

You have asked for my input as a linguist. First, I have to concur in your assessment of the recent literature of discourse analysis. I am not able to find much that is germane to our purposes. For example, consider the list of "topics of interest to discourse analysts" found in http://en.wikipedia.org/wiki/Discourse_analysis:

* The various levels or dimensions of discourse, such as sounds (intonation, etc.), gestures, syntax, the lexicon, style, rhetoric, meanings, speech acts, moves, strategies, turns and other aspects of interaction
* Genres of discourse (various types of discourse in politics, the media, education, science, business, etc.)
* The relations between discourse and the emergence of sentence syntax
* The relations between text (discourse) and context
* The relations between discourse and power
* The relations between discourse and interaction
* The relations between discourse and cognition and memory

None of these connect with our interest in examining "narrative" texts of "business documents" and identifying those parts of them which are semantically distinct and structurally identifiable. (To say "semantically distinct and structurally identifiable" is pleonastic, if not redundant, BTW--form and information are inextricable.)

If you read that Wikipedia article, you will see that the above list applies to relatively recent ideas about discourse analysis--since the 1970s and 1980s. The earlier form of discourse analysis is what I am most familiar with, as developed by Zellig Harris and his students beginning in perhaps 1938. His most famous student, Noam Chomsky, confesses that he never really understood it--and that goes far to account for the neglect of this approach. Harris's work culminated in 1989 in a demonstration of the form of information in science and in 1991 in a theory of language and information (refs in that Wikipedia article).

The methodology is an extension of distributional analysis in linguistics. If two items (morphemes, words, phrases) each occur in the same context, there is to that degree a semantic and structural equivalence between them. This is the methodological basis for establishing grammatical categories for sentences in a language (verb, noun, etc.), and for establishing more fine-grained semantic subcategories. But beyond the grammar of sentences, local equivalence classes can be set up within a discourse, applying only within that discourse or within a set of like discourses. With the aid of paraphrastic transformations the successive periods of a discourse can be regularized so that the members of each equivalence class fall within columns of a table (binary array). Beyond that, discourses of a constrained subject matter have the same equivalence classes; changes of topic ("changing the subject") within a discourse correspond with changes of vocabulary and changes of the equivalence classes in which they fall; related subject-matter domains intersect in these particulars; terms are borrowed from one domain to another, necessarily with changes of the contexts in which they occur and hence of their equivalence classes and their meanings; the language of a restricted domain has a distinct sublanguage grammar and lexicon; "general usage" may be an envelope of sublanguages; and so on.
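To make the first step concrete, here is a toy sketch in Python of grouping words into equivalence classes by shared context. It rests on simplifying assumptions of mine and is not Harris's procedure: a "context" is taken to be only the immediately preceding and following word, two words are grouped as soon as they share one identical context, and no paraphrastic transformations are applied. The function names and the sample sentences are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample "discourse"; any list of sentences would do.
SAMPLE = [
    "the invoice was approved",
    "the contract was approved",
    "the invoice was rejected",
]


def contexts(sentences):
    """Map each word to the set of (left, right) neighbor pairs it occurs in."""
    ctx = defaultdict(set)
    for sentence in sentences:
        # Sentence-boundary markers give first and last words a context too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            ctx[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return ctx


def equivalence_classes(sentences):
    """Group words that share at least one identical context (transitively)."""
    ctx = contexts(sentences)

    # Invert the mapping: which words occur in each context?
    by_context = defaultdict(set)
    for word, pairs in ctx.items():
        for pair in pairs:
            by_context[pair].add(word)

    # Union-find, so that sharing a context merges two words' classes.
    parent = {word: word for word in ctx}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for words in by_context.values():
        ordered = sorted(words)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    classes = defaultdict(set)
    for word in ctx:
        classes[find(word)].add(word)

    # Report only classes with more than one member, in a stable order.
    return sorted(sorted(c) for c in classes.values() if len(c) > 1)


if __name__ == "__main__":
    for members in equivalence_classes(SAMPLE):
        print(members)
```

On the sample sentences this puts "invoice" with "contract" and "approved" with "rejected", which are the columns one would expect in the table for that tiny discourse. A real analysis would need graded rather than all-or-nothing similarity, and longer contexts.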

So deep a command of the semantics of discourse we do not require, and anyway it depends upon a degree of analysis that is impracticable for us or for users of DITA.

Nonetheless, the methods of linguistic analysis are relevant, I think, for sharpening and extending what we call "content analysis" in the development of a data model. I suppose what I need to do is look at the examples that you surveyed, with particular attention to questions and problems that you identified. For example, in our meeting on July 5, someone said that there is no name for a paragraph preceding a subsection or a paragraph following a subsection, and that this lack of settled nomenclature makes such paragraphs difficult to discuss. I would need to see examples, because I don't know what you mean.

The DITA base types are so loose and unrestricted that it seems possible to shoehorn almost anything into them. We should carefully examine the opposite path: when we try that, where are the gaps and where are the pinches? Do we want more semantic specificity?

Another potential resource has come to my attention. The WikiSlice project is looking to put Wikipedia articles into DITA topics.
http://wiki.laptop.org/go/Projects/Wikislice
It might be useful to see what they're running into with the "narrative" text of Wikipedia articles.
