dita-busdocs message



Subject: RE: [dita-busdocs] Groups - Classification of Business Documents (Classification_20080114_r2.ppt) uploaded


Dear Michael,

Thanks for circulating the draft classification analysis.

I have some comments which I have set out by reference to some of your
slides:

Slide Introduction 1
-------------------
Agree


Slide Introduction 2
--------------------
Agree that structural characteristics are important. There are many
different types of business documents and we will not be able to classify
them all by semantics.


Slide Potential Structural Characteristic to consider when classifying
----------------------------------------------------------------------
Agree with "Is it narrative?" but I am not sure about many of the other
characteristics. The presence of tables or graphics is relevant but most
documents can be assumed to include them. Table frequency seems hardly
relevant. One complex table opens up all issues about tables.

I offer for consideration a model that may be used to define the structure of
narrative documents.

1. Is it a narrative document?
* Is it mainly written description or analysis or similar?
* There may be some cases where documents mix narrative and structured
data, e.g. form content or database records.

2. What are the high level document semantics and structures?
Is the document just a sequence of chapters, sections, paragraphs or
something else?

For example, does the document have major semantic divisions such as (for a
business report):
  Executive summary
  Main content
  Schedules/Attachments

Note: The main content may have a structure similar to the next example.

Or, say, in Ann's use case model:
  Scenario
  Goals
  Problem
  Content types
  Solution

Note: It would be possible to represent this example in two ways. One is to
use distinct elements for each section of the document. The other is to use
a generic element (say topic) for each subject and add metadata if it is
necessary to further define its semantics. Working out which approach to
take is one of the goals of document analysis.
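
To make the two approaches concrete, here is a minimal Python sketch that
builds the same use case both ways. The element names, the "usecase" wrapper
and the "role" attribute are my assumptions for illustration only; they are
not taken from DITA or from Ann's model.

```python
import xml.etree.ElementTree as ET

# Section names from the use case example above.
SECTIONS = ["Scenario", "Goals", "Problem", "Content types", "Solution"]


def slug(name):
    """Turn a section name into a token usable as an element or attribute value."""
    return name.lower().replace(" ", "-")


def distinct_elements():
    """Approach 1: one distinctly named element per section (names are hypothetical)."""
    root = ET.Element("usecase")
    for name in SECTIONS:
        ET.SubElement(root, slug(name))
    return root


def generic_elements():
    """Approach 2: a generic <topic> per section, with semantics held in metadata.

    The 'role' attribute is a hypothetical stand-in for whatever metadata
    mechanism is eventually chosen.
    """
    root = ET.Element("usecase")
    for name in SECTIONS:
        topic = ET.SubElement(root, "topic", {"role": slug(name)})
        ET.SubElement(topic, "title").text = name
    return root


print(ET.tostring(distinct_elements(), encoding="unicode"))
print(ET.tostring(generic_elements(), encoding="unicode"))
```

In the first form a schema can enforce the order and presence of each section;
in the second, that checking has to be done on the metadata values instead.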

3. Is it necessary to model these high level semantics (section 2) in the
markup?
* Are the content structure patterns different in those objects?
  For example, should the executive summary be confined to just paragraphs
and not permit topic or section structures or is it just another container
similar to the main part of the document?
  Another example, do the schedules have a distinct pattern of components
and need to be distinguished so they can be given appropriate numbering? 
* Do we need to process the document structure components as distinct
objects for information retrieval or rendering purposes?
  For example, will someone want to extract the scenarios from a collection
of use cases and index them to the documents?
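
As a rough illustration of that retrieval case, the following Python sketch
extracts scenarios from a small collection and indexes them to their
documents. The document ids, the sample content and the "scenario" element
name are all invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical use-case documents keyed by an invented document id.
DOCS = {
    "uc-001": "<usecase><scenario>Quarterly report</scenario>"
              "<goals>Reuse</goals></usecase>",
    "uc-002": "<usecase><scenario>Policy manual</scenario>"
              "<scenario>Quarterly report</scenario></usecase>",
}


def index_scenarios(docs):
    """Map each scenario text to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for scenario in ET.fromstring(xml_text).iter("scenario"):
            index[scenario.text.strip()].add(doc_id)
    return {text: sorted(ids) for text, ids in index.items()}


print(index_scenarios(DOCS))
```

The lookup is trivial only because the scenario is a distinct object in the
markup. If it were an untyped section, the extraction would have to rely on
titles or position.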

Note: See the discussion on content structure patterns at the end of this
message.

4. What are the main content structure patterns used in the document?
* What are the main document subdivisions (chapters, parts, sections etc)?
* Are these components arranged hierarchically?
* How is the document hierarchy defined?
* Is it based on recursive use of the same pattern or are the content
structure patterns different at each level?
  Thus, is it necessary to define distinct elements at each level or can one
element be used recursively?
* If it is necessary to limit the depth of hierarchy, it may be necessary to
use distinctly named elements at each level.
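
The depth point can be sketched in Python. A DTD content model cannot cap how
deeply a recursive element nests, so with a single recursive element the limit
has to be checked outside the grammar, roughly as below. The "report", "topic"
and "p" element names and the limit of 3 are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def max_topic_depth(element, depth=0):
    """Return the deepest nesting level of <topic> elements at or below element."""
    if element.tag == "topic":
        depth += 1
    return max([depth] + [max_topic_depth(child, depth) for child in element])


# A three-level fragment in the shape of a 4 / 4.1 / 4.1.1 outline.
doc = ET.fromstring(
    "<report>"
    "<topic><title>Proposed solution</title>"
    "<topic><title>Content management</title>"
    "<topic><title>Document storage</title><p/></topic>"
    "</topic></topic>"
    "</report>"
)

MAX_DEPTH = 3  # assumed limit, for illustration only
print(max_topic_depth(doc), max_topic_depth(doc) <= MAX_DEPTH)
```

With distinctly named elements at each level, the same limit falls out of the
content models themselves and no separate check is needed.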

Please refer to the later discussion on content structure patterns.

5. Can the document make use of shared components (transclusions) and if so,
at what levels in the hierarchy can this occur?

6. What components break the basic hierarchical pattern of narrative content
in the document and what patterns are needed to model those components?
[I call these "inclusions".]
* Common inclusions are tables and graphics.
* Other common inclusions are quotations, annotations, examples etc.
* Sometimes these types of inclusions are central to the document structure
such as Tasks in DITA. However, more often in business documents they are
subsidiary to the main document hierarchy.
* At what levels in the document hierarchy do inclusions occur?
* Are these inclusions part of a paragraph or alternatives to a paragraph?

7. What semantics are used within paragraphs?
Are there particular words or phrases that have distinct semantics that must
be captured for rendering or other purposes?
The classes of inline semantics may be different for different types of
documents, say between scientific or technical documents and legal
documents.
 
8. Are tables used and what is their complexity?
* Do tables include complex content within cells? If so what is the pattern?
Some table cells may include content conforming to a topic pattern, others
may have basic paragraph text while others may only include simple numeric
data.
* What control over table layout do writers require? Can all tables be
rendered using a single layout or must users be able to specify the use of
different layouts for particular tables?

9. Specific requirements for graphics/images


Other slides:
-------------
* Most Significant Characteristic
*  Qualifying Narrative Density

I submit that density and length of the document in particular are not
useful tools in document analysis. You can have a very short document with a
complex structure. I believe it is the content structure patterns that are
the key.



PM comment on content structure patterns
----------------------------------------
I believe it would be useful to introduce a concept of "content structure
patterns".

A DITA topic is a content structure pattern, albeit a very loose one. A DITA
p is another content structure pattern. Because these patterns are rather
loose, it is hard to apply them in analysis.

I suggest that the content of most narrative documents can be analysed using
some simple patterns. While we are working with DITA, I think that using
DITA as a pattern template will initially make life difficult for us.
However, we will have to eventually map everything to DITA patterns and work
out where there are problems.

I consider that DITA has a very artificial model because of the use of the
map file and topics to create a document.

If we analyse many business documents, we will find content groups arranged
hierarchically. It may look like this example:

1. Executive summary
[One or more paragraphs]

2. Problems with customer's systems
2.1 Content management
[One or more paragraphs]

2.2 Web publishing
[One or more paragraphs]

.....

3. Customer's objectives
[One or more paragraphs]

4. Proposed solution
4.1 Content management
4.1.1 Document storage
[One or more paragraphs]

4.1.2 Content authoring
[One or more paragraphs]

.....

4.2 Web publishing
.....

From this example, it could be argued there is a recursive hierarchy of
information groups (let's use the DITA concept "topic").
In DITA, these topics are not strictly recursive because they are linked
into a map that defines the hierarchy.

The DITA model also uses section, which can occur at only one level within a
topic. Whether something should be marked as a topic or a section is a difficult
choice. In DITA thinking, it is whether the unit of content is a standalone
concept. More generally, for many writers, it probably revolves around
whether the writer intends the object to be included in a numbering sequence
or a contents listing.

Without DITA, I would suggest that the pattern in the example above is a
recursive hierarchy of topics. Theoretically, the patterns to model such a
hierarchy could be (using DTD syntax):
Option A   (title, (paragraph+ | topic+))   [Either paragraphs or topics but
no mixing]
Option B   (title, (paragraph*, topic+))    [Optional paragraphs preceding the
topics]
Option C   (title, (paragraph | topic)*)    [Arbitrarily mixed paragraphs and
topics]

The earlier example fits Option A.
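
As a quick way to test a sample structure against these options, here is a
minimal Python sketch. It encodes the children of a topic as a string of
letters and expresses each content model as a regular expression. The
single-letter encoding and the function name are my own devices for
illustration, and the title is assumed to be present.

```python
import re

# Each child is encoded as one letter: "p" for paragraph, "t" for topic.
# The title is assumed to be present and is not encoded.
MODELS = {
    "A": re.compile(r"p+|t+"),  # (title, (paragraph+ | topic+))
    "B": re.compile(r"p*t+"),   # (title, (paragraph*, topic+))
    "C": re.compile(r"[pt]*"),  # (title, (paragraph | topic)*)
}


def matching_options(children):
    """Return the option labels whose content model accepts the child sequence."""
    return [name for name, rx in MODELS.items() if rx.fullmatch(children)]


print(matching_options("ppp"))  # paragraphs only         -> ['A', 'C']
print(matching_options("tt"))   # subtopics only          -> ['A', 'B', 'C']
print(matching_options("ppt"))  # paragraphs, then topics -> ['B', 'C']
print(matching_options("ptp"))  # arbitrarily mixed       -> ['C']
```

Note that Option B, as written, requires at least one subtopic, so a leaf
topic holding only paragraphs matches A and C but not B.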

It is possible to broaden this pattern and make the title optional. If so,
it becomes possible to model the clause and subclause pattern found in many
legal documents. In clause/subclause patterns, the subclause may or may not
have a title. 

If we take DITA as a given and immutable set of patterns, we will probably
ignore this because DITA has its own answers to these things. DITA permits
Option B out of the box within a topic. Within a map, it just permits
topicheads/topicrefs or links to topics. I am not sure if Options A or C are
possible by specialization of topic. I suspect not. Whether there is value
in constraining a pattern to option A is an issue. Option C probably should
never be permitted, even though examples exist in real life.

DITA does not permit an optional title on topic so clause/subclause patterns
in legal documents are a problem for DITA. 

If we put DITA to one side, the listed patterns do provide a way to analyse
content structure patterns in narrative documents. It is then possible to
look at any given semantic structure and ask:
* Can it be modelled using one of the known patterns or must a new pattern
be devised for it? This will determine whether a generic container such as
topic can be used.
* Even if it can be modelled using a generic pattern such as topic, do the
semantics or processing (rendering) requirements dictate that a distinct
element should be created or can its semantics be adequately defined by
metadata? 

Another aspect of content structure patterns is the paragraph model. DITA
has a model for p.
What objects are permitted within the paragraph:
* lists?
* tables?
* graphics?

What objects are permitted as siblings of the paragraph?

Some of the answers to these questions are highly subjective and can be
controversial. It is very difficult to discern a hierarchy for a paragraph
and these objects looking at printed or word processing documents. It has to
be inferred from punctuation and context.
 

Regards

Peter Meyer




> -----Original Message-----
> From: mboses@invisionresearch.com 
> [mailto:mboses@invisionresearch.com] 
> Sent: Tuesday, 22 January 2008 3:55 AM
> To: dita-busdocs@lists.oasis-open.org
> Subject: [dita-busdocs] Groups - Classification of Business 
> Documents (Classification_20080114_r2.ppt) uploaded
> 
> Please review Rev 2 of this document which was discussed in the
> subcommittee meeting today. Comments are requested. Please 
> either email
> comment documents to Michael (if you make comments in the PowerPoint
> itself) or upload comments using the comments feature of the 
> OASIS site.
> 
>  -- Michael Boses
> 
> The document revision named Classification of Business Documents
> (Classification_20080114_r2.ppt) has been submitted by 
> Michael Boses to the
> DITA Enterprise Business Documents SC document repository.  
> This document
> is revision #3 of Classification_20080114.ppt.
> 
> Document Description:
> This document contains the initial ideas of the subcommittee 
> concerning the
> characteristics that will be used to classify narrative 
> business documents,
> both to define the scope of the subcommittee and the methods 
> that will be
> used to analyze documents.
> 
> View Document Details:
> http://www.oasis-open.org/apps/org/workgroup/dita-busdocs/document.php?document_id=26874
> 
> Download Document:  
> http://www.oasis-open.org/apps/org/workgroup/dita-busdocs/download.php/26874/Classification_20080114_r2.ppt
> 
> Revision:
> This document is revision #3 of Classification_20080114.ppt.  
> The document
> details page referenced above will show the complete revision history.
> 
> 
> PLEASE NOTE:  If the above links do not work for you, your 
> email application
> may be breaking the link into two pieces.  You may be able to 
> copy and paste
> the entire link address into the address field of your web browser.
> 
> -OASIS Open Administration


