Subject: RE: [dita-busdocs] Groups - Classification of Business Documents (Classification_20080114_r2.ppt) uploaded
Dear Michael,

Thanks for circulating the draft classification analysis. I have some
comments, which I have set out by reference to some of your slides:

Slide Introduction 1
--------------------
Agree.

Slide Introduction 2
--------------------
Agree that structural characteristics are important. There are many
different types of business documents and we will not be able to
classify them all by semantics.

Slide Potential Structural Characteristic to consider when classifying
----------------------------------------------------------------------
Agree with "Is it narrative?" but I am not sure about many of the other
characteristics. The presence of tables or graphics is relevant, but
most documents can be assumed to include them. Table frequency seems
hardly relevant: one complex table opens up all the issues about tables.

I offer for consideration a model that may be used to define the
structure of narrative documents.

1. Is it a narrative document?
   * Is it mainly written description or analysis or similar?
   * There may be some cases where documents mix narrative and
     structured data, e.g. form content or database records.

2. What are the high level document semantics and structures?
   Is the document just a sequence of chapters, sections and
   paragraphs, or something else? For example, does the document have
   major semantic divisions such as (for a business report):

       Executive summary
       Main content
       Schedules/Attachments

   Note: The main content may have a structure similar to the next
   example.

   Or, in say, Ann's Use case model:

       Scenario
       Goals
       Problem
       Content types
       Solution

   Note: It would be possible to represent this example in two ways.
   One is to use distinct elements for each section of the document.
   The other is to use a generic element (say topic) for each subject
   and add metadata if it is necessary to further define its
   semantics. Working out which approach to take is one of the goals
   of document analysis.

3. Is it necessary to model these high level semantics (item 2 above)
   in the markup?
   * Are the content structure patterns different in those objects?
     For example, should the executive summary be confined to just
     paragraphs and not permit topic or section structures, or is it
     just another container similar to the main part of the document?
     Another example: do the schedules have a distinct pattern of
     components, and do they need to be distinguished so they can be
     given appropriate numbering?
   * Do we need to process the document structure components as
     distinct objects for information retrieval or rendering purposes?
     For example, will someone want to extract the scenarios from a
     collection of use cases and index them to the documents?

   Note: See the discussion on content structure patterns at the end
   of this message.

4. What are the main content structure patterns used in the document?
   * What are the main document subdivisions (chapters, parts,
     sections etc.)?
   * Are these components arranged hierarchically?
   * How is the document hierarchy defined?
   * Is it based on recursive use of the same pattern, or are the
     content structure patterns different at each level? Thus, is it
     necessary to define distinct elements at each level or can one
     element be used recursively?
   * If it is necessary to limit the depth of hierarchy, it may be
     necessary to use distinctly named elements at each level.

   Please refer to the later discussion on content structure patterns.

5. Can the document make use of shared components (transclusions) and
   if so, at what levels in the hierarchy can this occur?

6. What components break the basic hierarchical pattern of narrative
   content in the document and what patterns are needed to model those
   components? [I call these "inclusions".]
   * Common inclusions are tables and graphics.
   * Other common inclusions are quotations, annotations, examples etc.
   * Sometimes these types of inclusions are central to the document
     structure, such as Tasks in DITA.
     However, more often in business documents they are subsidiary to
     the main document hierarchy.
   * At what levels in the document hierarchy do inclusions occur?
   * Are these inclusions part of a paragraph or alternatives to a
     paragraph?

7. What semantics are used within paragraphs?
   Are there particular words or phrases that have distinct semantics
   that must be captured for rendering or other purposes? The classes
   of inline semantics may differ between types of documents, say
   between scientific or technical documents and legal documents.

8. Are tables used and what is their complexity?
   * Do tables include complex content within cells? If so, what is
     the pattern? Some table cells may include content conforming to a
     topic pattern, others may have basic paragraph text, while others
     may only include simple numeric data.
   * What control over table layout do writers require? Can all tables
     be rendered using a single layout, or must users be able to
     specify the use of different layouts for particular tables?

9. Specific requirements for graphics/images

Other slides:
-------------
* Most Significant Characteristic
* Qualifying Narrative Density

I submit that density, and length of the document in particular, are
not useful tools in document analysis. You can have a very short
document with a complex structure. I believe it is the content
structure patterns that are the key.

PM comment on content structure patterns
----------------------------------------
I believe it would be useful to introduce a concept of "content
structure patterns". A DITA topic is a content structure pattern,
albeit a very loose one. A DITA p is another content structure
pattern. Because these patterns are rather loose, it is hard to apply
them in analysis.

I suggest that the content of most narrative documents can be analysed
using some simple patterns. While we are working with DITA, I think
that using DITA as a pattern template will initially make life
difficult for us.
However, we will eventually have to map everything to DITA patterns
and work out where there are problems. I consider that DITA has a very
artificial model because of the use of the map file and topics to
create a document.

If we analyse many business documents, we will find content groups
arranged hierarchically. It may look like this example:

1. Executive summary
   [One or more paragraphs]
2. Problems with customer's systems
   2.1 Content management
       [One or more paragraphs]
   2.2 Web publishing
       [One or more paragraphs]
   .....
3. Customer's objectives
   [One or more paragraphs]
4. Proposed solution
   4.1 Content management
       4.1.1 Document storage
             [One or more paragraphs]
       4.1.2 Content authoring
             [One or more paragraphs]
       .....
   4.2 Web publishing
       .....

From this example, it could be argued there is a recursive hierarchy
of information groups (let's use the DITA concept "topic"). In DITA,
these topics are not strictly recursive because they are linked into a
map that defines the hierarchy. The DITA model also uses section,
which can occur at one level within a topic.

Whether something should be marked as a topic or a section is a
difficult choice. In DITA thinking, it depends on whether the unit of
content is a standalone concept. More generally, for many writers, it
probably revolves around whether the writer intends the object to be
included in a numbering sequence or a contents listing.

Without DITA, I would suggest that the pattern in the example above is
a recursive hierarchy of topics. Theoretically, the patterns to model
such a hierarchy could be (using DTD syntax):

Option A  (title, (paragraph+ | topic+))
          [Either paragraphs or topics but no mixing]
Option B  (title, (paragraph*, topic+))
          [Optional paragraphs preceding the topics]
Option C  (title, (paragraph | topic)*)
          [Arbitrarily mixed paragraphs and topics]

The earlier example fits Option A.

It is possible to broaden this pattern and make the title optional.
If so, it becomes possible to model the clause and subclause pattern
found in many legal documents. In clause/subclause patterns, the
subclause may or may not have a title.

If we take DITA as a given and immutable set of patterns, we will
probably ignore this because DITA has its own answers to these things.
DITA permits Option B out of the box within a topic. Within a map, it
just permits topicheads/topicrefs or links to topics. I am not sure if
Options A or C are possible by specialization of topic. I suspect not.
Whether there is value in constraining a pattern to Option A is an
issue. Option C probably should never be permitted, even though
examples exist in real life. DITA does not permit an optional title on
topic, so clause/subclause patterns in legal documents are a problem
for DITA.

If we put DITA to one side, the listed patterns do provide a way to
analyse content structure patterns in narrative documents. It is then
possible to look at any given semantic structure and ask:

* Can it be modelled using one of the known patterns or must a new
  pattern be devised for it? This will determine whether a generic
  container such as topic can be used.
* Even if it can be modelled using a generic pattern such as topic, do
  the semantics or processing (rendering) requirements dictate that a
  distinct element should be created, or can its semantics be
  adequately defined by metadata?

Another aspect of content structure patterns is the paragraph model.
DITA has a model for p. What objects are permitted within the
paragraph:

* lists?
* tables?
* graphics?

What objects are permitted as siblings of the paragraph?

Some of the answers to these questions are highly subjective and can
be controversial. It is very difficult to discern a hierarchy for a
paragraph and these objects when looking at printed or word processing
documents. It has to be inferred from punctuation and context.
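
As a rough illustration only: the three content models above (Options
A, B and C) are regular expressions over child element types, so a
candidate document hierarchy can be tested against them mechanically.
The Python sketch below is an assumption-laden toy, not part of DITA or
of any DTD toolchain. The Topic and Paragraph classes, the function
names and the title_optional flag are all hypothetical names invented
for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set, Union


@dataclass
class Paragraph:
    """Hypothetical stand-in for a paragraph element."""
    text: str


@dataclass
class Topic:
    """Hypothetical stand-in for a generic topic container."""
    title: Optional[str]
    children: List[Union[Paragraph, "Topic"]] = field(default_factory=list)


# The content models from the message, minus the leading title, written
# as regular expressions over a string of child kinds:
# "p" = paragraph, "t" = topic.
MODELS = {
    "A": re.compile(r"p+|t+"),   # (paragraph+ | topic+)
    "B": re.compile(r"p*t+"),    # (paragraph*, topic+)
    "C": re.compile(r"[pt]*"),   # (paragraph | topic)*
}


def child_signature(topic: Topic) -> str:
    """Encode the direct children of a topic as a string of p/t letters."""
    return "".join(
        "p" if isinstance(child, Paragraph) else "t"
        for child in topic.children
    )


def matching_options(topic: Topic, title_optional: bool = False) -> Set[str]:
    """Return the options whose content model this one topic satisfies."""
    if topic.title is None and not title_optional:
        return set()  # all three models require a title as written
    signature = child_signature(topic)
    return {name for name, rx in MODELS.items() if rx.fullmatch(signature)}


def options_for_tree(topic: Topic, title_optional: bool = False) -> Set[str]:
    """Return the options satisfied by this topic and every nested topic."""
    result = matching_options(topic, title_optional)
    for child in topic.children:
        if isinstance(child, Topic):
            result &= options_for_tree(child, title_optional)
    return result


if __name__ == "__main__":
    # A cut-down version of the proposal outline used in the message.
    proposal = Topic("Proposal", [
        Topic("Executive summary", [Paragraph("...")]),
        Topic("Proposed solution", [
            Topic("Content management", [
                Topic("Document storage", [Paragraph("...")]),
                Topic("Content authoring", [Paragraph("...")]),
            ]),
        ]),
    ])
    print(sorted(options_for_tree(proposal)))  # prints ['A', 'C']
```

In this sketch the outline satisfies Option A (and, trivially, the
looser Option C) at every level. Option B as literally written requires
at least one nested topic, so the paragraph-only leaf topics do not
match it. Passing title_optional=True models the broadened
clause/subclause pattern in which a title may be absent.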

Regards
Peter Meyer

> -----Original Message-----
> From: mboses@invisionresearch.com
> [mailto:mboses@invisionresearch.com]
> Sent: Tuesday, 22 January 2008 3:55 AM
> To: dita-busdocs@lists.oasis-open.org
> Subject: [dita-busdocs] Groups - Classification of Business
> Documents (Classification_20080114_r2.ppt) uploaded
>
> Please review Rev 2 of this document which was discussed in the
> subcommittee meeting today. Comments are requested. Please either
> email comment documents to Michael (if you make comments in the
> PowerPoint itself) or upload comments using the comments feature of
> the OASIS site.
>
> -- Michael Boses
>
> The document revision named Classification of Business Documents
> (Classification_20080114_r2.ppt) has been submitted by Michael Boses
> to the DITA Enterprise Business Documents SC document repository.
> This document is revision #3 of Classification_20080114.ppt.
>
> Document Description:
> This document contains the initial ideas of the subcommittee
> concerning the characteristics that will be used to classify
> narrative business documents, both to define the scope of the
> subcommittee and the methods that will be used to analyze documents.
>
> View Document Details:
> http://www.oasis-open.org/apps/org/workgroup/dita-busdocs/document.php?document_id=26874
>
> Download Document:
> http://www.oasis-open.org/apps/org/workgroup/dita-busdocs/download.php/26874/Classification_20080114_r2.ppt
>
> Revision:
> This document is revision #3 of Classification_20080114.ppt. The
> document details page referenced above will show the complete
> revision history.
>
> PLEASE NOTE: If the above links do not work for you, your email
> application may be breaking the link into two pieces. You may be
> able to copy and paste the entire link address into the address
> field of your web browser.
>
> -OASIS Open Administration