office-comment message

Subject: referencing entities in office.dtd

From: Gustav Vella <gvella@spinfo.uni-koeln.de>
To: office-comment@lists.oasis-open.org
Date: Fri, 3 Oct 2003 18:06:38 +0200 (CEST)

Dear All,

Since last year we have been working on an application for layout-driven
document structure extraction. We start from PostScript input, which we
import using a much-refined version of the method described by Craig G.
Nevill-Manning, Todd Reed and Ian Witten in Software Practice and
Experience, 28(5), 481-491. We already get very consistent, parsable
output (a machine-readable document image) from nearly any source, and
we plan to further improve our extraction stage by tapping Adobe Font
Metric files and writing a custom PostScript Printer Definition.

After preclassifying the output, we parse it using a library of layout
grammars for various logical elements. In other words, we are mapping
layout instances to XML elements. We are trying to define a set of
common document classes.
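
As a rough illustration of what mapping layout instances to XML elements
can mean, a minimal Python sketch might look like the following. It is
not the implementation described above: the block fields, the font-size
thresholds and the element names (document, title, heading, paragraph)
are assumptions made up for the example, and they are not taken from
the Office.DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical preclassified layout blocks: text plus a few layout features.
blocks = [
    {"text": "Layout-Driven Extraction", "font_size": 18.0, "bold": True},
    {"text": "We set out from PostScript input.", "font_size": 10.0, "bold": False},
    {"text": "1 Introduction", "font_size": 14.0, "bold": True},
]


def classify(block):
    """Map one layout instance to a (hypothetical) logical element name."""
    if block["bold"] and block["font_size"] >= 16:
        return "title"
    if block["bold"] and block["font_size"] >= 12:
        return "heading"
    return "paragraph"


def to_xml(layout_blocks):
    """Build an XML tree with one child element per layout block."""
    root = ET.Element("document")
    for block in layout_blocks:
        child = ET.SubElement(root, classify(block))
        child.text = block["text"]
    return root


doc = to_xml(blocks)
xml_text = ET.tostring(doc, encoding="unicode")
```

A real layout grammar would of course look at far more than font size
and weight; the point is only that each recognised layout instance ends
up as one named element, and a document class then constrains which
elements may appear and in what order.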

Since we plan to make our code available on CPAN (the Comprehensive Perl
Archive Network), interoperability is an issue for us. We therefore took
a look at the Office.DTD, which is incredibly detailed and complete, but
just as overwhelming.

I would like to define the document class set as a DTD by directly
referencing entities in the Office.DTD. This seems trivial, but since I
have hardly ever worked with DTDs (I have some experience with schemas),
it is taking more time than I actually have available.

Has anybody on this list done something like this already? Does it
really make sense to pursue a strictly modular approach? Or should I
just pick out the entities I need and make my own Doc_Class.DTD?

I would be grateful for any hints.
Best regards from Cologne,

-- 
Gustav Vella

Institut für Sprachliche Informationsverarbeitung, Universität zu Köln
(Department of Linguistic Data Processing, University of Cologne, Germany)