Subject: referencing entities in office.dtd
Dear All,

Since last year we have been working on an application for layout-driven document structure extraction. We start from PostScript input, which we import using a much refined version of the method described by Craig G. Nevill-Manning, Todd Reed and Ian Witten in Software Practice and Experience, 28(5), 481-491. We already get very consistent, parsable output (a machine-readable document image) from nearly any source, and we plan to further improve our extraction stage by tapping Adobe Font Metric files and writing a custom PostScript Printer Definition.

After preclassifying the output we parse it using a library of layout grammars for various logical elements. In other words, we are mapping layout instances to XML elements.

We are trying to define a set of common document classes. Since we plan to make our code available on CPAN (the Comprehensive Perl Archive Network), interoperability is an issue for us. We therefore took a look at the Office.DTD, which is incredibly detailed and complete but just as overwhelming.

I would like to define the document class set as a DTD by directly referencing entities in the Office.DTD. This seems trivial, but since I have hardly ever worked with DTDs (I have some experience with schemas), it is taking more time than I actually have available.

Has anybody on this list done something like this already? Does it really make sense to pursue a strictly modular approach? Or should I just pick out the entities I need and make my own Doc_Class.DTD?

I'll be glad for any hints.

Best regards from Cologne,
--
Gustav Vella
Institut für Sprachliche Informationsverarbeitung, Universität zu Köln
(Department of Linguistic Data Processing, University of Cologne, Germany)
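
P.S. To make the question more concrete, here is a minimal sketch of the modular variant I have in mind. It relies only on the standard XML 1.0 mechanism of declaring the external DTD as a parameter entity and then referencing it. The file name doc_class.dtd, the relative path to office.dtd, the element names (letter, sender, recipient, body) and the parameter entity name %inline-text; are all placeholders of my own for illustration. I have not checked which entity names the Office.DTD actually declares.

```xml
<!-- doc_class.dtd: hypothetical driver DTD for one document class -->

<!-- Declare the external DTD as a parameter entity, then include it.
     The system identifier is an assumed relative path. -->
<!ENTITY % office-dtd SYSTEM "office.dtd">
%office-dtd;

<!-- A document class built from our own elements.
     All element names here are illustrative only. -->
<!ELEMENT letter (sender, recipient, body)>

<!-- %inline-text; stands in for whatever content-model entity the
     included DTD really provides; the name is a placeholder. -->
<!ELEMENT sender    (%inline-text;)*>
<!ELEMENT recipient (%inline-text;)*>
<!ELEMENT body      (%inline-text;)*>
```

If I read the XML 1.0 specification correctly, the first declaration of an entity is the binding one, so any entity I wanted to override would have to be declared before the %office-dtd; reference rather than after it.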