OASIS Mailing List Archives

opendocument-users message


Subject: Re: [opendocument-users] extracting the text from an opendocumentfile


As the developer and maintainer of a scripting-language ODF interface since the beginning of ODF's history (http://search.cpan.org/dist/OpenOffice-OODoc), I have never had to worry about the successive changes in the spec. ODF 1.1 was essentially an extension of ODF 1.0. ODF 1.2 will bring a more significant leap in some areas, but probably nothing that affects text extraction.

Since the whole text body belongs to the content.xml part and all the basic text containers are <text:p>, <text:h>, and <text:span> elements, the flat text can be extracted in the same way whatever the ODF version. If we need to extract text from page headers and footers, the relevant ODF part is styles.xml; there, too, all the pieces of text are embedded in <text:p>, <text:h>, and <text:span> elements that belong to the master page header and footer elements, whatever the ODF version.
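
To make the above concrete, here is a minimal Java sketch, not a reference implementation. It assumes the .odt file is an ordinary zip package and uses only the JDK (java.util.zip and the built-in DOM parser). The class and method names (OdtTextExtractor, paragraphs, fromPackage) are mine, chosen for illustration. It returns one string per <text:p> or <text:h>, and also expands <text:s>, <text:tab> and <text:line-break>, which carry whitespace as elements rather than as character data.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, written for this example only.
final class OdtTextExtractor {
    // The text namespace URI is the same in ODF 1.0, 1.1 and 1.2.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Reads one XML part ("content.xml" or "styles.xml") from the package. */
    static List<String> fromPackage(File odt, String part) throws Exception {
        try (ZipFile zip = new ZipFile(odt)) {
            ZipEntry entry = zip.getEntry(part);
            if (entry == null) {
                throw new IOException(part + " not found in " + odt);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return paragraphs(in);
            }
        }
    }

    /** Returns the flat text of every text:p and text:h, in document order. */
    static List<String> paragraphs(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // match on namespace URI, not on prefix
        Document doc = factory.newDocumentBuilder().parse(xml);
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (isText(node, "p") || isText(node, "h")) {
            StringBuilder sb = new StringBuilder();
            flatten(node, sb);
            out.add(sb.toString());
            return; // paragraphs nested in frames or notes stay inside their parent
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    private static void flatten(Node node, StringBuilder sb) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(c.getNodeValue());
            } else if (isText(c, "s")) {
                // text:s stands for text:c spaces (one if the attribute is absent)
                String count = ((Element) c).getAttributeNS(TEXT_NS, "c");
                int n = count.isEmpty() ? 1 : Integer.parseInt(count);
                for (int i = 0; i < n; i++) sb.append(' ');
            } else if (isText(c, "tab")) {
                sb.append('\t');
            } else if (isText(c, "line-break")) {
                sb.append('\n');
            } else if (type == Node.ELEMENT_NODE) {
                flatten(c, sb); // text:span and any other inline wrapper
            }
        }
    }

    private static boolean isText(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && TEXT_NS.equals(n.getNamespaceURI())
                && localName.equals(n.getLocalName());
    }
}
```

Calling fromPackage(new File("report.odt"), "content.xml") would give the body text, and the same call with "styles.xml" would give the header and footer text ("report.odt" is only a placeholder file name). Note that styles.xml is scanned as a whole here; an application that needs to know which master page a header belongs to has to look at the enclosing elements.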

Of course, things become much more complicated as soon as applications need the structure and the layout, and not only the text content. However, as far as I know, there is no major compatibility break between ODF versions. There was a break between the original (now deprecated) OpenOffice.org 1.0 format and the OASIS/ISO ODF spec, but the later evolutions were essentially additive.

For Java applications, you should have a look at http://odftoolkit.org/projects/odfdom/pages/Home

Jean-Marie Gouarné
http://lpod-project.org/?language=en
http://search.cpan.org/dist/OpenOffice-OODoc
http://jean.marie.gouarne.online.fr

----- Original Message -----
From: "Chris Puttick" <chris.puttick@thehumanjourney.net>
To: "Vincenzo Morgante" <enzom83@yahoo.it>
Cc: opendocument-users@lists.oasis-open.org
Sent: Saturday 9 January 2010 08:49:43 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [opendocument-users] extracting the text from an opendocument file

Hi Vincenzo

AFAIK all text is contained within the <office:text> element in the "content.xml" file within the ODF container, regardless of ODF version, and if you parse everything between > and < within the <office:text> element I think you'd get all displayed content. You might also want the document's metadata, depending on your need, which I think should all be in meta.xml.
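
For the metadata part, a rough Java sketch follows. It assumes meta.xml has the usual layout, with the individual properties (dc:title, dc:creator, meta:creation-date and so on) as children of a single <office:meta> element. The class and method names (OdfMetaReader, read, fromPackage) are made up for this example, and only the JDK is used.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, written for this example only.
final class OdfMetaReader {
    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";

    /** Opens the package and reads its meta.xml entry. */
    static Map<String, String> fromPackage(File odf) throws Exception {
        try (ZipFile zip = new ZipFile(odf)) {
            ZipEntry entry = zip.getEntry("meta.xml");
            if (entry == null) {
                throw new IOException("meta.xml not found in " + odf);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return read(in);
            }
        }
    }

    /** Maps each property's local name ("title", "creator", ...) to its text. */
    static Map<String, String> read(InputStream metaXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(metaXml);

        Map<String, String> out = new LinkedHashMap<>();
        NodeList metas = doc.getElementsByTagNameNS(OFFICE_NS, "meta");
        if (metas.getLength() == 0) {
            return out; // no office:meta element, so nothing to report
        }
        for (Node c = metas.item(0).getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String value = c.getTextContent().trim();
            if (!value.isEmpty()) {
                // Simplification: repeated elements (e.g. meta:user-defined)
                // keep only their first occurrence, and prefixes are dropped.
                out.putIfAbsent(c.getLocalName(), value);
            }
        }
        return out;
    }
}
```

Properties stored only as attributes (meta:document-statistic, for instance) have no text content and are therefore skipped by this sketch.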

HTH

Chris

----- "Vincenzo Morgante" <enzom83@yahoo.it> wrote:

> Hi,
> I'm developing a Java class that has to be able to read an
> OpenDocument text file (with the .odt extension) in order to extract all
> the text contained in it.
> Some years ago I wrote a VB.NET library following the OpenDocument 1.0
> specification. This library still works fine, but I'd like to be
> sure that there are no substantial changes in the newer versions of the
> standard (1.1 and 1.2).
> Could I follow the old OpenDocument 1.0 specification without any
> problems, or would it be advisable to follow the newer specifications?
> In other words, if I follow the old OpenDocument 1.0 specification,
> could I run into problems reading a file in one of the newer versions
> with regard to text extraction?
> 
> Thanks a lot!
> 
> Vincenzo
