Subject: Content duplication and ODF related RDF vocabulary
Svante Schubert wrote:
> Elias Torres wrote:
>> Patrick Durusau <patrick@durusau.net> wrote on 12/13/2006 03:43:55 PM:
>> [...]
>>> Another use case is that I am processing all the metadata files only
>>> and not the content.xml files.
>>>
>>> Trivial example: All patient records are stored as ODF and the metadata
>>> for those files should include snomed:birthdate and snomed:age metadata
>>> statements, plus I assume snomed:insurer (I assume there is in the
>>> snomed vocabulary. Sorry John, could not resist.)
>>>
>>> In other words, if this data is actually missing from the file, the
>>> metadata properties don't either. I don't have to process content.xml
>>> to discover these errors.
>>>
>>> Depending on what metadata you store in the metadata files, like
>>> Bruce's bibliographic data, you could extract all that data in RDF
>>> without ever touching the content.xml files.
>>>
>>> Simply a question of how much overhead you think you will incur in
>>> processing a set of documents. Doing one to ten documents is probably
>>> trivial with either solution. Doing 100,000 documents or more, well, I
>>> think there would be performance differences.
>>>
>>> Granted that you and Bruce are arguing that people can choose one or
>>> the other in terms of representation. On the other hand, I don't see any
>>> tangible benefit to the choice. If we can indeed do with one what can be
>>> done with the other, my instincts say go with the one that we know is
>>> likely to scale.
>>>
>>
>> [...]
>> RDF by nature deals very well with specifying metadata externally from
>> the content, so technically I can't argue with an external only approach.
>> However, content-duplication is something very important that
>> Svante,Barnd, John and others have expressed concerns. I'm not sure who
>> else, but at the moment you are the only one stating is not a problem.
>>
>> Svante wants to avoid content duplication but I believe he is not
>> necessarily for RDFa, so I'll look forward to see how he solves the
>> problem in meta.xml of duplicating content.
>>
>>
> I believe everyone would like to avoid duplication, as it is
> equivalent to the risk of inconsistency.
> Although Elias might say - as so often - that as the application will
> handle the data and not a human will edit it by himself, the risk of
> inconsistency between the duplicated data of content and metadata is
> about zero, we have to be aware that there is a risk, as soon we have
> duplication.
> On the other hand as soon we are starting to deal with metadata, there
> is always a risk that RDF subject and RDF object are no longer
> consistent. For example the name of the person in the content might be
> changed and no longer belong to the vcard data in the metadata.
>

Concerning inconsistency: sometimes parts of the content are equivalent to parts of the metadata, that is, the same string is used in both. We have the following design options:

1. Store the string in the metadata and reference it from the content.
2. Store the string in the content and reference it from the metadata.
3. Store the string in both the content and the metadata (content duplication).
4. Store the string in the content and embed parts of the metadata there as well (the RDFa approach).

Here are some constraints I heard recently:

1. Some people would like ODF applications that are not metadata-aware to still be able to show such content without parsing the metadata (dropping option 1).
2. Metadata tools should be able to work with the metadata without having to parse the content (dropping option 2).
3. Avoid content duplication (dropping option 3).
4. Avoid mixing content and metadata, so that both can be changed more independently (dropping option 4).

At first sight this seems unsolvable, but let us analyze request #3. Is content duplication evil per se?
If we save our work from the hard disk to a USB stick, we have made a backup, which is data duplication as well, and it is even recommended. Of course, nobody is working on the backup, and that might be the difference, but who is working on the metadata and the content at the same time? A different example: if an airline sells flights over the Internet, the data from its database is duplicated for multiple users in their browsers, and certain mechanisms (e.g. two-phase commit) allow it to be stored back consistently. We can therefore say that data duplication is not evil per se.

We should treat the metadata as the model, which has to be kept consistent with the working part in the content. This includes scenarios such as filling content that is used as a template from its metadata during loading, as well as referring to remote metadata (an external DB).

We might go on and even say that data duplication gives us some further security. When the content has been changed (by hand or via DOM) and has accidentally become inconsistent with the metadata, the equivalent string can be validated by comparing it with its representation in the metadata. For example, imagine that only the name of an account owner is shared between content and metadata, while further details about the account are given in the metadata. If the owner name has been changed in the content (e.g. via DOM), but the account details in the metadata have not, this can be detected by comparing the strings, and a warning can be given to the user. The common scenario will be that the metadata visible in the content is only a subset of the whole data to be kept consistent, as there will usually be more hidden metadata about the content.

Therefore, although in general I am against data duplication, in our case I would even recommend it as a validation possibility (where the content and metadata strings are equal).
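The validation idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not part of any ODF specification: the function name `find_inconsistencies`, the sample markup, and the plain dictionary standing in for the metadata side are all hypothetical. It assumes the duplicated string in the content carries an xml:id and that the metadata tool can supply the expected value for that id.

```python
import xml.etree.ElementTree as ET

# Expanded attribute name that ElementTree uses for xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_inconsistencies(content_xml, expected):
    """Compare duplicated strings in the content with their metadata values.

    `expected` maps an xml:id to the string held in the metadata
    (a hypothetical stand-in for a real metadata lookup).
    Returns a list of (xml_id, content_text, metadata_text) mismatches.
    """
    root = ET.fromstring(content_xml)
    # Collect the text of every element that carries an xml:id.
    text_by_id = {
        el.get(XML_ID): "".join(el.itertext()).strip()
        for el in root.iter()
        if el.get(XML_ID) is not None
    }
    mismatches = []
    for xml_id, metadata_text in expected.items():
        content_text = text_by_id.get(xml_id)  # None if the id is missing
        if content_text != metadata_text:
            mismatches.append((xml_id, content_text, metadata_text))
    return mismatches


# Made-up content fragment: the owner name was edited in the content only.
content = '<doc><p>Account owner: <span xml:id="owner">Jane Roe</span></p></doc>'
metadata = {"owner": "John Doe"}

for xml_id, in_content, in_metadata in find_inconsistencies(content, metadata):
    print(f"warning: '{xml_id}' is {in_content!r} in the content "
          f"but {in_metadata!r} in the metadata")
```

The check only reports the mismatch; which side should win is left to the user, in line with the warning described above.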
Taking separated content and metadata as the design, I would like to give a rough, high-level draft proposal for the linking.

As stated, we keep the content and the metadata completely independent, and now let's say we do all the linking between content and metadata in one special mapping file. We might even use our own metadata for this mapping. Using an RDF metafile with an ODF vocabulary could give us the possibility to describe types of relations between arbitrary data in the package and metadata (usually RDF files, though we should try to cover even more).

The mapping in RDF should in general be possible, as all data in the content is marked by an xml:id and can be expressed as a URI reference pointing to the file in the ODF package and, in the case of a subset, to its xml:id. The RDF metadata is qualified by URI anyway and might have an rdf:ID as well.

By defining our own ODF vocabulary we might give some semantics to the relation between resources of the ODF package and metadata. We might define new ODF predicates such as odf:resource, a general description saying 'is part of ODF package', which could be divided into two subclasses, odf:equal (when content and metadata are the same) and odf:unequal. Further subclasses of odf:unequal, to ease validation of metadata in the content and to describe types, might be possible (odf:schema). If the content is not equal to a metadata field but generated from it, it could be marked as such (odf:generated), for example when creating long and short citations. The generated content might of course have subfields that are equal to metadata.

These ODF predicates would be of enormous help for a default ODF application plug-in handling metadata, especially when no plug-in for certain metadata is installed. In case the metadata describes various parts of the content (e.g. marking them as important), various xml:id values could be collected into a group in this file.

[Note: This file might be unique per ODF package, to ease searching for metadata. If documents are embedded these files will be merged.]
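To make the mapping idea a little more concrete, here is a minimal Python sketch of what such links could look like as plain subject/predicate/object triples. Everything in it is an assumption for illustration: the namespace URI, the file names under meta/, the xml:id values, and the helper names are invented, and only the predicate names odf:equal and odf:generated come from the draft above.

```python
# Hypothetical namespace for the proposed ODF vocabulary (not a real URI).
ODF = "urn:example:odf-vocabulary#"


def package_ref(path, xml_id=None):
    """Build a URI reference to a file in the package, or to one xml:id in it."""
    return f"{path}#{xml_id}" if xml_id else path


# Each triple links a resource in the package to a metadata resource.
# All paths and ids below are made up for the example.
mapping = [
    (package_ref("content.xml", "owner-name"), ODF + "equal",
     "meta/account.rdf#owner"),
    (package_ref("content.xml", "cite-1"), ODF + "generated",
     "meta/biblio.rdf#ref-1"),
    (package_ref("styles.xml", "header-1"), ODF + "equal",
     "meta/account.rdf#owner"),
]


def links_to_validate(triples):
    """Return the (content, metadata) pairs whose strings must be identical."""
    return [(s, o) for s, p, o in triples if p == ODF + "equal"]


def links_for_file(triples, path):
    """Return all links whose subject lives in the given package file."""
    return [t for t in triples if t[0].split("#", 1)[0] == path]


for subject, obj in links_to_validate(mapping):
    print(f"{subject} must match {obj}")
```

A generic plug-in could use only the predicate to decide what to do: compare strings for odf:equal links and leave odf:generated content alone, without understanding the metadata vocabulary itself.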
Such an xmeta.xml file would make it possible to relate metadata directly to any file in the ODF package. This is promising, as there are customers who want to add metadata to header and footer lines, which are stored in styles.xml. Images, embedded scripts, every file imaginable might be related directly to metadata, and even schemas for validating metadata in the content could be referenced.

This is a rough draft of one design possibility; detailed examples will follow if there are not too many overlooked concerns.

Regards,
Svante