Subject: RE: [chairs] need your comments on DocMgmt system requirements
Make a couple of changes in a Word document and store it in CVS (yeah, binary is probably how you'd do it) - it stores the entire file again, rather than the changes - that's what I'm calling inefficient, meaning "using more storage than is necessary to store the changes". I didn't say "bad". I didn't say "unusable". I said "inefficient", and I cordially disagree with your statement that "It is not.".
I have used CVS for over 10 years, and it's a useful place to store source code. It's a lot less useful when storing non-text files, however.
One of the features of CVS and ancillary programs that I use frequently is a display of the differences between two versions of a file. I don't get that facility when CVS is storing Word documents - all I can do is retrieve the two Word documents and look at them (anyone got a good Diff for Word?). That's a big loss, especially in this environment, where we have multiple authors. To make matters worse, it encourages use of the "track changes" feature of Word, and that produces much larger Word documents...
So what I am asking is: is there a system which will give us this valuable feature for Word documents? (Ideally, also for PDF files) Something that will allow us to see things like "this paragraph was added by Mr Slowsteady on 13 July, and modified by Ms Quicksmart on 14 August". If the answer to that is an expensive document management system, then let's consider it. If it can work on top of CVS so we can use native CVS facilities for text and html files, then that's a bonus.
Tony Rogers
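A minimal sketch of the paragraph-level comparison asked for above, assuming the text of each Word version has already been extracted into a list of paragraph strings (the extraction step is not shown, and the sample paragraphs are made up). It uses Python's standard difflib module. It does not recover who made a change or when; that would need metadata from the version history.

```python
import difflib

def paragraph_changes(old_paragraphs, new_paragraphs):
    """Return (action, old_paragraphs, new_paragraphs) tuples for changed spans."""
    matcher = difflib.SequenceMatcher(a=old_paragraphs, b=new_paragraphs, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged paragraphs are not reported
        changes.append((tag, old_paragraphs[i1:i2], new_paragraphs[j1:j2]))
    return changes

# Hypothetical sample data standing in for two extracted document versions.
version_1 = ["Scope of the specification.", "Terminology.", "Conformance."]
version_2 = [
    "Scope of the specification.",
    "Terminology and notation.",
    "Conformance.",
    "Acknowledgements.",
]

for action, old, new in paragraph_changes(version_1, version_2):
    print(action, old, new)
```

Comparing whole paragraphs rather than lines keeps the report close to how authors think about a document, at the cost of flagging an entire paragraph when a single word changes.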
-----Original Message-----
From: Matthew MacKenzie [mailto:mattm@adobe.com]
Sent: Thu 19-Feb-04 7:34
To: Christopher B Ferris
Cc: Rogers, Tony; karl.best@oasis-open.org; Chairs OASIS
Subject: Re: [chairs] need your comments on DocMgmt system requirements
cvs -z9 add -kb mydoc.doc
You need to mark the document as "binary", and to not expand keywords. The -z flag tells the client the level of compression to use. I've been using CVS almost daily for 5 years, and there are several binary files in there (jars, zips, docs, pdfs, ps, exe, gz, ...).
Is CVS inefficient in storing and versioning MS Word, or other binary documents? No. It is not. Does CVS integrate with MS Word to make the cvs diff command and conflict resolution work? No. If we want that, OASIS will probably want to pony up big bucks for a high end content management system.
On Feb 18, 2004, at 4:23 PM, Christopher B Ferris wrote:
Right, but you can store word docs in CVS... it's just inefficient. As for HTML/XML, it works just fine.
Cheers,
Christopher Ferris
STSM, Emerging e-business Industry Architecture
email: chrisfer@us.ibm.com
blog: http://webpages.charter.net/chrisfer/blog.html
phone: +1 508 377 9295
"Rogers, Tony" <Tony.Rogers@ca.com> wrote on 02/18/2004 03:13:55 PM:
In my experience, CVS doesn't handle MS Word documents well. It is designed for plain-text source code, and MS Word's file format doesn't allow it to produce an economical diff between one version and the next. This means that it wastes considerable space when versioning Word. I cannot comment on its ability to version html, but I suspect it would do much better on that. Perhaps we should all be using TeX, because that can be versioned more readily (ah, that was a joke...)
Is there a tool that would be able to version MS Word more effectively? I certainly don't know.
Does that mean we shouldn't use Word? I hope not - our TC has found Word's change tracking rather useful when working collaboratively.
Tony Rogers
tony.rogers@ca.com
co-chair UDDI TC
-----Original Message-----
From: Karl F. Best [mailto:karl.best@oasis-open.org]
Sent: Thu 19-Feb-04 2:04
To: Norman Walsh
Cc: Chairs OASIS; Jeff Lomas
Subject: Re: [chairs] need your comments on DocMgmt system requirements
Norman Walsh wrote:
/ "Karl F. Best" <karl.best@oasis-open.org> was heard to say:
| I've put together a draft functional requirements document for this
| doc mgmt system and would like to get your feedback. It is very
| important that we have the requirements correct and complete before we
| start development of the project -- many of you are developers so I'm
| sure that you understand the importance of this.
High level comments:
- I don't think these requirements adequately address the distinction between a development system (where TCs actively revise documents, schemas, etc.) and a publication system (where TCs post working drafts, standards, and other "finished" work products).
Is the proposal to develop one or the other, or both? If it's one or the other, then I think some of these requirements are completely inappropriate. If it's both, I think it might be useful to specify them separately. (And whether you imagine having resources to do them in sequence, or at the same time?)
I've previously thought of having a two-phase system, the first of which would provide a "sandbox" for the TC members to collaborate in developing a document. Then once the doc reached a certain stage it would then go into a more controlled environment with e.g. versioning and edited only by the TC. I've gotten the impression that most TCs would only use the second phase, but I could be wrong.
Chairs: would you prefer having both of these phases built into the doc mgmt system (open collaboration, followed by more rigorous control)? Or would you only use the second?
- There are several places where the requirements seem to be self-contradictory.
Specifics? This is obviously a draft so needs polishing, so suggestions are welcome.
- I think meeting all of the requirements listed below will be a significant challenge. A more detailed roadmap, showing staged progress with realistic time estimates would be very helpful.
Yeah. That's the next step. But right now I'm just gathering requirements. I can't very well write a development schedule until I know what it is that we're trying to build.
I'd also like suggestions on which parts of this are most important. I'm debating whether we should try a phased development approach (i.e. provide base functionality now then add more functionality over time). Looking through the requirements that I have now, though, I'm not sure which ones we could put off until later.
Chairs: suggestions please.
- A number of the features that you describe would seem to be at least partially addressed by open source efforts like G-Forge (an open source version of SourceForge). Are you considering a system like that, or are you expecting to "roll your own" from scratch?
I'm intending for us to build on top of an existing system. That's why I said "probably CVS". We'd be silly to build something from scratch when the engine already exists. We'll build some sort of customized web interface on top of the engine. Once we have the requirements we'll know what it is that we need to build. I'd also like suggestions for the engine; is CVS the way to go, or do people recommend something else?
OASIS DocMgmt Functional Requirements (17 February 2004)
General Description: A repository providing storage/management of files created by TCs, SCs, and other OASIS groups
Technical committees need to be able to store and manage a collection of resources. Principal among these resources are documents, but it's reasonable to consider other, related resources as well, including issue lists, archives, news items, and syndicated content.
The doc mgmt system would store any type of file. Not just specs, but also the other doc types you mention.
Would some of these stored objects be links and not files?
o Probably based on CVS
The requirements for a "development tree" are likely to be somewhat different than the requirements for a "publishing tree". In particular, I would expect published standards to be more-or-less immutable, to have persistent URIs, etc. In a development tree, those constraints might be quite stifling.
CVS supports a development system very well. It's not immediately clear to me if it supports a publication system equally well.
I'm certainly not a CVS expert, though I'm aware that it was built for development rather than documents. So it may not be ideal for what we want.
Does anyone have suggestions for a better engine, better suited for doc development and publishing, upon which to build our system?
o A separate area in the repository for each TC/SC/group; both default and definable hierarchy within each TC area
Can you elaborate on what you mean by "both default and definable"? What do you have in mind for "default"?
When we create a new TC we would define hierarchy branches for such things as e.g. "drafts", "minutes", "contributions" etc. (TBD). Then the TC chair could define additional branches as required. We'd want to keep the hierarchy as flat as possible to keep the URLs short, and we'd want some consistency, but I want to give the TCs some control over their space.
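The default-plus-definable hierarchy described above could be sketched as follows. The folder names "drafts", "minutes", and "contributions" are the examples given in the thread and are still TBD; the function name, the "example-tc" area name, and the use of a plain filesystem directory in place of a real repository are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

# Example default branches from the thread; the final list is still TBD.
DEFAULT_FOLDERS = ("drafts", "minutes", "contributions")

def create_tc_area(repository_root, tc_name, extra_folders=()):
    """Create a TC area with the default folders plus any chair-defined ones."""
    tc_area = Path(repository_root) / tc_name
    for folder in (*DEFAULT_FOLDERS, *extra_folders):
        # A single level under the TC area keeps the hierarchy flat and URLs short.
        (tc_area / folder).mkdir(parents=True, exist_ok=True)
    return tc_area

# Hypothetical usage in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    area = create_tc_area(root, "example-tc", extra_folders=["schemas"])
    print(sorted(p.name for p in area.iterdir()))
```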
o All documents are permanently archived (only Admin has delete rights)
In CVS terms, you can delete a document, but you can always recover it. In a development tree, it's not uncommon to reorganize some code or a document and want to remove modules from the current "head" of the development tree. This goes back to my comment before that the requirements for publication and development are somewhat different.
Maybe this is where the "sandbox" (above) comes in. I don't see the need of permanently archiving early drafts, but once a doc is checked into the permanent repository it should be permanent.
o All documents are publicly viewable, downloadable
o Repository has a web interface for uploading and tree browsing, searching, and retrieval
  + Support for all major browsers
  + Listing of single files includes filename, title, description, date, creator, and language; listing of packages includes the list of single files in the package
  + Search by filename, title, date, creator, and language; and full-text search of description and contents.
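A rough sketch of the listing fields and field-based search named in the requirement above. The field names come from the requirement; the class name, the function signature, and the sample records are hypothetical, and full-text search of file contents is left out because it depends on the file type.

```python
import datetime
from dataclasses import dataclass

@dataclass
class DocumentRecord:
    """Listing fields named in the requirements for a single file."""
    filename: str
    title: str
    description: str
    date: datetime.date
    creator: str
    language: str

def search(records, text=None, **field_filters):
    """Filter by exact field values and, optionally, by text in the description."""
    results = []
    for record in records:
        if any(getattr(record, name) != wanted for name, wanted in field_filters.items()):
            continue
        if text is not None and text.lower() not in record.description.lower():
            continue
        results.append(record)
    return results

# Hypothetical records for illustration only.
records = [
    DocumentRecord("example-spec-wd-01.doc", "Example Spec", "First working draft",
                   datetime.date(2004, 2, 17), "A. Author", "en"),
    DocumentRecord("example-minutes.html", "Minutes", "Minutes of the February call",
                   datetime.date(2004, 2, 18), "B. Secretary", "en"),
]
print([r.filename for r in search(records, creator="A. Author")])
print([r.filename for r in search(records, text="minutes")])
```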
Does it have other interfaces? Are you describing a front-end for CVS here, or something else? Does it support Web-DAV?
I would expect that most people would want to use a web interface, but I suppose that power users may want to deal more directly with the engine. But there's also certain safeguards (permissions, restrictions on naming, etc.) that may require that we use an interface. I don't know yet; this may depend on the engine.
What are the benefits of Web-DAV? (I'm not an expert on this.)
I think it would make sense to address searching as its own top-level item. In particular, the description above suggests that every item will have a set of metadata that can be searched. Where/when is this metadata created? Can I add my own? Is it expressed in an open format, an XML vocabulary or RDF or a topic map, or is it proprietary? How does this metadata evolve as documents change in CVS?
I see the metadata as comprised of the fields listed above. TBD. I don't know yet how this would be expressed because we haven't selected an engine yet.
How does this matter? Yes, we should use XML on principle, but I don't see it as a requirement.
As for searching the content, that's clearly going to depend on the type of content. What types will the system support?
Obviously not all content will be searchable. If somebody uploads a blob there's not much we'll be able to do with it besides just store it.
We will store whatever types of files the TCs need to store.
Persistent URLs
o At file creation the document is assigned a URL according to the OASIS file naming scheme. The URL will always resolve to the latest version of the document, regardless of the document's (versioned) filename; a URL will identify a specification throughout its entire lifetime from working draft to OASIS Standard. Previous versions of the document will be accessible via a variant of the URL containing the version number.
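One way to picture the persistent URL requirement above is a small lookup table from a stable path to the stored, versioned filenames. This is only a sketch: the class name, the "/vN" form of the versioned URL, and the sample paths and filenames are assumptions, and the OASIS file naming scheme itself is not reproduced here.

```python
class PersistentUrlRegistry:
    """Maps a stable document path to its stored, versioned filenames."""

    def __init__(self):
        # persistent path -> list of stored filenames, oldest first
        self._versions = {}

    def check_in(self, persistent_path, stored_filename):
        """Record a new version of the document behind a persistent path."""
        self._versions.setdefault(persistent_path, []).append(stored_filename)

    def resolve(self, requested_path):
        """Resolve 'path' to the latest version, or 'path/vN' to version N (1-based)."""
        base, _, suffix = requested_path.rpartition("/v")
        if base in self._versions and suffix.isdecimal():
            history = self._versions[base]
            index = int(suffix) - 1
            if 0 <= index < len(history):
                return history[index]
            raise KeyError(requested_path)
        return self._versions[requested_path][-1]

# Hypothetical usage.
registry = PersistentUrlRegistry()
registry.check_in("example-tc/example-spec", "example-spec-wd-01.doc")
registry.check_in("example-tc/example-spec", "example-spec-wd-02.doc")
print(registry.resolve("example-tc/example-spec"))
print(registry.resolve("example-tc/example-spec/v1"))
```

The stable path never changes as the document moves from working draft to standard; only the list behind it grows.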
This is fine for storing standards but it's in conflict with the use of CVS and the reference above to a "definable hierarchy".
Again, I'm not an expert on what you can and can't do with CVS. Suggestions welcome.
I think this should apply to published standards and work products, but I don't think it can practically be applied to a development space.
If we have a "sandbox" phase then we wouldn't expect a persistent URL for those items. Only once a doc is checked into the permanent repository would we do this.
This suggests that the interface to the published standards space might require more constraints. I hope that these constraints can be imposed without requiring me to interact with the system only through a web interface.
As above, power users like yourself may wish to talk directly to the engine, but there will be some constraints for security and consistency. If it is practical to enforce those constraints via both a web interface as well as a native interface then we will. But if it's not practical then we'll have to do everything through a browser.
Multiple file types supported
o TCs will store both source (e.g. MSWord or HTML) and compiled (e.g. PDF) versions of each file; i.e. the repository should not allow a PDF to be checked in without a matching .doc or .html file
Uhm, what about documents that have a source which is neither a proprietary tool or HTML?
The above is not an exhaustive list. I'm just suggesting that both source and compiled versions should be in the repository. Any responsible developer should agree with this philosophy.
Imposing the requirement that the system check for classes of dependencies between files of different types is going to be tricky, especially as the specs evolve. Suppose I rebuild the PDF, can I check it in without checking in a new source document? What if I only corrected a formatting bug? If I check in a new source, what happens to the PDF?
Yeah, we'll have to figure this out. How do you do it when you write code?
I think a lot more detail is required in this part of the requirements.
That's why I'm asking for input.
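To make the source/compiled question above concrete, here is a sketch of one possible check-in rule. It assumes a policy that the thread leaves open: a PDF is accepted if a source file with the same base name is either in the same check-in or already in the repository. The function name is hypothetical, and the suffix sets contain only the examples named in the requirement.

```python
from pathlib import PurePosixPath

SOURCE_SUFFIXES = {".doc", ".html"}  # examples from the requirement; not exhaustive
COMPILED_SUFFIXES = {".pdf"}

def unmatched_compiled_files(checkin_files, repository_files=()):
    """Return compiled files in the check-in that have no source with the same base name."""
    known = [PurePosixPath(name) for name in (*checkin_files, *repository_files)]
    sources = {p.with_suffix("") for p in known if p.suffix.lower() in SOURCE_SUFFIXES}
    return [
        name for name in checkin_files
        if PurePosixPath(name).suffix.lower() in COMPILED_SUFFIXES
        and PurePosixPath(name).with_suffix("") not in sources
    ]

# Hypothetical usage: the first check-in would be rejected, the other two accepted.
print(unmatched_compiled_files(["spec.pdf"]))
print(unmatched_compiled_files(["spec.pdf", "spec.doc"]))
print(unmatched_compiled_files(["spec.pdf"], repository_files=["spec.html"]))
```

This rule says nothing about whether a rebuilt PDF is stale relative to its source; answering that would need a comparison of versions or timestamps, which is part of the open question.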
o HTML files may include graphics which will be stored with the file (use relative URLs?)
What about other cross-document links? What about XML files that refer to both HTML and PDF presentations? What about document trees that consist of multiple chapters in a hierarchy with a common set of figures?
More detail, please.
More input, please.
o use MIME types
Packages
o A specification may be composed of multiple documents. The entire package may be uploaded or downloaded in a single operation. Individual documents in the package may also be uploaded or downloaded.
I don't understand what you mean here. Are you suggesting that I might upload a package (as a ZIP file? as a MIME multi-part related stream?) and then several days later upload a new version of one component in that package. Having done so, what "version" does the package have? Can I still download the original? Can I download the revised version?
Probably the package will just be an HTML file with links to all of the components. In that case the package is updated by editing the links in the package file. Each of the components is maintained by editing it individually. Each component, as well as the package file, could have its own version number or date, but the entire set would collectively have to be versioned. Would this work?
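The package idea just described (individually versioned components plus a version for the set as a whole) could be sketched like this. The class and method names and the sample component names are hypothetical, and the choice to start every component at version 1 is an assumption; the point is only that each package version records which component versions it contained, so the original remains retrievable.

```python
class Package:
    """A package file listing components, each with its own version number."""

    def __init__(self, components):
        # One {component name: component version} snapshot per package version.
        self._history = [dict.fromkeys(components, 1)]

    @property
    def version(self):
        """Version of the set as a whole (1-based)."""
        return len(self._history)

    def update_component(self, name):
        """Record a new version of one component and of the whole set."""
        snapshot = dict(self._history[-1])
        snapshot[name] = snapshot.get(name, 0) + 1
        self._history.append(snapshot)

    def contents(self, package_version=None):
        """Component versions for a given package version (latest by default)."""
        index = (package_version or self.version) - 1
        return dict(self._history[index])

# Hypothetical usage.
package = Package(["overview.html", "schema.xsd"])
package.update_component("schema.xsd")
print(package.version, package.contents())
print(package.contents(package_version=1))
```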
o Support for chapters or parts of a multi-part document (with links between parts); a package could have a ToC with links to the individual files
I think any attempt to describe the size and shape of a package ("it will have a ToC and chapters" or "it will have a starting page and parts") will be problematic. Best just to accept that a multi-part document is a directed graph (a web).
Would my description (above) of a package work for this? The TC can decide how it wants to structure the multi-part spec.
o Support for modular DTDs (e.g. DocBook)
What does this requirement mean? Do you also mean modular W3C XML Schemas and RELAX NG grammars? Does this requirement differ from the preceding one in a particular way?
Pretty much the same, I think, but I'd be happy to hear other requirements not met by the above.
o The entire package is addressable via a single URL, as are the individual documents. The package URL will link to an HTML page listing the package contents.
Is that an HTML page constructed by the author of the package, or automatically from the content of the package? If it's the latter, what constraints, if any, does that impose on the contents of the package?
Security
o Check-in/out based on Kavi user authentication; different permissions for public, TC members, chair/secretary, etc.
o TC members have ??? rights (TBD)
o TC Chair and Secretary have create, edit rights for folders and checkin/out rights for documents in their respective TC area
o Admin has admin rights (create, checkin/out, delete of all folders and files)
o Public has read rights for all documents
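The rights listed above can be written down as a small table, sketched here under stated assumptions: the role and action names are hypothetical, TC member rights are left empty because the draft marks them TBD, and every role is assumed to inherit the public read right since all documents are publicly viewable.

```python
# Rights per role as listed in the draft requirements.
ROLE_RIGHTS = {
    "public": {"read"},
    "tc_member": set(),  # marked "???" (TBD) in the draft, so nothing is filled in
    "chair_or_secretary": {"create", "edit", "checkin", "checkout"},  # create/edit apply to folders
    "admin": {"create", "checkin", "checkout", "delete"},
}

def is_allowed(role, action, in_own_tc_area=True):
    """Return True if the role may perform the action."""
    if action in ROLE_RIGHTS["public"]:
        return True  # assumption: public rights apply to everyone
    if role == "chair_or_secretary" and not in_own_tc_area:
        return False  # chair/secretary rights apply only within their own TC area
    return action in ROLE_RIGHTS.get(role, set())

# Hypothetical usage.
print(is_allowed("chair_or_secretary", "checkin"))
print(is_allowed("chair_or_secretary", "delete"))
print(is_allowed("admin", "delete"))
```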
How does "admin" differ from chair/secretary?
"Admin" is the OASIS staff administrator of the doc mgmt system.
Kavi integration
o Kavi user acct/pswd used for authentication in doc mgmt system
o Notification to the Kavi group when a document is uploaded (same as current Kavi notification)
o The current Kavi doc repository is disabled; links within Kavi will go to this doc mgmt system instead (i.e. Kavi doc repository is hidden, this one drops in to replace it).
o Docs currently in the Kavi repository will continue to be addressable and viewable by their Kavi URL (allow for migration over time)
This requirement and the previous requirement seem to be in conflict. Can you explain how "the links within Kavi will go to this doc mgmt system instead" supports the goal that "the Kavi repository will continue to be addressable and viewable by their Kavi URL (allow for migration over time)"?
Right now when you're in Kavi you can click on a link for "doc repository" and it will take you to that page in Kavi. I'd like it to go to the new doc mgmt system instead. But we should allow current docs in the Kavi repository to stay where they're at until the TC wants to move them, so these docs need to remain addressable by the current URLs. We'll have to keep the Kavi search/browse accessible, but the default would go to the new doc mgmt system.
o When new Kavi group (TC/SC) is created, a doc mgmt area for that group and default folders are automatically created
This goes back to the question of defaults before. What hierarchy do you have in mind, and what are your motivations for creating it? I think it'll be easier in the long run to simply create an empty hierarchy and let the TCs populate it.
If you have in mind that minutes should go in /minutes and press clippings should go in /press, etc., then I think a detailed description of the default hierarchy is required.
See above. Still TBD, but we need both consistency as well as flexibility.
File naming (automation of this done in a later phase; just do this manually at first?)
o Naming and versioning of documents follows OASIS file naming scheme
o When a new document is created it will be named according to the scheme; automated helps to create/assign a name
This seems to duplicate the requirements expressed under "Persistent URLs". Is it intended to be different? I believe my comments there apply here as well.
The intent is to provide (eventually, maybe a bit later) a GUI to help name new files conformant with the OASIS doc naming scheme. I envision pull-downs to select each of the components of the name. But this will probably be later; the file creator would have to manually name the file for now.
Localizable interface, with localization to occur in a later phase
Later phase: Count/traffic report of downloads (how many people have downloaded a particular doc?)
Other later phase items?
- Issue tracking?
Sounds like a separate tool. Yes, we need this. Suggestions?
- automatic generation of PDF/HTML from source formats?
Yeah, we could add this, but is there a need? Can't people do this already?
- validation?
Ditto. Can't you do this already?
But, yes, I see the utility of having validation on checkin, and publishing, as part of a doc mgmt system.
- interactive forms (e.g., the ability to support an interface that asks a number of questions and then builds an appropriate schema customization layer)?
That's the sort of interface I had in mind for the file naming (above). But I see this as a separate tool for later.
- Syndication of announcements
- An informal "journal" space (or blog, if you will) for TC members to outline their thoughts and ideas?
Both of those are separate tools. Not sure how those would be part of a doc mgmt system.
Thanks for the feedback. Much appreciated.
-Karl
Be seeing you,
norm
P.S. I'm happy to report that your requirements document can be nicely presented in an open format (plain text, in this case) instead of a proprietary format. I hope that its greater accessibility in this format (and the fact that it's six times smaller) can be used to demonstrate once again the value of open standards.
(For even more thoughts on this topic, see http://www.gnu.org/philosophy/no-word-attachments.html)
--
=================================================================
Karl F. Best
Vice President, OASIS
office +1 978.667.5115 x206 mobile +1 978.761.1648
karl.best@oasis-open.org http://www.oasis-open.org
___________________________
Matthew MacKenzie
Senior Architect
Intelligent Documents Business Unit
Adobe Systems Canada Inc.
http://www.adobe.com/
506 869.0949