Subject: RE: [chairs] need your comments on DocMgmt system requirements
From: Christopher B Ferris <chrisfer@us.ibm.com>
To: "Rogers, Tony" <Tony.Rogers@ca.com>
Date: Wed, 18 Feb 2004 16:32:58 -0500
IMO, CVS is no more inefficient than Kavi for storing versions of things 
like Word files.

Cheers,

Christopher Ferris
STSM, Emerging e-business Industry Architecture
email: chrisfer@us.ibm.com
blog: http://webpages.charter.net/chrisfer/blog.html
phone: +1 508 377 9295

"Rogers, Tony" <Tony.Rogers@ca.com> wrote on 02/18/2004 04:09:29 PM:

> Make a couple of changes in a Word document and store it in CVS (yeah, binary is probably how
> you'd do it) - it stores the entire file again, rather than the changes - that's what I'm calling
> inefficient, meaning "using more storage than is necessary to store the changes". I didn't say
> "bad". I didn't say "unusable". I said "inefficient", and I cordially disagree with your statement
> that "It is not.".
> 
> I have used CVS for over 10 years, and it's a useful place to store source code. It's a lot less
> useful when storing non-text files, however.
> 
> One of the features of CVS and ancillary programs that I use frequently is a display of the
> differences between two versions of a file. I don't get that facility when CVS is storing Word
> documents - all I can do is retrieve the two Word documents and look at them (anyone got a good
> Diff for Word?). That's a big loss, especially in this environment, where we have multiple
> authors. To make matters worse, it encourages use of the "track changes" feature of Word, and that
> produces much larger Word documents...
> 
> So what I am asking is: is there a system which will give us this valuable feature for Word
> documents? (Ideally, also for PDF files) Something that will allow us to see things like "this
> paragraph was added by Mr Slowsteady on 13 July, and modified by Ms Quicksmart on 14 August". If
> the answer to that is an expensive document management system, then let's consider it. If it can
> work on top of CVS so we can use native CVS facilities for text and html files, then that's a bonus.
> 
> Tony Rogers
> -----Original Message----- 
> From: Matthew MacKenzie [mailto:mattm@adobe.com] 
> Sent: Thu 19-Feb-04 7:34 
> To: Christopher B Ferris 
> Cc: Rogers, Tony; karl.best@oasis-open.org; Chairs OASIS 
> Subject: Re: [chairs] need your comments on DocMgmt system requirements

> cvs -z9 add -kb mydoc.doc 
> 
> You need to mark the document as "binary", and to not expand keywords. The -z flag tells the
> client the level of compression to use. I've been using CVS almost daily for 5 years, and there
> are several binary files in there (jars, zips, docs, pdfs, ps, exe, gz, ...).
> 
> Is CVS inefficient in storing and versioning MS Word, or other binary documents? No. It is not.
> Does CVS integrate with MS Word to make the cvs diff command and conflict resolution work? No. If
> we want that, OASIS will probably want to pony up big bucks for a high end content management system.
> 

> On Feb 18, 2004, at 4:23 PM, Christopher B Ferris wrote: 
> 
> Right, but you can store word docs in CVS... it's just inefficient. As for HTML/XML,
> it works just fine.
> 
> Cheers, 
> 
> Christopher Ferris 
> STSM, Emerging e-business Industry Architecture 
> email: chrisfer@us.ibm.com 
> blog: http://webpages.charter.net/chrisfer/blog.html 
> phone: +1 508 377 9295 
> 
> "Rogers, Tony" <Tony.Rogers@ca.com> wrote on 02/18/2004 03:13:55 PM: 
> 
> In my experience, CVS doesn't handle MS Word documents well. It is designed for plain-text source
> code, and MS Word's file format doesn't allow it to produce an economical diff between one version
> and the next. This means that it wastes considerable space when versioning Word. I cannot comment
> on its ability to version html, but I suspect it would do much better on that. Perhaps we should
> all be using TeX, because that can be versioned more readily (ah, that was a joke...)
> 
> Is there a tool that would be able to version MS Word more effectively? I certainly don't know.
> Does that mean we shouldn't use Word? I hope not - our TC has found Word's change tracking rather
> useful when working collaboratively.
> 

> Tony Rogers 
> tony.rogers@ca.com 
> co-chair UDDI TC 
> -----Original Message----- 
> From: Karl F. Best [mailto:karl.best@oasis-open.org] 
> Sent: Thu 19-Feb-04 2:04 
> To: Norman Walsh 
> Cc: Chairs OASIS; Jeff Lomas 
> Subject: Re: [chairs] need your comments on DocMgmt system requirements 
> 
> Norman Walsh wrote: 
> / "Karl F. Best" <karl.best@oasis-open.org> was heard to say: 
> | I've put together a draft functional requirements document for this
> | doc mgmt system and would like to get your feedback. It is very
> | important that we have the requirements correct and complete before we
> | start development of the project -- many of you are developers so I'm
> | sure that you understand the importance of this.
> 
> High level comments: 
> 
> - I don't think these requirements adequately address the distinction 
> between a development system (where TCs actively revise documents, 
> schemas, etc.) and a publication system (where TCs post working 
> drafts, standards, and other "finished" work products). 
> 
> Is the proposal to develop one or the other, or both? If it's one or
> the other, then I think some of these requirements are completely
> inappropriate. If it's both, I think it might be useful to specify
> them separately. (And whether you imagine having resources to do
> them in sequence, or at the same time?)
> 
> I've previously thought of having a two-phase system, the first of which
> would provide a "sandbox" for the TC members to collaborate in
> developing a document. Then once the doc reached a certain stage it
> would then go into a more controlled environment with e.g. versioning
> and edited only by the TC. I've gotten the impression that most TCs
> would only use the second phase, but I could be wrong.
> 
> Chairs: would you prefer having both of these phases built into the doc
> mgmt system (open collaboration, followed by more rigorous control)? or
> would you only use the second?
> 
> - There are several places where the requirements seem to be 
> self-contradictory. 
> 
> Specifics? This is obviously a draft so needs polishing, so suggestions 
> are welcome. 
> 
> - I think meeting all of the requirements listed below will be a 
> significant challenge. A more detailed roadmap, showing staged 
> progress with realistic time estimates would be very helpful. 
> 
> Yeah. That's the next step. But right now I'm just gathering 
> requirements. I can't very well write a development schedule until I 
> know what it is that we're trying to build. 
> 
> I'd also like suggestions on which parts of this are most important. I'm
> debating whether we should try a phased development approach (i.e.
> provide base functionality now then add more functionality over time).
> Looking through the requirements that I have now, though, I'm not sure
> which ones we could put off until later.
> 
> Chairs: suggestions please. 
> 
> - A number of the features that you describe would seem to be at least
> partially addressed by open source efforts like G-Forge (an open
> source version of SourceForge). Are you considering a system like
> that, or are you expecting to "roll your own" from scratch?
> 
> I'm intending for us to build on top of an existing system. That's why I
> said "probably CVS". We'd be silly to build something from scratch when
> the engine already exists. We'll build some sort of customized web
> interface on top of the engine. Once we have the requirements we'll know
> what it is that we need to build. I'd also like suggestions for the
> engine; is CVS the way to go, or do people recommend something else?
> 
> OASIS DocMgmt Functional Requirements 
> 
> (17 February 2004) 
> 
> General Description: A repository providing storage/management of 
> files created by TCs, SCs, and other OASIS groups 
> 
> Technical committees need to be able to store and manage a collection 
> of resources. Principal among these resources are documents, but it's 
> reasonable to consider other, related resources as well, including 
> issue lists, archives, news items, and syndicated content. 
> 
> The doc mgmt system would store any type of file. Not just specs, but 
> also the other doc types you mention. 
> 
> Would some of these stored objects be links and not files? 
> 
> o Probably based on CVS 
> 
> The requirements for a "development tree" are likely to be somewhat 
> different than the requirements for a "publishing tree". In 
> particular, I would expect published standards to be more-or-less 
> immutable, to have persistent URIs, etc. In a development tree, those 
> constraints might be quite stifling. 
> 
> CVS supports a development system very well. It's not immediately 
> clear to me if it supports a publication system equally well. 
> 
> I'm certainly not a CVS expert, though I'm aware that it was built for 
> development rather than documents. So it may not be ideal for what we 
> want. 
> 
> Does anyone have suggestions for a better engine, better suited for doc 
> development and publishing, upon which to build our system? 
> 
> o A separate area in the repository for each TC/SC/group; both 
> default and definable hierarchy within each TC area 
> 
> Can you elaborate on what you mean by "both default and definable"? 
> What do you have in mind for "default"? 
> 
> When we create a new TC we would define hierarchy branches for such
> things as e.g. "drafts", "minutes", "contributions" etc. (TBD). Then the
> TC chair could define additional branches as required. We'd want to keep
> the hierarchy as flat as possible to keep the URLs short, and we'd want
> some consistency, but I want to give the TCs some control over their
> space.
> 
> o All documents are permanently archived (only Admin has delete 
> rights) 
> 
> In CVS terms, you can delete a document, but you can always recover 
> it. In a development tree, it's not uncommon to reorganize some code 
> or a document and want to remove modules from the current "head" of 
> the development tree. This goes back to my comment before that the 
> requirements for publication and development are somewhat different. 
> 
> Maybe this is where the "sandbox" (above) comes in. I don't see the need
> of permanently archiving early drafts, but once a doc is checked into
> the permanent repository it should be permanent.
> 
> o All documents are publicly viewable, downloadable 
> 
> o Repository has a web interface for uploading and tree browsing, 
> searching, and retrieval 
> 
> + Support for all major browsers 
> 
> + Listing of single files includes filename, title, description, 
> date, creator, and language; listing of packages includes the 
> list of single files in the package 
> 
> + Search by filename, title, date, creator, and language; and 
> full-text search of description and contents. 
> 
> Does it have other interfaces? Are you describing a front-end for CVS 
> here, or something else? Does it support Web-DAV? 
> 
> I would expect that most people would want to use a web interface, but I
> suppose that power users may want to deal more directly with the engine.
> But there are also certain safeguards (permissions, restrictions on
> naming, etc.) that may require that we use an interface. I don't know
> yet; this may depend on the engine.
> 
> What are the benefits of Web-DAV? (I'm not an expert on this.) 
> 
> I think it would make sense to address searching as its own top-level 
> item. In particular, the description above suggests that every item 
> will have a set of metadata that can be searched. Where/when is this 
> metadata created? Can I add my own? Is it expressed in an open format, 
> an XML vocabulary or RDF or a topic map, or is it proprietary? How 
> does this metadata evolve as documents change in CVS? 
> 
> I see the metadata as comprised of the fields listed above. TBD. I don't
> know yet how this would be expressed because we haven't selected an
> engine yet.
> 
> How does this matter? Yes, we should use XML on principle, but I don't 
> see it as a requirement. 
> 
> As for searching the content, that's clearly going to depend on the 
> type of content. What types will the system support? 
> 
> Obviously not all content will be searchable. If somebody uploads a blob
> there's not much we'll be able to do with it besides just store it.
> 
> We will store whatever types of files the TCs need to store. 
> 
> Persistent URLs 
> 
> o At file creation the document is assigned a URL according to the
> OASIS file naming scheme. The URL will always resolve to the latest
> version of the document, regardless of the document's (versioned)
> filename; a URL will identify a specification throughout its entire
> lifetime from working draft to OASIS Standard. Previous versions of
> the document will be accessible via a variant of the URL containing
> the version number.
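
The resolution rule in this requirement can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `/specs/<name>` path layout, the `/v<N>` variant suffix, the made-up filenames, and the in-memory `VERSIONS` table are assumptions for illustration only, not part of the OASIS file naming scheme or of any existing system.

```python
# Hypothetical catalogue: persistent name -> stored filenames, oldest first.
# The names and filenames below are invented for this example.
VERSIONS = {
    "example-spec": [
        "example-spec-draft-01.doc",
        "example-spec-draft-02.doc",
        "example-spec-final.doc",
    ],
}


def resolve(path):
    """Map a persistent URL path to a stored filename.

    '/specs/<name>'      -> the latest version
    '/specs/<name>/v<N>' -> the N-th version, 1-based (assumed URL variant)
    Raises KeyError for anything that cannot be resolved.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "specs" or parts[1] not in VERSIONS:
        raise KeyError(path)
    history = VERSIONS[parts[1]]
    if len(parts) == 2:
        # The bare URL always follows the newest stored version.
        return history[-1]
    if len(parts) == 3 and parts[2].startswith("v") and parts[2][1:].isdecimal():
        index = int(parts[2][1:])
        if 1 <= index <= len(history):
            return history[index - 1]
    raise KeyError(path)
```

In this sketch the bare URL keeps pointing at whatever was checked in last, while the versioned variant stays fixed once assigned.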
> 
> This is fine for storing standards but it's in conflict with the use 
> of CVS and the reference above to a "definable hierarchy". 
> 
> Again, I'm not an expert on what you can and can't do with CVS. 
> Suggestions welcome. 
> 
> I think this should apply to published standards and work products, 
> but I don't think it can practically be applied to a development 
> space. 
> 
> If we have a "sandbox" phase then we wouldn't expect a persistent URL 
> for those items. Only once a doc is checked into the permanent 
> repository would we do this. 
> 
> This suggests that the interface to the published standards space 
> might require more constraints. I hope that these constraints can be 
> imposed without requiring me to interact with the system only through 
> a web interface. 
> 
> As above, power users like yourself may wish to talk directly to the
> engine, but there will be some constraints for security and consistency.
> If it is practical to enforce those constraints via both a web interface
> as well as a native interface then we will. But if it's not practical
> then we'll have to do everything through a browser.
> 
> Multiple file types supported 
> 
> o TCs will store both source (e.g. MSWord or HTML) and compiled (e.g. 
> PDF) versions of each file; i.e. the repository should not allow a 
> PDF to be checked in without a matching .doc or .html file 
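
One possible reading of this check-in rule, as a minimal Python sketch. It assumes that "matching" means "same base name" and that only `.doc` and `.html` count as source; both are assumptions made for illustration, and the function name is hypothetical.

```python
import os

# Assumed extension sets; the requirement names these only as examples.
SOURCE_EXTS = {".doc", ".html"}
COMPILED_EXTS = {".pdf"}


def unmatched_compiled(filenames):
    """Return compiled files that have no source file with the same base name."""
    sources = set()
    compiled = []
    for name in filenames:
        stem, ext = os.path.splitext(name)
        ext = ext.lower()
        if ext in SOURCE_EXTS:
            sources.add(stem)
        elif ext in COMPILED_EXTS:
            compiled.append((stem, name))
    # A compiled file is acceptable only if some source shares its base name.
    return sorted(name for stem, name in compiled if stem not in sources)
```

A check-in hook built on this would reject the upload whenever the returned list is non-empty. The sketch says nothing about whether the source and the PDF are actually in sync.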
> 
> Uhm, what about documents that have a source which is neither a
> proprietary tool nor HTML?
> 
> The above is not an exhaustive list. I'm just suggesting that both 
> source and compiled versions should be in the repository. Any 
> responsible developer should agree with this philosophy. 
> 
> Imposing the requirement that the system check for classes of 
> dependencies between files of different types is going to be tricky, 
> especially as the specs evolve. Suppose I rebuild the PDF, can I check 
> it in without checking in a new source document? What if I only 
> corrected a formatting bug? If I check in a new source, what happens 
> to the PDF? 
> 
> Yeah, we'll have to figure this out. How do you do it when you write 
> code? 
> 
> I think a lot more detail is required in this part of the 
> requirements. 
> 
> That's why I'm asking for input. 
> 
> o HTML files may include graphics which will be stored with the file 
> (use relative URLs?) 
> 
> What about other cross-document links? What about XML files that refer 
> to both HTML and PDF presentations? What about document trees that 
> consist of multiple chapters in a hierarchy with a common set of 
> figures? 
> 
> More detail, please. 
> 
> More input, please. 
> 
> o use MIME types 
> 
> Packages 
> 
> o A specification may be composed of multiple documents. The entire 
> package may be uploaded or downloaded in a single operation. 
> Individual documents in the package may also be uploaded or 
> downloaded. 
> 
> I don't understand what you mean here. Are you suggesting that I might
> upload a package (as a ZIP file? as a MIME multi-part related stream?)
> and then several days later upload a new version of one component in
> that package? Having done so, what "version" does the package have?
> Can I still download the original? Can I download the revised version?
> 
> Probably the package will just be an HTML file with links to all of the
> components. In that case the package is updated by editing the links in
> the package file. Each of the components is maintained by editing it
> individually. Each component, as well as the package file, could have
> its own version number or date, but the entire set would collectively
> have to be versioned. Would this work?
> 
> o Support for chapters or parts of a multi-part document (with links 
> between parts); a package could have a ToC with links to the 
> individual files 
> 
> I think any attempt to describe the size and shape of a package ("it
> will have a ToC and chapters" or "it will have a starting page and
> parts") will be problematic. Best just to accept that a multi-part
> document is a directed graph (a web).
> 
> Would my description (above) of a package work for this? The TC can 
> decide how it wants to structure the multi-part spec. 
> 
> o Support for modular DTDs (e.g. DocBook) 
> 
> What does this requirement mean? Do you also mean modular W3C XML 
> Schemas and RELAX NG grammars? Does this requirement differ from the 
> preceding one in a particular way? 
> 
> Pretty much the same, I think, but I'd be happy to hear other 
> requirements not met by the above. 
> 
> o The entire package is addressable via a single URL, as are the 
> individual documents. The package URL will link to an HTML page 
> listing the package contents. 
> 
> Is that an HTML page constructed by the author of the package, or 
> automatically from the content of the package? If it's the latter, 
> what constraints, if any, does that impose on the contents of the 
> package? 
> 

> Security 
> 
> o Check-in/out based on Kavi user authentication; different 
> permissions for public, TC members, chair/secretary, etc. 
> 
> o TC members have ??? rights (TBD) 
> 
> o TC Chair and Secretary have create, edit rights for folders and 
> checkin/out rights for documents in their respective TC area 
> 
> o Admin has admin rights (create, checkin/out, delete of all folders 
> and files) 
> 
> o Public has read rights for all documents 
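
The role list above can be written down as a small permission check. This is a minimal Python sketch under stated assumptions: the role and action names are hypothetical labels, folder and document rights are merged into one set for brevity, and TC member rights are deliberately left unimplemented because the draft marks them TBD.

```python
# Rights as listed in the draft; the string labels themselves are illustrative.
ADMIN_RIGHTS = {"create", "checkin", "checkout", "delete"}
CHAIR_RIGHTS = {"create", "edit", "checkin", "checkout"}


def is_allowed(role, action, user_tc=None, doc_tc=None):
    """Return True if `role` may perform `action` on an item in `doc_tc`."""
    if action == "read":
        # Public has read rights for all documents.
        return True
    if role == "admin":
        # Admin rights apply to all folders and files.
        return action in ADMIN_RIGHTS
    if role in ("chair", "secretary"):
        # Chair and secretary rights apply only inside their own TC area.
        return action in CHAIR_RIGHTS and user_tc is not None and user_tc == doc_tc
    if role == "member":
        # The draft leaves TC member rights as "???", so nothing is guessed here.
        raise NotImplementedError("TC member rights are TBD in the draft")
    return False
```

For example, `is_allowed("chair", "checkin", "uddi", "uddi")` is true, while the same call with a different `doc_tc` is false.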
> 

> How does "admin" differ from chair/secretary? 
> 
> "Admin" is the OASIS staff administrator of the doc mgmt system.
> 
> Kavi integration 
> 
> o Kavi user acct/pswd used for authentication in doc mgmt system 
> 
> o Notification to the Kavi group when a document is uploaded (same as 
> current Kavi notification) 
> 
> o The current Kavi doc repository is disabled; links within Kavi will 
> go to this doc mgmt system instead (i.e. Kavi doc repository is 
> hidden, this one drops in to replace it). 
> 
> o Docs currently in the Kavi repository will continue to be
> addressable and viewable by their Kavi URL (allow for migration
> over time)
> 
> This requirement and the previous requirement seem to be in conflict. 
> Can you explain how "the links within Kavi will go to this doc mgmt 
> system instead" supports the goal that "the Kavi repository will 
> continue to be addressable and viewable by their Kavi URL (allow for 
> migration over time)"? 
> 
> Right now when you're in Kavi you can click on a link for "doc
> repository" and it will take you to that page in Kavi. I'd like it to go
> to the new doc mgmt system instead. But we should allow current docs in
> the Kavi repository to stay where they're at until the TC wants to move
> them, so these docs need to remain addressable by the current URLs.
> We'll have to keep the Kavi search/browse accessible, but the default
> would go to the new doc mgmt system.
> 
> o When new Kavi group (TC/SC) is created, a doc mgmt area for that 
> group and default folders are automatically created 
> 
> This goes back to the question of defaults before. What hierarchy do 
> you have in mind, and what are your motivations for creating it? I 
> think it'll be easier in the long run to simply create an empty 
> hierarchy and let the TCs populate it. 
> 
> If you have in mind that minutes should go in /minutes and press 
> clippings should go in /press, etc., then I think a detailed 
> description of the default hierarchy is required. 
> 
> See above. Still TBD, but we need both consistency as well as 
> flexibility. 
> 
> File naming (automation of this done in a later phase; just do this 
> manually at first?) 
> 
> o Naming and versioning of documents follows OASIS file naming scheme 
> 
> o When a new document is created it will be named according to the 
> scheme; automated helps to create/assign a name 
> 
> This seems to duplicate the requirements expressed under "Persistent 
> URLs". Is it intended to be different? I believe my comments there 
> apply here as well. 
> 
> The intent is to provide (eventually, maybe a bit later) a GUI to help
> name new files conformant with the OASIS doc naming scheme. I envision
> pull-downs to select each of the components of the name. But this will
> probably be later; the file creator would have to manually name the file
> for now.
> 
> Localizable interface, with localization to occur in a later phase 
> 
> Later phase: Count/traffic report of downloads (how many people have 
> downloaded a particular doc?) 
> 

> Other later phase items? 
> 
> - Issue tracking? 
> 
> Sounds like a separate tool. Yes, we need this. Suggestions? 
> 
> - automatic generation of PDF/HTML from source formats? 
> 
> Yeah, we could add this, but is there a need? Can't people do this 
> already? 
> 
> - validation? 
> 
> Ditto. Can't you do this already? 
> 
> But, yes, I see the utility of having validation on checkin, and 
> publishing, as part of a doc mgmt system. 
> 
> - interactive forms (e.g., the ability to support an interface that 
> asks a number of questions and then builds an appropriate schema 
> customization layer)? 
> 
> That's the sort of interface I had in mind for the file naming (above). 
> But I see this as a separate tool for later. 
> 
> - Syndication of announcements 
> - An informal "journal" space (or blog, if you will) for TC members 
> to outline their thoughts and ideas? 
> 
> Both of those are separate tools. Not sure how those would be part of a 
> doc mgmt system. 
> 
> Thanks for the feedback. Much appreciated. 
> 
> -Karl 
> 
> 
> 
> 
> 

> 
> Be seeing you, 
> norm 
> 
> P.S. I'm happy to report that your requirements document can be nicely 
> presented in an open format (plain text, in this case) instead of a 
> proprietary format. I hope that its greater accessibility in this 
> format (and the fact that it's six times smaller) can be used to 
> demonstrate once again the value of open standards. 
> 
> (For even more thoughts on this topic, see 
> http://www.gnu.org/philosophy/no-word-attachments.html) 
> 
> 

> -- 
> ================================================================= 
> Karl F. Best 
> Vice President, OASIS 
> office +1 978.667.5115 x206 mobile +1 978.761.1648 
> karl.best@oasis-open.org http://www.oasis-open.org 
> 
> 

> ___________________________ 
> Matthew MacKenzie 
> Senior Architect 
> Intelligent Documents Business Unit 
> Adobe Systems Canada Inc. 
> http://www.adobe.com/ 
> 506 869.0949