Subject: Re: [chairs] need your comments on DocMgmt system requirements
From: Norman Walsh <ndw@nwalsh.com>
To: karl.best@oasis-open.org
Date: Thu, 19 Feb 2004 15:36:41 -0500
/ "Karl F. Best" <karl.best@oasis-open.org> was heard to say:
| Norman Walsh wrote:
[...]
|> High level comments:
|> - I don't think these requirements adequately address the distinction
|>   between a development system (where TCs actively revise documents,
|>   schemas, etc.) and a publication system (where TCs post working
|>   drafts, standards, and other "finished" work products).
|>
|>   Is the proposal to develop one or the other, or both? If it's one or
|>   the other, then I think some of these requirements are completely
|>   inappropriate. If it's both, I think it might be useful to specify
|>   them separately. (And whether you imagine having resources to do
|>   them in sequence, or at the same time?)
|
| I've previously thought of having a two-phase system, the first of
| which would provide a "sandbox" for the TC members to collaborate in
| developing a document. Then once the doc reached a certain stage it
| would then go into a more controlled environment with e.g. versioning
| and edited only by the TC. I've gotten the impression that most TCs
| would only use the second phase, but I could be wrong.

Version control is important in both phases. My point is that
irrespective of whether the system is deployed in two phases or one,
the *requirements* for each are different. I think it would make sense
to construct a requirements document for each.

| Chairs: would you prefer having both of these phases built into the
| doc mgmt system (open collaboration, followed by more rigorous
| control)? or would you only use the second?

I would use both for the TCs in which I have administrative roles (at
the moment, DocBook, Entity Resolution, and RELAX NG).

|> - There are several places where the requirements seem to be
|>   self-contradictory.
|
| Specifics? This is obviously a draft so needs polishing, so
| suggestions are welcome.

I believe I outlined several in my detailed comments that followed.

|> - I think meeting all of the requirements listed below will be a
|>   significant challenge. A more detailed roadmap, showing staged
|>   progress with realistic time estimates would be very helpful.
|
| Yeah. That's the next step. But right now I'm just gathering
| requirements. I can't very well write a development schedule until I
| know what it is that we're trying to build.

Sorry, I wasn't suggesting you put the cart before the horse.

| I'd also like suggestions on which parts of this are most important.
| I'm debating whether we should try a phased development approach (i.e.
| provide base functionality now then add more functionality over
| time). Looking through the requirements that I have now, though, I'm
| not sure which ones we could put off until later.

The critical need, I think, is the ability to publish specs and
ancillary documents in a stable space. Perhaps it makes sense to
consider how this can be achieved quickly without necessarily waiting
until an entire automated system is in place.

| Chairs: suggestions please.
|
|> - A number of the features that you describe would seem to be at least
|>   partially addressed by open source efforts like G-Forge (an open
|>   source version of SourceForge). Are you considering a system like
|>   that, or are you expecting to "roll your own" from scratch?
|
| I'm intending for us to build on top of an existing system. That's why
| I said "probably CVS". We'd be silly to build something from scratch
| when the engine already exists. We'll build some sort of customized
| web interface on top of the engine. Once we have the requirements
| we'll know what it is that we need to build. I'd also like suggestions
| for the engine; is CVS the way to go, or do people recommend something
| else?

My point was that there are systems that offer more than just the
engine. I was wondering if you were considering a system like that or
if you were intent on building the entire system on top of a
particular engine.

|>>OASIS DocMgmt Functional Requirements
|>>
|>>(17 February 2004)
|>>
|>>General Description: A repository providing storage/management of
|>>files created by TCs, SCs, and other OASIS groups
|> Technical committees need to be able to store and manage a collection
|> of resources. Principal among these resources are documents, but it's
|> reasonable to consider other, related resources as well, including
|> issue lists, archives, news items, and syndicated content.
|
| The doc mgmt system would store any type of file. Not just specs, but
| also the other doc types you mention.
|
| Would some of these stored objects be links and not files?

I have never found that useful.

|>> o Probably based on CVS
|> The requirements for a "development tree" are likely to be somewhat
|> different than the requirements for a "publishing tree". In
|> particular, I would expect published standards to be more-or-less
|> immutable, to have persistent URIs, etc. In a development tree, those
|> constraints might be quite stifling.
|> CVS supports a development system very well. It's not immediately
|> clear to me if it supports a publication system equally well.
|
| I'm certainly not a CVS expert, though I'm aware that it was built for
| development rather than documents. So it may not be ideal for what we
| want.
|
| Does anyone have suggestions for a better engine, better suited for
| doc development and publishing, upon which to build our system?

It's not clear to me that a single engine is right for both systems or
kinds of applications.

|>> o A separate area in the repository for each TC/SC/group; both
|>>   default and definable hierarchy within each TC area
|> Can you elaborate on what you mean by "both default and definable"?
|> What do you have in mind for "default"?
|
| When we create a new TC we would define hierarchy branches for such
| things as e.g. "drafts", "minutes", "contributions" etc. (TBD). Then
| the TC chair could define additional branches as required. We'd want
| to keep the hierarchy as flat as possible to keep the URLs short, and
| we'd want some consistency, but I want to give the TCs some control
| over their space.

My intuition is that no hierarchy will fit every group, but perhaps a
small one could be made to work.

|>> o All documents are permanently archived (only Admin has delete
|>>   rights)
|> In CVS terms, you can delete a document, but you can always recover
|> it. In a development tree, it's not uncommon to reorganize some code
|> or a document and want to remove modules from the current "head" of
|> the development tree. This goes back to my comment before that the
|> requirements for publication and development are somewhat different.
|
| Maybe this is where the "sandbox" (above) comes in. I don't see the
| need of permanently archiving early drafts, but once a doc is checked
| into the permanent repository it should be permanent.

Yes, again, I think this points to the distinction between the two
kinds of repository.

|>> o All documents are publicly viewable, downloadable
|>>
|>> o Repository has a web interface for uploading and tree browsing,
|>>   searching, and retrieval
|>>
|>>     + Support for all major browsers
|>>
|>>     + Listing of single files includes filename, title, description,
|>>       date, creator, and language; listing of packages includes the
|>>       list of single files in the package
|>>
|>>     + Search by filename, title, date, creator, and language; and
|>>       full-text search of description and contents.
|> Does it have other interfaces? Are you describing a front-end for CVS
|> here, or something else? Does it support Web-DAV?
|
| I would expect that most people would want to use a web interface, but
| I suppose that power users may want to deal more directly with the
| engine. But there's also certain safeguards (permissions, restrictions
| on naming, etc.) that may require that we use an interface. I don't
| know yet; this may depend on the engine.

I can't (personally) imagine working with a development repository
that didn't offer command-line access, as CVS does.

| What are the benefits of Web-DAV? (I'm not an expert on this.)

It's a system for doing authoring and versioning over the web. It has
various clients. I don't know how widely deployed it is, although I
gather a number of web servers now support it "out of the box".

|> I think it would make sense to address searching as its own top-level
|> item. In particular, the description above suggests that every item
|> will have a set of metadata that can be searched. Where/when is this
|> metadata created? Can I add my own? Is it expressed in an open format,
|> an XML vocabulary or RDF or a topic map, or is it proprietary? How
|> does this metadata evolve as documents change in CVS?
|
| I see the metadata as comprised of the fields listed above. TBD. I
| don't know yet how this would be expressed because we haven't selected
| an engine yet.

Isn't that asking the question the wrong way around? If metadata is
important, the requirement is to support the metadata. That should
drive selection of the engine, not vice-versa.

| How does this matter? Yes, we should use XML on principle, but I don't
| see it as a requirement.

If the metadata is not extensible, some TC will eventually find that
limiting and they'll resort to coded values in one of the provided
fields or something else.

If the metadata is not available in an open format, then there are
whole classes of applications (such as the TR page builder that the
W3C uses) that can't be written.

I think expressing data and metadata in open formats should be a
requirement of a repository at OASIS.
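
To make the point concrete, here is a minimal sketch of what an open,
extensible metadata record might look like. The core field names come
from the draft's file listing requirement; the element names, the
"extension" mechanism, and the build_metadata function are all
hypothetical, not part of any proposed OASIS format.

```python
import xml.etree.ElementTree as ET

# Core fields taken from the draft's file listing requirement.
CORE_FIELDS = ("filename", "title", "description", "date", "creator", "language")


def build_metadata(core, extra=None):
    """Return an XML string for one document's metadata record.

    `core` must supply every field in CORE_FIELDS; `extra` holds any
    TC-defined fields (a hypothetical extension mechanism).
    """
    missing = [name for name in CORE_FIELDS if name not in core]
    if missing:
        raise ValueError("missing core fields: " + ", ".join(missing))

    record = ET.Element("document")
    for name in CORE_FIELDS:
        ET.SubElement(record, name).text = core[name]
    # Extension fields are kept apart so they never collide with core ones.
    for name, value in sorted((extra or {}).items()):
        field = ET.SubElement(record, "extension", {"name": name})
        field.text = value
    return ET.tostring(record, encoding="unicode")
```

Because the record is plain XML, a TC could add its own fields without
overloading the provided ones, and an outside tool could read the
records without knowing anything about the engine behind them.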

|> As for searching the content, that's clearly going to depend on the
|> type of content. What types will the system support?
|
| Obviously not all content will be searchable. If somebody uploads a
| blob there's not much we'll be able to do with it besides just store
| it.

I thought we were talking about requirements. One could express the
requirement that all content be searchable. I wouldn't, but one could.
So I think we should spell out what types we think do need to be
searchable. For example, I bet some people feel that Word documents
should be searchable. What about PDFs? What about XML?

| We will store whatever types of files the TCs need to store.
|
|>> Persistent URLs
|>>
|>> o At file creation the document is assigned a URL according to the
|>>   OASIS file naming scheme. The URL will always resolve to the latest
|>>   version of the document, regardless of the document's (versioned)
|>>   filename; a URL will identify a specification throughout its entire
|>>   lifetime from working draft to OASIS Standard. Previous versions of
|>>   the document will be accessible via a variant of the URL containing
|>>   the version number.
|> This is fine for storing standards but it's in conflict with the use
|> of CVS and the reference above to a "definable hierarchy".
|
| Again, I'm not an expert on what you can and can't do with CVS.
| Suggestions welcome.

Perhaps the requirements need to be expressed more generally, without
regard to the features of any particular solution so that we can see
if the proposed solutions match the requirements?

|> I think this should apply to published standards and work products,
|> but I don't think it can practically be applied to a development
|> space.
|
| If we have a "sandbox" phase then we wouldn't expect a persistent URL
| for those items. Only once a doc is checked into the permanent
| repository would we do this.

I think that might work, though I'd like the sandbox documents to have
predictable URLs.
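
As a rough illustration of the persistent-URL requirement quoted
above, the sketch below resolves a bare name to the latest checked-in
version and a "name/version" variant to a specific one. The path
layout, the Registry class, and the sample names in the usage note are
assumptions for illustration only; the OASIS file naming scheme is not
reproduced here.

```python
class Registry:
    """Maps a persistent document name to its ordered list of versions."""

    def __init__(self):
        self._versions = {}  # name -> list of (version label, stored filename)

    def check_in(self, name, version, filename):
        """Record a new version; later check-ins are treated as newer."""
        self._versions.setdefault(name, []).append((version, filename))

    def resolve(self, path):
        """Resolve 'name' to the latest version, or 'name/version' to that one."""
        name, _, version = path.strip("/").partition("/")
        history = self._versions.get(name)
        if not history:
            raise KeyError(name)
        if not version:
            return history[-1][1]  # bare URL: most recent check-in
        for label, filename in history:
            if label == version:
                return filename
        raise KeyError(path)
```

With two check-ins of a hypothetical "docbook-spec" (labels "wd-01"
then "cs-01"), resolving "/docbook-spec" returns the "cs-01" file,
while "/docbook-spec/wd-01" still returns the earlier draft.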

|> This suggests that the interface to the published standards space
|> might require more constraints. I hope that these constraints can be
|> imposed without requiring me to interact with the system only through
|> a web interface.
|
| As above, power users like yourself may wish to talk directly to the
| engine, but there will be some constraints for security and
| consistency. If it is practical to enforce those constraints via both
| a web interface as well as a native interface then we will. But if
| it's not practical then we'll have to do everything through a browser.

That seems reasonable to me.

|>> Multiple file types supported
|>>
|>> o TCs will store both source (e.g. MSWord or HTML) and compiled (e.g.
|>>   PDF) versions of each file; i.e. the repository should not allow a
|>>   PDF to be checked in without a matching .doc or .html file
|> Uhm, what about documents that have a source which is neither a
|> proprietary tool nor HTML?
|
| The above is not an exhaustive list. I'm just suggesting that both
| source and compiled versions should be in the repository. Any
| responsible developer should agree with this philosophy.

Yes, but I think the requirements need to be extended to list the
specific constraints proposed.

|> Imposing the requirement that the system check for classes of
|> dependencies between files of different types is going to be tricky,
|> especially as the specs evolve. Suppose I rebuild the PDF, can I check
|> it in without checking in a new source document? What if I only
|> corrected a formatting bug? If I check in a new source, what happens
|> to the PDF?
|
| Yeah, we'll have to figure this out. How do you do it when you write code?

Well, as a general rule, I don't put compiled forms in the repository.
In the cases where I do, the repository is generally unaware of the
dependency graph that drives the production of that compiled form, so
it doesn't attempt to do any dependency checking.

In other words, if I want to check in the V1.0 binary and the V1.5
sources, it's my gun, my bullet, and my foot.

|> I think a lot more detail is required in this part of the
|> requirements.
|
| That's why I'm asking for input.

I wasn't trying to suggest that you should have all the answers, Karl,
I was only trying to provide a thorough review of the draft you
posted.

I suggest that the sort of dependency checking you outlined is
unnecessary in V1.

|>> o HTML files may include graphics which will be stored with the file
|>>   (use relative URLs?)
|> What about other cross-document links? What about XML files that
|> refer to both HTML and PDF presentations? What about document trees
|> that consist of multiple chapters in a hierarchy with a common set of
|> figures?
|> More detail, please.
|
| More input, please.

My personal feeling is that the system should not attempt to impose
restrictions when I check files in. Ideally, I suppose, I imagine a
two-phase checkin process where I provide all the files in phase 1 but
they don't become publicly visible until phase 2. In order to
transition from phase 1 to phase 2, the system performs link checking,
perhaps validation, and other sanity checks on the documents.

But I don't think that's necessary for V1 either.
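
For what it's worth, the two-phase checkin described above could be
sketched roughly as follows. This is only an illustration under
simplifying assumptions: files live in a flat namespace, the link
extractor is a crude regular expression rather than a real HTML
parser, only relative href/src links are checked, and the function
names are made up.

```python
import re

# Matches href="..." or src="..." values; a deliberately crude link extractor.
# Fragment-only links ("#sec") are ignored and fragments are dropped.
LINK_RE = re.compile(r'(?:href|src)="([^"#]+)')


def broken_links(staged):
    """Return sorted (file, target) pairs for relative links with no staged target.

    `staged` maps a filename to its text content. Absolute URLs are skipped.
    """
    problems = []
    for name, text in staged.items():
        for target in LINK_RE.findall(text):
            if "://" in target:
                continue  # external link: not checked in this sketch
            if target not in staged:
                problems.append((name, target))
    return sorted(problems)


def publish(staged, public):
    """Phase 2: copy staged files into `public` only if every check passes."""
    problems = broken_links(staged)
    if problems:
        return problems  # nothing becomes visible
    public.update(staged)
    return []
```

Phase 1 is simply filling the staged mapping, with no restrictions.
Validation or other sanity checks would slot into publish alongside
the link check.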

|>> o use MIME types
|>>
|>>Packages
|>>
|>> o A specification may be composed of multiple documents. The entire
|>>   package may be uploaded or downloaded in a single operation.
|>>   Individual documents in the package may also be uploaded or
|>>   downloaded.
|> I don't understand what you mean here. Are you suggesting that I
|> might upload a package (as a ZIP file? as a MIME multi-part related
|> stream?) and then several days later upload a new version of one
|> component in that package? Having done so, what "version" does the
|> package have? Can I still download the original? Can I download the
|> revised version?
|
| Probably the package will just be an HTML file with links to all of
| the components. In that case the package is updated by editing the
| links in the package file. Each of the components is maintained by
| editing them individually. Each component, as well as the package
| file, could have its own version number or date, but the entire set
| would collectively have to be versioned. Would this work?

Yes, I think I see what you mean now, thanks. In CVS terms, I think
this would amount to "tagging" a distribution hierarchy.

|>> o Support for chapters or parts of a multi-part document (with links
|>>   between parts); a package could have a ToC with links to the
|>>   individual files
|> I think any attempt to describe the size and shape of a package ("it
|> will have a ToC and chapters" or "it will have a starting page and
|> parts") will be problematic. Best just to accept that a multi-part
|> document is a directed graph (a web).
|
| Would my description (above) of a package work for this? The TC can
| decide how it wants to structure the multi-part spec.

Yes, I think what you've outlined above would be flexible enough.

|>> o Support for modular DTDs (e.g. DocBook)
|> What does this requirement mean? Do you also mean modular W3C XML
|> Schemas and RELAX NG grammars? Does this requirement differ from the
|> preceding one in a particular way?
|
| Pretty much the same, I think, but I'd be happy to hear other
| requirements not met by the above.

I think that would cover it.

|>> o The entire package is addressable via a single URL, as are the
|>>   individual documents. The package URL will link to an HTML page
|>>   listing the package contents.
|> Is that an HTML page constructed by the author of the package, or
|> automatically from the content of the package? If it's the latter,
|> what constraints, if any, does that impose on the contents of the
|> package?

Following your description above, I think it's an HTML page
constructed by the author. I can imagine that it might wind up being
something more interesting. But we probably don't need to pursue that
until the next phase of the requirements process.

|>>Security
|>>
|>> o Check-in/out based on Kavi user authentication; different
|>>   permissions for public, TC members, chair/secretary, etc.
|>>
|>> o TC members have ??? rights (TBD)
|>>
|>> o TC Chair and Secretary have create, edit rights for folders and
|>>   checkin/out rights for documents in their respective TC area
|>>
|>> o Admin has admin rights (create, checkin/out, delete of all folders and files)
|>>
|>> o Public has read rights for all documents
|> How does "admin" differ from chair/secretary?
|
| "Admin" is the OASIS staff administrator of the doc mgmt system.

For the development tree, I think chairs, secretaries, and those
members nominated by the chair or secretary, need full rights
(read/write/update/delete/...)

|>>Kavi integration
|>>
|>> o Kavi user acct/pswd used for authentication in doc mgmt system
|>>
|>> o Notification to the Kavi group when a document is uploaded (same as
|>>   current Kavi notification)
|>>
|>> o The current Kavi doc repository is disabled; links within Kavi will
|>>   go to this doc mgmt system instead (i.e. Kavi doc repository is
|>>   hidden, this one drops in to replace it).
|>>
|>> o Docs currently in the Kavi repository will continue to be
|>>   addressable and viewable by their Kavi URL (allow for migration over
|>>   time)
|> This requirement and the previous requirement seem to be in conflict.
|> Can you explain how "the links within Kavi will go to this doc mgmt
|> system instead" supports the goal that "the Kavi repository will
|> continue to be addressable and viewable by their Kavi URL (allow for
|> migration over time)"?
|
| Right now when you're in Kavi you can click on a link for "doc
| repository" and it will take you to that page in Kavi. I'd like it to
| go to the new doc mgmt system instead. But we should allow current
| docs in the Kavi repository to stay where they're at until the TC
| wants to move them, so these docs need to remain addressable by the
| current URLs. We'll have to keep the Kavi search/browse accessible,
| but the default would go to the new doc mgmt system.

Ok.

|>> o Naming and versioning of documents follows OASIS file naming scheme
|>>
|>> o When a new document is created it will be named according to the
|>>   scheme; automated helps to create/assign a name
|> This seems to duplicate the requirements expressed under "Persistent
|> URLs". Is it intended to be different? I believe my comments there
|> apply here as well.
|
| The intent is to provide (eventually, maybe a bit later) a GUI to help
| name new files conformant with the OASIS doc naming scheme. I envision
| pull-downs to select each of the components of the name. But this will
| probably be later; the file creator would have to manually name the
| file for now.

Ok.

|>>Localizable interface, with localization to occur in a later phase
|>>
|>>Later phase: Count/traffic report of downloads (how many people have
|>>downloaded a particular doc?)
|> Other later phase items?
|>   - Issue tracking?
|
| Sounds like a separate tool. Yes, we need this. Suggestions?

There are several choices out there. I find the tracker stuff at SF
works fine. Others use Bugzilla. In any event, I think these things
are complex enough that I'd recommend finding a solution rather than
building one.

|>   - validation?
|
| Ditto. Can't you do this already?
|
| But, yes, I see the utility of having validation on checkin, and
| publishing, as part of a doc mgmt system.

It's a required part of the W3C publication process and despite the
fact that I've done it a dozen times, it still catches me more often
than I'd like to admit.

|>   - interactive forms (e.g., the ability to support an interface that
|>     asks a number of questions and then builds an appropriate schema
|>     customization layer)?
|
| That's the sort of interface I had in mind for the file naming
| (above). But I see this as a separate tool for later.

Definitely.

                                        Be seeing you,
                                          norm

-- 
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.