office-requirements message

Subject: Proof of Concept -- We can suck structured data out from ODF comment list archives

From: robert_weir@us.ibm.com
To: office-requirements@lists.oasis-open.org
Date: Thu, 15 Jan 2009 16:59:15 -0500

When we discussed the quantity of recommendations that might come from our 
Call for Proposals for ODF-Next, one consideration was how we will handle 
all the incoming data.  Manually transcribing data into a spreadsheet, as 
we do now, did not sound fun.  And setting up our own web form for comment 
submissions wasn't feasible, because it does not accord with OASIS IPR 
rules, which require public comments to come through the list.

So, I wrote a Python script that goes through list archives and dumps out 
the URL to each post, the author, the subject of the post, and the 
date/time of the post.  I've tested against the office and office-comment 
lists, though it should work for any OASIS list archive.  The only 
complication was a slight change in page structure that occurred back in 
January 2003, but I was able to conditionalize some of the logic to handle 
it both ways.
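
For anyone who wants a feel for the approach, here is a minimal sketch of 
the extraction step.  It is not the actual script.  The two row layouts 
(NEW_ROW and OLD_ROW), the function name extract_posts, and the example.org 
URL are all made up for illustration; the real archive markup is different 
and would need its own patterns.  The point is only to show how one 
function can handle two page structures.

```python
import re
from urllib.parse import urljoin

# ASSUMPTION: both row layouts below are hypothetical stand-ins.
# The real OASIS archive markup is not reproduced here.
NEW_ROW = re.compile(
    r'<li><a href="(?P<href>[^"]+)">(?P<subject>[^<]*)</a>\s*'
    r'<i>(?P<author>[^<]*)</i>\s*<tt>(?P<date>[^<]*)</tt></li>'
)
OLD_ROW = re.compile(
    r'<li><b><a href="(?P<href>[^"]+)">(?P<subject>[^<]*)</a></b>,\s*'
    r'<em>(?P<author>[^<]*)</em>\s*\((?P<date>[^)]*)\)</li>'
)


def extract_posts(index_html, base_url):
    """Return one dict (url, author, subject, date) per post in an index page."""
    # Use the newer layout if the page contains it, else fall back to the older one.
    pattern = NEW_ROW if NEW_ROW.search(index_html) else OLD_ROW
    posts = []
    for m in pattern.finditer(index_html):
        posts.append({
            # Links in the index are assumed to be relative to the page URL.
            "url": urljoin(base_url, m.group("href")),
            "author": m.group("author").strip(),
            "subject": m.group("subject").strip(),
            "date": m.group("date").strip(),
        })
    return posts


if __name__ == "__main__":
    sample = ('<li><a href="msg00001.html">Hello</a> '
              '<i>Alice</i> <tt>2009-01-15</tt></li>')
    for post in extract_posts(sample, "http://example.org/archives/list/200901/"):
        print(post)
```

A real version would fetch each monthly index page and would probably use 
an HTML parser rather than regular expressions, which are brittle against 
markup changes.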

Here is an example dump for the office-comment list:



Although the output format here is less than inspiring, it would be easy 
to make the script output to an ODF spreadsheet file directly, or to a CSV 
file suitable for importing into JIRA.
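
As a rough illustration of the CSV route, something like the following 
would do, using only the Python standard library.  The column names and 
the function name posts_to_csv are my own placeholders; whatever JIRA 
actually expects on import would have to be checked and mapped 
accordingly.

```python
import csv
import io

# ASSUMPTION: these column names are placeholders, not a JIRA import schema.
FIELDS = ["url", "author", "subject", "date"]


def posts_to_csv(posts):
    """Serialize a list of post dicts to CSV text with a header row."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    # The csv module quotes fields that contain commas or quotes.
    writer.writerows(posts)
    return buf.getvalue()


if __name__ == "__main__":
    example = [{
        "url": "http://example.org/archives/list/200901/msg00001.html",
        "author": "Alice",
        "subject": "Tables, again",
        "date": "2009-01-15",
    }]
    print(posts_to_csv(example), end="")
```

Going through the csv module rather than joining strings by hand matters 
here, since subject lines routinely contain commas and quotation marks.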

So I think we're good to go in that department.

Regards,

-Rob

out.zip