ubl-lcsc message

Subject: Re: [ubl-lcsc] Document instance UUID
From: Chin Chee-Kai <cheekai@softml.net>
To: Chiusano Joseph <chiusano_joseph@bah.com>
Date: Mon, 10 Mar 2003 12:24:58 +0800 (SGT)

On Fri, 7 Mar 2003, Chiusano Joseph wrote:

>>It sounds like there are 2 levels here:
>>
>>	(1) Document (as an entity)
>>	(2) Transmission of document

Right, we've identified these 2, and 2 additional areas that
look independent of one another and that have
implications for the "identity" of document instances:
	(3) Storage naming
	(4) Versioning

(3) has to do with naming individual UBL files (or in our trial
case, the XIP files) in the storage file system.  In XIP, we
just conveniently chose FTP as the transport mechanism.  The
transport (FTP) itself does not have an ID assigned to the
transported content (assuming session number != content ID).
As a transport issue, all files received have to be distinct
from one another without looking at the content.  We also
conveniently define a simplistic file instance naming using
date/time.  We also imagine that the transport means could
be different in practical use, such as Purchasing manager
carrying a diskette of UBL (or XIP) files to the factory.

Now, we could end up in a situation having 3 files named
differently in storage, but all having same content (if
a UBL-processor peers into them):
	File_20030301_101023_af81.ubl -- (A)
	File_20030301_101024_39FZ.ubl -- (B)
	PO_1.ubl -- (C)

(A) is sent first and named by the receiving FTP server
instantiating the file with date/time and a 4-character
random string to distinguish "nearby" files within 1 second
of reception.  (B) is received a second later (having same
content as (A)), and (C) is human-transported via diskette
to the receiving server.
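
A rough sketch of that receive-side naming, in Python.  The function
name, the suffix alphabet and the default extension are assumptions
made for illustration; this is not the code the XIP trial server ran.

```python
import secrets
import string
from datetime import datetime

# Assumed alphabet for the 4-character random suffix.
_SUFFIX_CHARS = string.ascii_letters + string.digits

def storage_name(received_at: datetime, ext: str = "ubl") -> str:
    """Build a storage name such as File_20030301_101023_af81.ubl."""
    stamp = received_at.strftime("%Y%m%d_%H%M%S")
    suffix = "".join(secrets.choice(_SUFFIX_CHARS) for _ in range(4))
    return f"File_{stamp}_{suffix}.{ext}"

# The name depends only on reception time and randomness, never on content.
name = storage_name(datetime(2003, 3, 1, 10, 10, 23))
```

The point of the sketch is that nothing in such a name says anything
about what is inside the file.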

The UBL-processor cannot perform some kind of auto-syncing
and conclude that the 3 files are the same, without having
an equality function defined (relates to your (1) above).
By looking at out-of-band info like the filenames generated
as distinguishing IDs from the transport level, a UBL-processor
has to keep the 3 separate as if they have some kind of 
important business meaning.  I'm coming from the point that
UBL-processor can do better than that, since distinguishing
document instances appears as a need universal enough to
be in UBL.


(4) has to do with the process of creating UBL (or XIP)
documents.  While working on XIP, we imagine that documents
do not get instantiated like copying a file.  A paper purchase
order on manufacturing floor needs to get routed via a few
persons before it is finalised as approved.  Along this
work flow process, a UBL-processor (or an equivalent workflow
system which has a UBL-processing submodule) will need to
pinpoint a particular document and pull it out for continuance
or further downstream processing, perhaps linking to a 
database as well.



(2) & (4) point to a need for (1), which is asking for
an identification function that when given a document
instance content, generates a convenient string, number
or something easily manipulatable:
	docID = DOC_ID(document-instance)

(3) points to a need for an equality-test (or comparison) 
function that takes 2 document instances and says whether
they are equal or not:
	equalP = DOC_EQ(docID_1, docID_2)

The ideal case is that we can do
	equalP = DOC_EQ(DOC_ID(document-instance-1),
			DOC_ID(document-instance-2))
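
To make the two functions concrete, here is a minimal Python sketch.
It assumes the ID is a UUID carried inside the document content; the
element names Order and DocumentUUID, and the UUID value itself, are
hypothetical and not taken from any UBL schema.

```python
import uuid
import xml.etree.ElementTree as ET

def doc_id(document_instance: str) -> uuid.UUID:
    """DOC_ID(): read the UUID carried inside the document content."""
    root = ET.fromstring(document_instance)
    node = root.find("DocumentUUID")  # hypothetical element name
    if node is None or not node.text:
        raise ValueError("document instance carries no UUID")
    return uuid.UUID(node.text.strip())

def doc_eq(doc_id_1: uuid.UUID, doc_id_2: uuid.UUID) -> bool:
    """DOC_EQ(): two instances are the same document if their IDs match."""
    return doc_id_1 == doc_id_2

PO = ("<Order><DocumentUUID>123e4567-e89b-12d3-a456-426614174000"
      "</DocumentUUID><ID>PO-1</ID></Order>")

# The same content stored under three transport-level names, as in (A)-(C).
stored = {
    "File_20030301_101023_af81.ubl": PO,
    "File_20030301_101024_39FZ.ubl": PO,
    "PO_1.ubl": PO,
}
ids = [doc_id(content) for content in stored.values()]
all_equal = all(doc_eq(ids[0], other) for other in ids[1:])
```

With something like this, a processor can treat the three stored files
as one document without consulting the filenames at all.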



>>For (1), I believe that a set of "key" information from a document (much
>>like a relational database table) should be used to unique an "instance"
>>of a document as an entity. 

Yes, that's to find DOC_ID(document-instance).  But what is
"key" is the subject of discussion.  We want to avoid having
vendor-specific definitions of what is "key", so as to prevent
interoperability issues later.



>>For example (speaking very generically
>>here), if we assume that a PO Number is unique through time, and a PO
>>can be modified, then the PO Number would (of course) not be sufficient
>>to uniquely identify the PO document. Rather, we would need an
>>additional field/element (please pick favorite word) that would signify
>>"iteration".  So the [PO Number + Iteration Number] would be unique over
>>time.

Yes, but as elaborated in my previous emails, this
document-instance ID needs to be unique across platforms,
ERP/MRP/DP systems and their respective versions, and across
branches (that may run similar PO numbers together), and
in general we need to avoid an arbitrary definition of the DOC_ID()
function.  We honestly don't really know how many variables
we have to include in DOC_ID() in order that the document
instance will be uniquely comparable to all documents in the
world in all industries across all networks, systems and
software, country and time.  But that work has been done 
by UUID, so it's easy to just use it.
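
For illustration, assigning such an ID at document creation could look
like the Python sketch below.  The helper name is hypothetical, and
note that for the random-based variant used here uniqueness is a
matter of overwhelming probability rather than an absolute guarantee.

```python
import uuid

def new_document_id() -> str:
    """Create an ID for a freshly instantiated document."""
    # uuid4() is random-based; uuid1() would instead combine a
    # timestamp with a host node identifier.
    return str(uuid.uuid4())

# Two branches issuing the same PO number still get distinct document IDs.
branch_a = {"po_number": "PO-1", "doc_id": new_document_id()}
branch_b = {"po_number": "PO-1", "doc_id": new_document_id()}
```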



>>For (2), I think the question is: If the same document is transmitted
>>more than once (in the example above, the two transmitted documents
>>would have the same [PO Number + Iteration Number]), what are the
>>ramifications? Are there legal ramifications which would require one of
>>the transmissions to be identified as the "binding" transmission?

Good question.  Subject of great debate with great implications
apparently.  I don't really know what the real answer is, but
just to try to find out what the answers might look like, we
could always discuss it
from all aspects, take inputs from different parties like
yourself, and carry out trials (that led to our XIP project).

Just restricting to discussing DOC_ID() and DOC_EQ() and
ignoring other issues such as security, my take #1 is probably 
that legal discussions lean more towards security and related
non-repudiation issues, which indirectly build on the robustness
of distinguishing document instances from one another.  That
leads us to think of hashing functions for UBL documents as
an ID, which basically is again finding an appropriate number
of parameters to feed the hashing function.
I think there's already some work going on signing XML documents 
in general, but that's for the purpose of not changing the 
document instances, i.e. freeze-framing the document as final.
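
As a sketch of the hashing idea only: the Python below (3.8 or later)
hashes a canonicalised form of the XML, so that attribute order and
indentation do not change the result.  Using canonical XML plus
SHA-256 is an assumption made for illustration, as are the element
and attribute names; it is not something UBL or XIP prescribes.

```python
import hashlib
import xml.etree.ElementTree as ET

def content_hash(document_instance: str) -> str:
    """Hash the canonical form of a document, not its raw bytes."""
    canonical = ET.canonicalize(xml_data=document_instance, strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same content, serialised differently (attribute order, whitespace).
doc_a = '<Order currency="USD" branch="SG"><ID>PO-1</ID></Order>'
doc_b = '<Order branch="SG" currency="USD">\n  <ID>PO-1</ID>\n</Order>'
# Different content.
doc_c = '<Order branch="SG" currency="USD"><ID>PO-2</ID></Order>'

same = content_hash(doc_a) == content_hash(doc_b)
different = content_hash(doc_a) != content_hash(doc_c)
```

Unlike a UUID, an ID of this kind changes whenever the content
changes, which is why it fits the freeze-framing case better than the
workflow case in (4).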

As mentioned earlier, the suggestion of UUID doesn't do so much
on the security side or the freeze-framing.  It's on the other
issues (1)-(4) that I think we're looking at.



>>I am honestly not sure of the answer to that.  But if there are no
>>ramifications (except additional transmission time/money), then I wonder
>>whether it is indeed an issue or not.

Dunno, perhaps it depends on whether at UBL, we want to look
at it as an issue or not.  Ultimately, the effort of finding
DOC_ID() function has to be done;  it's just whether it's in
UBL or not.  As you mentioned, if document instances are saved
in ebXML registry, then your DOC_ID() is equated with ebXML
registry's definition.  But my take #2 is that if all we need
(assuming we need) is a DOC_ID() function and that UUID()
is universally guaranteed to serve the DOC_ID() purpose, is
inexpensive to compute, and provides the uniqueness guarantee
across space-time, why look elsewhere?


Best Regards,
Chin Chee-Kai
SoftML
References:
- Re: [ubl-lcsc] Document instance UUID
  - From: Chiusano Joseph <chiusano_joseph@bah.com>