xdi message

Subject: RE: [xdi] Rationale for pursuing Dataweb architecture
From: "Drummond Reed" <drummond.reed@cordance.net>
To: "'Sakimura, Nat'" <n-sakimura@nri.co.jp>,<xdi@lists.oasis-open.org>
Date: Wed, 17 Nov 2004 00:17:48 -0800
Nat, sorry for taking so long to reply - just too much on the plate. See my
answers inline below marked ###.

=Drummond 

-----Original Message-----
From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp] 
Sent: Saturday, November 13, 2004 9:29 AM
To: Drummond Reed; xdi@lists.oasis-open.org
Subject: RE: [xdi] Rationale for pursuing Dataweb architecture

There are a couple of points/questions that I would like to make here: 

(1)  Business Card Examples: 
Looking at the Business Card Examples, I see that they have come 
very close to what we talked over at the f2f. The only deviations 
from what we discussed there are: 

(a) Tag names: What was called <data> or <body> at the f2f has 
     been changed to <complex> etc. 

### What the most recent "dataweb" schema model proposes is that the <Data>
tag is always wrapped in a <Resource> tag. I'm composing another email that
will explain the rationale for this schema in more depth.

(b) Introduction of <Instance> tag: What is the use for it? 
      I would appreciate an explanation of it. 

### The use of the six XRI type tags (Physical, Logical, Type, Instance,
Version, and now XPath in the v2 dataweb schema) is to increase indexing
efficiency. "Instance" ends up being similar to your "Address" tag, in that
it identifies a specific instance of a Type. From top layer to bottom layer,
the nesting is:
* Physical segments contain Logical segments (if there are logical
authorities represented at a physical endpoint)
* Logical segments contain Type segments (if there are multiple datatypes
representing attributes of a logical authority)
* Type segments contain Instance segments (if there are multiple instances
of a datatype)
* Instance segments contain Version segments (if there are multiple versions
of an instance)
* Version segments contain XPath segments (if there are multiple resources
contained within an XML document in the Data node)
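
Illustrative sketch (not part of the schema draft): the nesting order listed
above can be written down as a simple depth check. The names SEGMENT_ORDER
and is_valid_nesting are hypothetical, and the sketch assumes that a level
may be skipped when it is not needed, which the list implies but does not
state outright.

```python
# Nesting order of the six XRI segment types, outermost first,
# taken from the list above.
SEGMENT_ORDER = ["Physical", "Logical", "Type", "Instance", "Version", "XPath"]


def is_valid_nesting(path):
    """Return True if every segment type in `path` sits deeper than the one before it.

    Assumption: levels may be skipped (e.g. Logical directly containing
    Instance). An unknown segment name raises ValueError.
    """
    depths = [SEGMENT_ORDER.index(name) for name in path]
    return all(outer < inner for outer, inner in zip(depths, depths[1:]))


print(is_valid_nesting(["Physical", "Logical", "Type", "Instance"]))
print(is_valid_nesting(["Logical", "Instance", "Version"]))
print(is_valid_nesting(["Instance", "Type"]))
```

The first two paths follow the top-to-bottom order; the third puts a Type
segment inside an Instance segment and is rejected.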

(c) The metadata parts are not enclosed in a <meta> or <header> tag. 

### Again, the reason for this is that the <Resource> tag already provides a
container. Also, with this schema approach, <Resource>s can contain
<Resource>s directly, without having to look inside <Data> tags, which
increases indexing and traversal efficiency.
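
Illustrative sketch (not from the schema draft): when <Resource> elements
nest directly, a traversal only needs to follow <Resource> children and
never has to open a <Data> payload. The sample document, the "id"
attribute, and the walk function are all assumptions made for this
example; only the <Resource> and <Data> element names come from the
discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the "id" attribute is invented for illustration.
DOC = """
<Resource id="card">
  <Resource id="name"><Data>Example Name</Data></Resource>
  <Resource id="contact">
    <Resource id="email"><Data>someone@example.com</Data></Resource>
  </Resource>
</Resource>
"""


def walk(resource, depth=0):
    """Yield (depth, id) for a resource and all nested resources.

    Only direct <Resource> children are followed; <Data> is never inspected.
    """
    yield depth, resource.get("id")
    for child in resource.findall("Resource"):
        yield from walk(child, depth + 1)


root = ET.fromstring(DOC)
outline = list(walk(root))
for depth, resource_id in outline:
    print("  " * depth + resource_id)
```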

(2) Dictionary
Google does not use a dictionary. It uses string indexing. 
Neither does OpenText. If it relied on a dictionary, it would not 
be able to search for new concepts, new words, new languages. 
Not relying on a dictionary was the strength of Google. 

### As my message points out, Google was possible precisely because the
dictionaries already existed - they are the natural language dictionaries
humans use implicitly. Without them, we wouldn't know what to search for on
Google. So in this sense Google *does* rely on a dictionary - the dictionary
inherent in human language.

### Secondly, XDI dictionaries are living documents just like any other XDI
documents. They can add new concepts, words, and languages just like they
could be added to a paper dictionary (only much faster). A Google keyword
index could be treated as an XDI dictionary if that was the best way to
approach the problem of dictionary creation and management. Some large
communities (e.g., del.icio.us) are approaching collective metadata
management just this way.

I do not want XDI to be relying on a dictionary in the sense 
it is referred to here. It would be a huge hindrance. We could 
use a dictionary, but not require it. 

### I definitely agree use of an XDI dictionary should never be required.
There are some XDI applications that won't need one. However there are many
other XDI applications where a dictionary (or more likely, an interoperable
community of dictionaries) is precisely what will make data sharing among a
large population feasible. For example, look at the prospective community of
XDI i-brokers. They will need to assist their customers in exchanging
literally thousands of different datatypes, datatypes which may be shared
across hundreds of different XML schemas. Without a set of dictionaries
helping to coordinate the mapping of these datatypes, how is a large data
sharing community like this going to solve the data mapping problem? 
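
Illustrative sketch (hypothetical throughout): one way to picture a
dictionary coordinating datatype mapping is a lookup from each schema's
field name to a shared datatype identifier. The schema names, field names,
identifiers, and the translate function below are invented for this
example and do not come from any XDI draft.

```python
# Hypothetical dictionary: (schema name, field name) -> shared datatype identifier.
DICTIONARY = {
    ("schema_a", "tel"): "+phone",
    ("schema_b", "PhoneNumber"): "+phone",
    ("schema_a", "mail"): "+email",
    ("schema_b", "EmailAddress"): "+email",
}


def translate(record, source_schema, target_schema):
    """Re-key `record` from source_schema field names to target_schema field names.

    Fields with no dictionary entry, or no counterpart in the target
    schema, are dropped.
    """
    # Invert the dictionary for the target schema: datatype -> field name.
    target_fields = {
        datatype: field_name
        for (schema, field_name), datatype in DICTIONARY.items()
        if schema == target_schema
    }
    translated = {}
    for field_name, value in record.items():
        datatype = DICTIONARY.get((source_schema, field_name))
        if datatype in target_fields:
            translated[target_fields[datatype]] = value
    return translated


record_a = {"tel": "555-0100", "mail": "someone@example.com", "nickname": "ex"}
print(translate(record_a, "schema_a", "schema_b"))
```

With N schemas, each schema is mapped once to the shared identifiers
instead of once to every other schema.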

(3) Proliferation of HTML. 
One of the biggest factors for HTML proliferation was its 
astounding simplicity at the outset. One could trivially 
write a page with merely a notepad. Subsequently, HTML 
gained substantial complexity, but I believe that the 
simplicity at the outset was the key. The same is true for 
HTTP. One could write an HTTPD trivially. 

### I profoundly agree with this point. My devotion to simplicity is
extreme. However, as Einstein said, "Everything should be made as simple as
possible but no simpler."

### My fundamental view is that while HTML was a way to standardize markup
for human-readable documents, XDI is a way to standardize markup for
machine-readable databases. In other words, while XDI documents are XML
documents which of course can be read by a human, they are likely to be
roughly as human-friendly as HTML tables. Why? Because tables are the
natural organizational structure of databases.

### Now, as table markup goes, I believe the simplicity of the XDI schema
should make XDI document structure easy to understand for anyone who spends
15 minutes understanding the basic relationships. After all, there are only
four basic relationships expressed in the dataweb schema: a resource
contains either: a) data, or b) a reference, or c) a collection of
resources, or d) a collection of references. And every resource can be
identified with a combination of one or more of six types of XRIs. That's
it.
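
Illustrative sketch (not the dataweb schema itself): the four
relationships and the XRI requirement can be modelled as one small data
type. The class, its field names, and the identifier strings are
hypothetical, and the sketch assumes "either" means exactly one of the
four kinds of content per resource.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    xris: List[str]                    # one or more identifiers (made-up strings here)
    data: Optional[str] = None         # a) data
    ref: Optional[str] = None          # b) a reference to another resource
    resources: List["Resource"] = field(default_factory=list)  # c) collection of resources
    refs: List[str] = field(default_factory=list)              # d) collection of references

    def kind(self) -> str:
        """Return which of the four relationships this resource expresses."""
        if not self.xris:
            raise ValueError("a resource needs at least one XRI")
        present = []
        if self.data is not None:
            present.append("data")
        if self.ref is not None:
            present.append("reference")
        if self.resources:
            present.append("collection of resources")
        if self.refs:
            present.append("collection of references")
        if len(present) != 1:
            raise ValueError(f"expected exactly one kind of content, found {present}")
        return present[0]


card = Resource(
    xris=["example-card"],
    resources=[
        Resource(xris=["example-name"], data="Example Name"),
        Resource(xris=["example-email"], ref="example-email-target"),
    ],
)
print(card.kind())
print([child.kind() for child in card.resources])
```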

### Is this as simple as HTML 1.0? No. But is it as simple as HTML tables?
Yes. And considering we are talking about a single XML schema that can
identify, describe, exchange, link, and synchronize all the data in the
world, that's pretty good for roughly a dozen elements. Even HTML has 40+
elements.

### Again, at the risk of overstating this point, I believe what we're
building with XDI is a logical distributed database, where every node is
identifiable with at least one XRI. The reason XDI adopters will be
motivated to map their existing data sources into this XDI logical database
can be summed up in a twist on the Java slogan: "Map once, link everywhere".

### What I'm hoping we can achieve is the simplest possible design for the
XDI schema that will fulfill the requirements for this logical distributed
database to be able to identify, describe, exchange, link, and synchronize
any type of data as efficiently as possible.

### I still have the assignment from last week's call to compare the
capabilities of the "data envelope" schema approach with the "dataweb"
schema approach. I'll try to get that done before the call tomorrow.

=Drummond 

> -----Original Message-----
> From: Drummond Reed [mailto:drummond.reed@cordance.net] 
> Sent: Thursday, November 11, 2004 9:16 AM
> To: xdi@lists.oasis-open.org
> Subject: [xdi] Rationale for pursuing Dataweb architecture
> 
> XDI TC Members and Observers,
> 
> As published today in the draft minutes of the F2F meeting 
> two weeks ago in Denver (see 
> http://www.oasis-open.org/apps/org/workgroup/xdi/download.php/10001/MINUTES%20OF%2010-28-29-04%20XDI%20TC%20FACE%20TO%20FACE%20MEETING%20%28Official%29.txt),
> the core topic discussed at the meeting was the two 
> potential architectural models the XDI TC could follow.
> 
> These can be loosely summarized as the "data envelope" or 
> "SOAP-for-data" model and the "Dataweb" or "HTML-for-data" model.
> 
> While most of you know I am a strong Dataweb architecture 
> advocate, some of the concepts from the data envelope model 
> are very attractive, and they have very much influenced my 
> thinking about the Dataweb model. This is reflected in a new 
> schema proposal and several example documents using this 
> schema that I posted last night:
> 
> * New schema proposal:
> http://www.oasis-open.org/committees/download.php/9988/draft-xdi-dataweb-schema-v1.xsd
> 
> * Simple XDI business card (w/all data referenced):
> http://www.oasis-open.org/committees/download.php/9989/draft-example-dataweb-bizcard-short-v1.xml
> 
> * Long-form XDI business card (w/all references resolved):
> http://www.oasis-open.org/committees/download.php/9990/draft-example-dataweb-bizcard-long-v1.xml
> 
> * Example of XDI Descriptor in this XDI format:
> http://www.oasis-open.org/committees/download.php/9991/draft-example-dataweb-XRID-v1.xml
> 
> However, in going through this work, and after another good 
> conversation with Dave last Friday, I have become more deeply 
> convinced about the Dataweb model. This email summarizes my 
> rationale in preparation for further discussion on today's TC 
> call. It breaks into three parts:
> 
> * Value proposition for the Dataweb
> * The role of XDI dictionaries
> * The need for an XDI Logical Data Object Model (LDOM)
> 
> VALUE PROPOSITION FOR THE DATAWEB
> 
> The root of my rationale is the core value proposition that 
> "XDI can do for global data sharing what the Web did for 
> global content sharing." Here's a more detailed way of 
> framing that value proposition that Dave and I discussed last 
> Friday. It starts with the value proposition for the Web:
> 
> ***Value Proposition for the Web***
> 
> With the Web, we wanted to create a single presentation 
> engine (browser) for all content without knowing anything 
> directly about the content. Besides the visualization markup, 
> the presentation engine doesn't need to know anything about 
> the content.
> 
> Although this presented potentially a huge barrier to 
> adoption - the need for every content publisher to markup 
> their content in this new markup format - there was a value 
> proposition that successfully drove millions of content 
> publishers to do just that:
> 
> 	"If you put your content into this format, it can be: 
> a) rendered on every desktop in the world, and b) referenced 
> and linked to/from any other content in the world, and c) 
> searched and indexed by any content search engine in the world."
> 
> *****
> 
> Bingo! The result is history. The greatest transformation of 
> global information infrastructure ever.
> 
> The core concept of the Dataweb is to do the same thing for 
> machine-readable data that the Web did for human-readable 
> content. In fact, we can express this as literally a 
> word-for-word transposition of the above value proposition:
> 
> ***Value Proposition for the Dataweb***
> 
> With the Dataweb, we want to create a single data interchange engine
> (i-broker) for all data without knowing anything directly about the 
> data. Besides the data control markup, the data interchange engine 
> doesn't need to know anything about the data.
> 
> Although this presents potentially a huge barrier to adoption 
> - the need for every data publisher to markup their data in 
> this new markup format - there is a value proposition that 
> can successfully drive millions of data publishers to do just that:
> 
> 	"If you put your data into this format, it can be: a) 
> interchanged with every system in the world, and b) 
> referenced and linked to/from any other data in the world, 
> and c) searched and indexed by any database search engine in 
> the world."
> 
> *****
>
> To me, this perfectly describes the goal of XDI: a common 
> data interchange format (represented by a single common XML 
> schema) together with a common data interchange service for 
> adding, modifying, deleting, and processing XDI documents.
> 
> DATAWEB DICTIONARIES
> 
> What's more, when we're operating at the level of 
> machine-readable data vs. human-readable content, I believe 
> there is another major element to the Dataweb value proposition 
> that is missing (in a direct way) from the Web value proposition: 
> Dataweb dictionaries. Again this is probably best described via 
> analogy to the Web.
> 
> Arguably the single most valuable aspect to the Web is the 
> ability to locate desired content almost instantly, using 
> search engines such as Google. However this only works because 
> of a simple fact: human languages inherently consist of shared 
> dictionaries of concepts ("keywords") with which the search 
> engines can create their indexes. It is only due to our common 
> knowledge of these dictionaries (the copies we all carry around 
> in our own heads) that search engines can do their magic. 
> Otherwise they wouldn't know how to index and we wouldn't know 
> what to enter as search criteria.
> 
> When it comes to the Dataweb, and we move from the sphere of 
> human-readable content to machine-readable data, this problem 
> is magnified immensely. The biggest single problem with 
> sharing machine-readable data across systems is that there 
> are no humans in the loop to do the "fuzzy matching" that 
> humans are so good at (and that search engines like Google 
> can help so much with). In order to actually share data across 
> systems, machines need to be able to do *exact bit-for-bit 
> matching*. No ambiguity.
> 
> The problem gets even worse when we consider that today there 
> does not exist anything close to a universal data dictionary 
> from which such matching could be done. In other words, it's 
> not like the Web, where all the dictionaries (common 
> vocabularies of human language) already existed, and we just 
> needed to find a common way to represent them. With the 
> Dataweb, the dictionaries don't even exist yet.
> 
> In fact, the closest thing to those dictionaries are the 
> existing XML schemas or RDF vocabularies that have been 
> created in order to establish common semantics for data interchange.
> 
> So I would argue that, just as it became a fundamental design 
> goal of XML to make XML schemas expressible in XML itself 
> (thus leading to the W3C XML Schemas specification), it must 
> be a fundamental design goal of XDI to make XDI dictionaries 
> expressible in XDI itself. Because unlike XML, which had DTDs 
> to turn to, XDI implementations will have no practical way of 
> interoperating without XDI dictionaries. XDI dictionaries are 
> the only way to get the direct bit-for-bit data matching 
> necessary for true interoperability.
> 
> THE NEED FOR AN XDI DATA OBJECT MODEL
> 
> As discussed above, the Web solved the problem of content 
> interoperability by adopting a single markup format, HTML, 
> which any rendering engine (browser) could display. This 
> common format, which later led to the development of XML, 
> also led to a common object model for parsing and 
> manipulating "document objects". This was the Document 
> Object Model (DOM).
> 
> It follows that if data-oriented systems are to adopt a 
> common model for data interchange, and if this model is to be 
> based on a common XML data format, this format must reflect a 
> common logical data object model, or LDOM.
> 
> To be universal, the LDOM must be very simple and capable of 
> expressing fundamental relationships between data elements 
> the same way XML expresses fundamental relationships between 
> content elements. In the work over the past six months, we 
> have been looking at XDI schema proposals that boiled this 
> down to just two types of relationships: 1) hierarchical 
> relationships, and 2) peer-to-peer, or "web" relationships.
> 
> The other key requirement of an LDOM is that every data 
> element be uniquely addressable (just as it is in a 
> database). Thus the requirement in the schema proposals so 
> far that every resource be addressable via at least one XRI.
> 
> A successful LDOM, then, would be representable in a single 
> XML schema that, while capable of carrying existing XML data 
> as a "payload", would inherently require markup of some 
> metadata into this new format, just as HTML was capable of 
> carrying existing text and graphics but required at least 
> some markup in HTML format.
> 
> That, in a nutshell, is what I believe we should be driving 
> for with the XDI schema.
> 
> ***EOM***
References:
- RE: [xdi] Rationale for pursuing Dataweb architecture
  - From: "Sakimura, Nat" <n-sakimura@nri.co.jp>