
Subject: [ubl-lcsc] Functional Dependency and Normalization paper
From: Tim McGrath <tmcgrath@portcomm.com.au>
To: "'ubl-ndrsc@lists.oasis-open.org'" <ubl-ndrsc@lists.oasis-open.org>,"'ubl-lcsc@lists.oasis-open.org'" <ubl-lcsc@lists.oasis-open.org>
Date: Tue, 10 Sep 2002 16:27:48 +0800
My apologies for being late with this - believe it or not, it has taken 
several re-writes to get it this far!  I also cheated on the number of 
pages by adding two lengthy appendixes.

The paper complements Eve's paper on 'list containers' and describes the 
third type of container - those based on logical groups.

Hopefully we can discuss this and Eve's paper during our joint meetings 
this week.


*Grouped Element Containers in UBL

*TOC
  Executive summary
  Introduction and definitions
  The particular value of containers based on logical grouping
  Dependency
  Normalization
  Expressing a normalized model in XML
  Limitations of the normalized model
  Conclusion
  Appendix A.  Example of normalization
  Appendix B.  Example of XML schema construction

*Executive summary

Whilst there is little doubt that we need some grouping of elements 
(i.e. containers) in our logical models and our schemas, this write-up 
considers how we may formalize the identification and design of these 
groups.  Formalization is important to allow consistency and replication 
of the UBL Library development work.  More importantly, correctly 
grouped containers add semantic value to our Library and promote 
re-usable components.

This write-up promotes the idea of defining containers based on 
dependent elements using a technique known as normalization. 

Grouped element containers are semantic constructs and must be logically 
modeled. However, to ensure alignment between the logical model and the 
physical form this write-up also describes an approach to expressing 
these normalized models in XML form.

*Introduction and definitions

Somebody wise once said...
"When we say "container" in this discussion, we mean an XML element, 
plain and simple.  XML is a hierarchical technology, leading to the 
possibility -- indeed the likelihood -- of significantly nested element 
structures in nearly all XML instance documents.

A BIE is a model for a piece of business information to which has been 
applied a semantically unique and useful definition (and of course also 
an identified business context, because it's a BIE and not a CC, but 
that's not important right now).  The containership discussion revolves 
pretty much entirely around ABIEs, which are collections of other BIEs 
and thus have a kind of hierarchy themselves.

Note that our process of turning our logical model (spreadsheet) into a 
physical model (XSD) takes ABIEs and turns them into complex datatypes, 
which each govern one or more XML elements.  So there is more than just 
a vague similarity here -- ABIE hierarchy pretty much turns into XML 
hierarchy, to a first approximation." [Maler, 2002]

We shall use the term structure to refer to aggregations of data 
entities (ABIEs) and the term container to denote these structures in 
XML syntax.

*The particular value of containers based on logical grouping

Well-engineered document schemas need to have clear, unambiguous 
definitions of data, a recognition of the logical sets (or containers) 
in which they belong and the way these sets are related to each other.  
These definitions allow us to minimize redundancy, localize dependencies 
and ensure that information can be maintained in logical sets that 
reflect the constraints of the real world. 

Defining the reusable data structures in documents is something that can 
be done intuitively.  It might sound right to group Name, Address and 
DateOfBirth into a Person container.  However, if we want to have 
strongly re-usable structures we need a more formal and consistent 
approach for grouping components.  

*Dependency

Conventional data modeling practices include formal rules for designing 
logical structures.  In fact, much of what document analysts have done 
in the past, albeit informally, is to establish what data analysts call 
functional dependencies; we will refer to these simply as dependencies.  
We should apply the same rigor to document schema design that we have 
customarily applied to database design.

Dependency means that if the value of one attribute changes whenever the 
value of another attribute changes, then the former is dependent on the 
latter.  Formally, this has been defined as:

"Given an entity, E, attribute Y of E is functionally dependent on 
attribute X of E if and only if, whenever two instances of E agree on 
their X-value, they also agree on their Y-value."

For example, suppose the price per sheet of printer paper is reduced if 
the pack size changes from reams to cartons.  This means pricing per 
sheet is dependent on pack size. The values for Name, Address and 
DateOfBirth are all dependent on the specific Person in question.

Examples of dependent elements

X    Pack Size          Sheet     Ream      Carton
Y    Price per sheet    0.14      0.09      0.07

X    Employee           1234      5678      9876
Y    Name               Jones     Smith     Jones
Y    Address            Boston    London    London
Y    Date of Birth      121260    010272    060384
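
The definition quoted above can be tested mechanically against sample 
data.  The following is a minimal sketch in Python, not part of any UBL 
tooling; the function name is_dependent and the dictionary layout are 
assumptions made for illustration.  Note that sample data can only 
refute a dependency; it cannot prove that one holds in the real world.

```python
def is_dependent(rows, x, y):
    """Return True if, in these rows, attribute y is functionally dependent on x.

    Follows the quoted definition: whenever two rows agree on their
    x-value, they must also agree on their y-value.
    """
    seen = {}
    for row in rows:
        # setdefault records the first y-value seen for this x-value
        if seen.setdefault(row[x], row[y]) != row[y]:
            return False
    return True


# The employee example from the table above (hypothetical representation).
employees = [
    {"employee": "1234", "name": "Jones", "address": "Boston"},
    {"employee": "5678", "name": "Smith", "address": "London"},
    {"employee": "9876", "name": "Jones", "address": "London"},
]

print(is_dependent(employees, "employee", "name"))  # True
print(is_dependent(employees, "name", "address"))   # False: two Joneses, two addresses
```

The second result shows why Name makes a poor identifier: two 
employees share the name Jones but live at different addresses.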


In database theory, a formal technique for identifying and defining 
these dependencies is known as normalization. 

*Normalization

Normalization is a series of analytic steps that:

1.    Ensures that all data elements in a group are discrete, i.e., can 
only take a single value.  For example, no Person can have more than one 
DateOfBirth. (NB this is what separates this concept from the 'List' 
container type.)

2.    Establishes the primary identifier of each logical group.  For 
Person, this may be the Name of the Person. Obviously this example is 
simplistic; a person's name is not really a practical identifier since 
some names are duplicates (like John Smith).  For this reason we 
generally fabricate an identifier, such as Employee Number or SSID.

3.    Establishes groups of data that are fully dependent on each value of 
the primary identifier, i.e., for each instance of the group.  For 
example, each time we introduce a new Person by adding a Name, we can 
also have a DateOfBirth and Address.

4.    Ensures that all members apart from the primary identifier are 
independent of one another.  For example, the value of the DateOfBirth 
does not affect the Address and vice versa.

For database designers, normalization yields sets of relational tables. 
For UBL, normalization yields the logical structures that put containers 
or "depth" into document schemas. The rationale is the same: 
"recognizing dependency is an essential part of understanding the 
meaning or semantics of the data" [Date, C.J. An Introduction to 
Database Systems 3rd Edition, Addison-Wesley, 1981. pp.240-242]. 

*Expressing a normalized model in XML

While the principles of normalization can be applied to the design of 
document schemas to achieve similar goals as in database design, these 
are not identical goals.  Database models and document schemas are 
different in key ways.  Most apparent is that while most databases are 
built using relational structures, documents are generally hierarchical 
in structure. Therefore, the actual implementation of normalized data 
structures in XML schemas will differ from the logical model.  However, 
these differences can be derived and potentially automated. 

Constructing XML schemas (i.e. containers) from our normalized logical 
model requires defining a hierarchy using pathways through the model.  
These pathways are determined by the requirements of the application or 
message.  Appendix B describes how this hierarchical pathway can be 
derived from a normalized data model.  In 
fact, this is similar processing to that required when creating views 
from relational tables in a database application.
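
As a rough illustration of that analogy, the sketch below assembles a 
hierarchical order document from normalized structures by following 
foreign keys, much as a database view would join tables.  It assumes 
the order, seller, order item and item structures developed in 
Appendix A.  The Python dictionaries, the function names add_child and 
order_view, and the element names are illustrative choices only.

```python
import xml.etree.ElementTree as ET

# Normalized structures, using sample values from Appendix A.
orders = {"003-27898": {"buyer": "XYZ Co.", "account": "WWW", "orderdate": "12-01-02"}}
sellers = {"WWW": {"name": "WWWickets"}}
order_items = [
    {"ordernumber": "003-27898", "itemnumber": "46372828", "quantity": 4},
    {"ordernumber": "003-27898", "itemnumber": "46372829", "quantity": 354},
]
items = {
    "46372828": {"itemdescription": "large wickets", "unitprice": 256},
    "46372829": {"itemdescription": "small wickets", "unitprice": 12},
}


def add_child(parent, tag, text):
    """Append a simple text element to parent and return it."""
    child = ET.SubElement(parent, tag)
    child.text = str(text)
    return child


def order_view(order_number):
    """Build the hierarchical view of one order by following foreign keys."""
    order = orders[order_number]
    root = ET.Element("order")
    add_child(root, "ordernumber", order_number)
    add_child(root, "buyer", order["buyer"])
    # The account foreign key is replaced by the seller container itself.
    seller = ET.SubElement(root, "seller")
    add_child(seller, "account", order["account"])
    add_child(seller, "name", sellers[order["account"]]["name"])
    add_child(root, "orderdate", order["orderdate"])
    # Each order item that points back at this order becomes a repeated child.
    for line in order_items:
        if line["ordernumber"] != order_number:
            continue
        order_item = ET.SubElement(root, "orderitem")
        item = ET.SubElement(order_item, "item")
        add_child(item, "itemnumber", line["itemnumber"])
        for tag, value in items[line["itemnumber"]].items():
            add_child(item, tag, value)
        add_child(order_item, "quantity", line["quantity"])
    return root


print(ET.tostring(order_view("003-27898"), encoding="unicode"))
```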

*Limitations of the normalized model

Finally we should note that many of these types of design decisions are 
pragmatic and based on the business rules of the required application.  
It may be appropriate to have other forms of containers in our schemas.  
However, having the normalized model as a reference allows us to make 
these design decisions consciously and formally rather than on an ad-hoc 
basis.  Not every database or document collection needs a data model 
that has been fully normalized - but it helps to know why it isn't.


*Conclusion

Functional dependency is a semantically meaningful way of aggregating 
sets of BIEs.

Normalization is a reasonably formal technique that helps us establish 
these dependencies in a consistent manner. 

Elegant XML constructs can be formed algorithmically from normalized 
logical models.





Appendix A.  Example of normalization.

This appendix describes the process of developing normalized data 
models.  To do so we shall use the following case study...

"A buyer places an order against a seller.  Sellers are identified by an 
account code. For every item on order we have the unit price and 
quantity required together with a description of the item."

Further analysis of this situation may lead us to identify some 
potentially useful business information entities (BIEs).  These can be 
expressed in a single flat structure.  For example:

order (order number, item number, buyer, seller name, account, order 
date, unit price, quantity, item description)

This flat structure we call Zero Normal Form or 0NF.  Normalization also 
has a First, Second and Third Normal form. Whilst it can be extended to 
other higher forms, we shall settle on Third Normal Form as our design goal.

To better understand the dependencies of the BIEs, we should populate 
the structure with some sample data.

order number  item number  buyer    seller name  account  order date  unit price  quantity  item description
A28289        GFS-25       XYZ Co.  WidgetsRUs   WRU      12-01-02    16          32000     widgets
003-27898     46372828     XYZ Co.  WWWickets    WWW      12-01-02    256         4         large wickets
003-27898     46372829     XYZ Co.  WWWickets    WWW      12-01-02    12          354       small wickets
003-27899     XXXGP        XYZ Co.  WWWickets    WWW      13-01-02    99          100       gift packs
003-27899     46372829     XYZ Co.  WWWickets    WWW      13-01-02    12          10        small wickets


* Identifiers and keys
A fundamental principle of normalization is that all structures have a 
unique identifier (known as the primary key).  This establishes the 
identity of each instance of data in the structure.  Furthermore, a 
single BIE may not be sufficiently individual to do this. Sometimes we 
have to use compound keys, such as bank and branch numbers, street 
number and street name, order number and line number, etc. to uniquely 
identify instances of our data structures.  So when we know a primary 
key value, we are referring to one individual identifiable occurrence.  
In our example, this might be:

order (PRIMARY IDENTIFIER [order number], item number, buyer, seller 
name, account, order date, unit price, quantity, item description)

* First Normal Form
The aim of First Normal Form is to ensure that all of the elements 
are discrete i.e. can only take a single value. This is achieved by the 
removal of repeating groups into their own structures.

In our case we note that we have "every item" on an order. This tells us 
that the BIEs that vary with each item should be separated into a 
structure of their own. As in...

order(PRIMARY IDENTIFIER [order number], buyer, seller name, account, 
order date)
order item(PRIMARY IDENTIFIER [order number, item number], unit price, 
quantity, item description)

In terms of our sample data set, this would look like...

order:
order number  buyer    seller name  account  order date
A28289        XYZ Co.  WidgetsRUs   WRU      12-01-02
003-27898     XYZ Co.  WWWickets    WWW      12-01-02
003-27899     XYZ Co.  WWWickets    WWW      13-01-02

order item:
order number  item number  unit price  quantity  item description
A28289        GFS-25       16          32000     widgets
003-27898     46372828     256         4         large wickets
003-27898     46372829     12          354       small wickets
003-27899     XXXGP        99          100       gift packs
003-27899     46372829     12          10        small wickets

When we established these new repeating structures we also included the 
primary identifier of the original structure (in this case, order 
number).  This is known as a foreign key and is a critical part of 
maintaining the relationships between our data structures.   In our 
case, the foreign key enables us to know which order these items relate to.
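
The move to first normal form amounts to projecting the flat structure 
onto two sets of columns and discarding duplicate rows.  A minimal 
sketch of this, assuming the sample data is held as Python dictionaries 
(the names COLUMNS, DATA and project are hypothetical):

```python
COLUMNS = ("order number", "item number", "buyer", "seller name", "account",
           "order date", "unit price", "quantity", "item description")
DATA = [
    ("A28289", "GFS-25", "XYZ Co.", "WidgetsRUs", "WRU", "12-01-02", 16, 32000, "widgets"),
    ("003-27898", "46372828", "XYZ Co.", "WWWickets", "WWW", "12-01-02", 256, 4, "large wickets"),
    ("003-27898", "46372829", "XYZ Co.", "WWWickets", "WWW", "12-01-02", 12, 354, "small wickets"),
    ("003-27899", "XXXGP", "XYZ Co.", "WWWickets", "WWW", "13-01-02", 99, 100, "gift packs"),
    ("003-27899", "46372829", "XYZ Co.", "WWWickets", "WWW", "13-01-02", 12, 10, "small wickets"),
]
flat = [dict(zip(COLUMNS, values)) for values in DATA]


def project(rows, columns):
    """Keep only the named columns and drop duplicate rows, preserving order."""
    result = []
    for row in rows:
        projected = {c: row[c] for c in columns}
        if projected not in result:
            result.append(projected)
    return result


# The order keeps the BIEs that occur once per order ...
order = project(flat, ["order number", "buyer", "seller name", "account", "order date"])
# ... and the repeating group carries order number along as a foreign key.
order_item = project(flat, ["order number", "item number", "unit price",
                            "quantity", "item description"])

print(len(order), len(order_item))  # 3 5
```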

[NB First normal form is the point at which we may consider introducing 
the 'list' container types]

* Second Normal Form

The aim of second normal form is to split off into separate tables any 
BIEs that do not wholly depend on the entire key.  This applies when we 
have compound keys (more than one BIE is needed to uniquely identify an 
instance of a structure).

In our case, if we examine the order item structure we can see that the 
description is dependent on the item involved, but not the order.  By 
this we can interpret that the same item can appear on other orders and 
it will have the same description.  The item description is only 
dependent on the item, not the order. The same might be said of unit 
price. Normally, this would be dependent on the item - not specific to 
an item on a specific order.  As item number is only part of the key of 
order item, second normal form means these BIEs must be separated into 
another structure.  For example:
 
order item(PRIMARY IDENTIFIER [order number, item number], quantity)
item(PRIMARY IDENTIFIER [item number], item description, unit price)

In terms of our sample data set, this would look like...

order item:
order number  item number  quantity
A28289        GFS-25       32000
003-27898     46372828     4
003-27898     46372829     354
003-27899     XXXGP        100
003-27899     46372829     10

item:
item number  item description  unit price
GFS-25       widgets           16
46372828     large wickets     256
46372829     small wickets     12
XXXGP        gift packs        99
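
One way to spot candidates for this split is to test which non-key BIEs 
agree whenever item number alone agrees.  The sketch below does this 
for the first normal form order item data; the helper depends_on and 
the variable names are assumptions for illustration.  Sample data can 
only suggest a partial dependency, so the analyst still has to confirm 
it against the business rules.

```python
COLS = ("order number", "item number", "unit price", "quantity", "item description")
ROWS = [
    ("A28289", "GFS-25", 16, 32000, "widgets"),
    ("003-27898", "46372828", 256, 4, "large wickets"),
    ("003-27898", "46372829", 12, 354, "small wickets"),
    ("003-27899", "XXXGP", 99, 100, "gift packs"),
    ("003-27899", "46372829", 12, 10, "small wickets"),
]
order_item_1nf = [dict(zip(COLS, r)) for r in ROWS]


def depends_on(rows, x, y):
    """True if rows that agree on x always agree on y (in this sample)."""
    seen = {}
    return all(seen.setdefault(row[x], row[y]) == row[y] for row in rows)


non_key = ["unit price", "quantity", "item description"]
# BIEs that depend on item number alone, i.e. on only part of the compound key.
partial = [c for c in non_key if depends_on(order_item_1nf, "item number", c)]
print(partial)  # ['unit price', 'item description']

# Split them off into an item structure keyed by item number.
item = {r["item number"]: {c: r[c] for c in partial} for r in order_item_1nf}
order_item = [{c: r[c] for c in COLS if c not in partial} for r in order_item_1nf]
print(len(item), order_item[0])
```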

* Third Normal Form

To achieve a data model in Third Normal Form we must ensure that all 
Non-Key BIEs are independent of one another.  This is similar to Second 
Normal Form, but now we focus on the BIEs that are not part of the 
primary identifier.

In our case, when we examine the order structure we can see a dependency 
between the seller and the account.  It appears the account is a code 
for the seller.  The seller's name is dependent on the account, and 
neither of these is a primary identifier.  Third normal form means we 
move these into their own structure with the account code as the primary 
identifier.  For example:

order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
seller(PRIMARY IDENTIFIER [account], seller name)

This time when we established these new structures we included the 
primary identifier of our new structure in the original structure (in 
this case, account).  The primary identifier of the new structure 
becomes a foreign key in the original.   Now the foreign key enables us 
to know which seller the order relates to.  This construct is common in 
referencing coded values.

In terms of our sample data set, this would look like...

order:
order number  buyer    account  order date
A28289        XYZ Co.  WRU      12-01-02
003-27898     XYZ Co.  WWW      12-01-02
003-27899     XYZ Co.  WWW      13-01-02

seller:
account  name
WRU      WidgetsRUs
WWW      WWWickets
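
The role of the foreign key can be checked by splitting the order 
structure and then joining it back together.  This sketch takes the 
dependency of seller name on account as given (as stated above) and 
confirms that nothing is lost by the split; the variable names are 
hypothetical.

```python
order_2nf = [
    {"order number": "A28289", "buyer": "XYZ Co.", "seller name": "WidgetsRUs",
     "account": "WRU", "order date": "12-01-02"},
    {"order number": "003-27898", "buyer": "XYZ Co.", "seller name": "WWWickets",
     "account": "WWW", "order date": "12-01-02"},
    {"order number": "003-27899", "buyer": "XYZ Co.", "seller name": "WWWickets",
     "account": "WWW", "order date": "13-01-02"},
]

# New structure: seller, with account as its primary identifier.
seller = {row["account"]: row["seller name"] for row in order_2nf}

# Original structure keeps account, now acting as a foreign key to seller.
order = [{k: v for k, v in row.items() if k != "seller name"} for row in order_2nf]

# Following the foreign key recovers the seller name for every order.
rejoined = [{**row, "seller name": seller[row["account"]]} for row in order]

print(seller)
print(rejoined == order_2nf)  # True
```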

[NB Third normal form is the point Arofan's paper was making with its 
transport.provider example.]

To complete the exercise, as the order item and item structures are 
already in third normal form (no non-key dependencies), our final model 
looks like this...

order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
seller(PRIMARY IDENTIFIER [account], name)
order item(PRIMARY IDENTIFIER [order number, item number], quantity)
item(PRIMARY IDENTIFIER [item number], item description, unit price)






Appendix B.  Example of XML schema construction.

This appendix describes the formalization of constructing hierarchical 
schemas from normalized data models.  We continue with the case study 
from Appendix A, where the normalized model looked like...

order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
seller(PRIMARY IDENTIFIER [account], name)
order item(PRIMARY IDENTIFIER [order number, item number], quantity)
item(PRIMARY IDENTIFIER [item number], item description, unit price)

To create an XML schema we create a 'pathway' through our model that 
satisfies the requirements of the document being defined.  We do this 
through implementing relationships by replacing the foreign keys with 
references to the containers themselves.  The cardinality of the 
relationship tells us which container defines the reference.  This 
process results in a hierarchical view of our logical model.

This can be described in four steps.

Step 1. All normalized structures become candidate containers.  All 
their attributes become sub-elements of the container.

<!element order (ordernumber, buyer, account, orderdate)>
<!element seller (account, name)>
<!element orderitem (ordernumber, itemnumber, quantity)>
<!element item (itemnumber, itemdescription, unitprice)>

Note that these are still only candidates and not the final result.  We 
have yet to establish the relationships between these containers. 

Step 2. From every container, take its given elements and replace each 
foreign key with the name of the native container.  For example in the 
container called order, the element called account becomes the container 
called seller.  As in:

<!element order (ordernumber, buyer, seller, orderdate)>

Similarly, the ordernumber and itemnumber in orderitem are replaced by 
references to the containers, order and item:

<!element orderitem (order, item, quantity)>

Step 3. Start assembling the schema from the root element:

<!element order (ordernumber, buyer, seller, orderdate)>

For every container that references this container (except ones that 
have already been defined), remove the reference and add this container 
to the root container.  Because this represents a potentially n-ary 
relationship it should be given an unbounded expression.  For example, 
in the container called orderitem, there is an order.  This means we 
should remove order from the orderitem container and add an n-ary 
reference to orderitem in our order container.  As in:

<!element order (ordernumber, buyer, seller, orderdate, orderitem*)>
and
<!element orderitem (item, quantity)>

We can't be more precise about cardinality because the occurrences 
permissible are defined in the model's metadata not in the BIEs themselves.

Step 4. Repeat Step 3 for each container in the current (root) 
container, recursing through the model.

For example, our root container, called order, has references to the 
seller container.  Step 3. tells us that seller is referenced by order, 
but as this is already in our schema no changes are required.

<!element order (ordernumber, buyer, seller, orderdate, orderitem*)>
<!element seller (account, name)>

There are no further containers within the seller container and so we 
process the other container in the order, orderitem.  Orderitem is also 
not referenced by any other containers, so does not require any 
changes.  However, it does have a reference to the container called 
item.  So we apply step 3. to the item container. The item container 
holds no further containers and so we end up with a schema of...

<!element order (ordernumber, buyer, seller, orderdate, orderitem*)>
<!element seller (account, name)>
<!element orderitem (item, quantity)>
<!element item (itemnumber, itemdescription, unitprice)>

Which is a consistent view of our original logical model...

order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
seller(PRIMARY IDENTIFIER [account], name)
order item(PRIMARY IDENTIFIER [order number, item number], quantity)
item(PRIMARY IDENTIFIER [item number], item description, unit price)
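
The four steps can be sketched as a small recursive procedure.  This is 
one possible reading of the steps rather than a definitive algorithm.  
It assumes the foreign keys are supplied explicitly as a lookup table 
(FOREIGN_KEYS below), since the steps rely on knowing which elements 
reference which containers; the names STRUCTURES, FOREIGN_KEYS and 
build_schema are hypothetical.

```python
# Step 1 input: each normalized structure with all of its attributes.
STRUCTURES = {
    "order": ["ordernumber", "buyer", "account", "orderdate"],
    "seller": ["account", "name"],
    "orderitem": ["ordernumber", "itemnumber", "quantity"],
    "item": ["itemnumber", "itemdescription", "unitprice"],
}
# (container, foreign key element) -> container that the key identifies.
FOREIGN_KEYS = {
    ("order", "account"): "seller",
    ("orderitem", "ordernumber"): "order",
    ("orderitem", "itemnumber"): "item",
}


def build_schema(structures, foreign_keys, root):
    """Return content-model declarations for the hierarchy starting at root."""
    # Steps 1 and 2: candidate containers, with foreign keys replaced by
    # the name of the container they refer to.
    models = {
        name: [foreign_keys.get((name, attr), attr) for attr in attrs]
        for name, attrs in structures.items()
    }
    defined = []

    def visit(name):
        if name in defined:
            return
        defined.append(name)
        # Step 3: any container not yet defined that references this one
        # loses the reference and becomes a repeating child instead.
        for other, content in models.items():
            if other not in defined and name in content:
                content.remove(name)
                models[name].append(other + "*")
        # Step 4: repeat for each container held by this container.
        for part in list(models[name]):
            child = part.rstrip("*")
            if child in models:
                visit(child)

    visit(root)
    return [f"<!element {n} ({', '.join(models[n])})>" for n in defined]


for declaration in build_schema(STRUCTURES, FOREIGN_KEYS, "order"):
    print(declaration)
```

Run with order as the root, this prints the same four declarations 
that were derived by hand above.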



-- 
regards
tim mcgrath
fremantle  western australia 6160
phone: +618 93352228  fax: +618 93352142