Subject: [ubl-ndrsc] Functional Dependency and Normalization


The theory of functional dependency and normalization has been around a long time and is widely and deeply covered in the literature.  The Library Content Methodology position paper proposes that we apply this theory to the modeling effort in UBL.  Previously, there seemed to be little consensus that this traditional data modeling theory applied to UBL. 
 
Near the end of today's joint call, however, there seemed to be a shift in sentiment.  What I am now hearing is that a) there is a feeling that we need some rules to help us decide when a new Class or ABIE is needed (or when a couple of Classes/ABIE's should be collapsed into one) and b) since functional dependency and normalization theory is the only proposal on the table, folks might be interested in hearing more about what it actually is and how to apply it.
 
To that end I want to augment the presentation in the Library Content Methodology position paper with some deeper coverage from other sources.  Feedback from many early reviewers of the paper said that the normalization discussion didn't make sense to them, or that they failed to see the relevance to our work.  My hope is that by providing a couple more perspectives on the same topic we can together generate a shared understanding of it, and use that understanding to clearly formulate a "bridging theory" -- tying normalization theory to the task at hand in a way we all perceive as valuable.
 
Joe Celko, author of SQL For Smarties, has graciously agreed to provide the attached material on functional dependency and normalization theory.  It takes some work to get through it, but it's worth the effort.  I chose Joe's work because it strikes a nice balance between formalism and approachability.  Thank you Joe Celko!
 
Ullman and Widom's A First Course in Database Systems also has an excellent, understandable, but more formal presentation on the subject.  Also, Chapter 4, "Other Data Models", describes pre-relational data models.  Looking at that, I'm struck by the impression that the old hierarchical database model is a lot like that of an XML document instance, and that the old network database model is similar to an XML document instance with ID and IDREF (or XLink).  In the 2nd ed. of the book, they actually discuss XML.  Unfortunately, I've only got the 1st ed.
 
While the authors of A First Course... are unable to provide us with an electronic copy of their work, there are slides and course notes available online.  If you've read the book, the online material makes sense; unfortunately, if you're new to the subject it isn't all that useful.  On the upside, Jeff Ullman has expressed his group's willingness to provide us with live tutorial assistance.  Thank you Jeff Ullman!  Depending upon our group's response and needs, I can coordinate such a tutorial.
 
Here is an outline of the pertinent sections of A First Course... (included here since it isn't included on the Amazon page):
 
3.4 Functional Dependencies
3.4.1 Definition of Functional Dependency
3.4.2 Keys of Relations
3.4.3 Superkeys
3.4.4 Discovering Keys for Relations
3.4.5 Exercises for Section 3.4
3.5 Rules About Functional Dependencies
3.5.1 The Splitting/Combining Rule
3.5.2 Trivial Functional Dependencies
3.5.3 Computing the Closure of Attributes
3.5.4 Why the Closure Algorithm Works
3.5.5 The Transitive Rule
3.5.6 Closing Sets of Functional Dependencies
3.5.7 Projecting Functional Dependencies
3.5.8 Exercises for Section 3.5
3.6 Design of Relational Database Schemas
3.6.1 Anomalies
3.6.2 Decomposing Relations
3.6.3 Boyce-Codd Normal Form
3.6.4 Decomposition into BCNF
3.6.5 Recovering Information from a Decomposition
3.6.6 Third Normal Form
3.6.7 Exercises for Section 3.6
3.7 Multivalued Dependencies
3.7.1 Attribute Independence and Its Consequent Redundancy
3.7.2 Definition of Multivalued Dependencies
3.7.3 Reasoning About Multivalued Dependencies
3.7.4 Fourth Normal Form
3.7.5 Decomposition into Fourth Normal Form
3.7.6 Relationships Among Normal Forms
 
You'll find that Celko covers much of the same ground as Ullman and Widom though with less depth and rigor.  Each has value.
 
I just read the attachment.  There are a number of areas where a mapping is required from the relational model to our hierarchical, single-document one. 
 
One example is the discussion around Multivalued Dependencies or MVD's.  This is a big deal in the relational model, whereas in the XML one we (almost) always just automatically use an element with maxOccurs='unbounded'. 
 
Another example is that the arguments around "data anomalies" (insert, update, delete anomalies) don't seem pertinent to our situation, where no one "inserts" into a business document.  Rather, we form the business document all at once.  Same issue for updates -- we don't update a business document: we form it and we send it.  I believe that we can motivate the applicability of the theory for our situation, but to do so we will have to rely on arguments not about data anomalies but about maintainability of processing code (e.g. stylesheets) and about specialization of the vocabulary (specializations, ability to apply context methodology).
 
Celko perhaps alludes to this sort of justification when in the section on 1st Normal Form he says:

"Yes, there are some ways to get around these (normalization) problems without changing the tables. We could permit NULLs in the table. We could write routines to check the table for false data. These are tricks that will only get worse as the data and the relationships become more complex. The solution is to break the table up into other tables, each of which represents one relationship or simple fact."

The implication here is that while it's true that normalization neutralizes data anomalies, and the value of that to UBL may be difficult to see, normalization also contributes to a situation where it is the model, and not the processing logic, that largely enforces the consistent structure of the data.  And I think the value of that is easier to see.
 
It's also interesting that in a number of places in the document, Celko points out that the presence of NULL's in data (analogous to optionality in the XML schema) is a red flag -- it usually denotes poor normalization.  It was precisely this analogue (NULL's in data imply poor normalization and that's analogous to optionality in the XML schema) that struck me in Barcelona when someone commented in the plenary that a particular UBL model seemed awfully "flat" and that "everything was optional".

"Avoid NULLs whenever possible. If a table has too many NULLable columns, it is probably not normalized properly. Try to use a NULL only for a value which is missing now, but which will be resolved later. Even better, put missing values into the encoding schemes for that column, as discussed in the section of this book on encoding schemes."

Also, I like this blurb that echoes Jon Bosak's RosettaNet user group anecdote today (they demanded smaller content models):

"A normalized database will tend to have a lot of tables with a small number of columns per table. Don't panic when you see that happen. People who first worked with file systems (particularly on computers which used magnetic tape) tend to design one monster file for an application and do all the work against those records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same. "

So let those ABIE's/Classes/Relations proliferate!  Let them be small and cohesive -- and let the users go forth and be happy.
 
Finally, here's a mapping of terms between the two, er, four really, worlds.  It ain't perfect, but it's pretty darn good.
 
Pure Relational Model | Relational Database Model | CC 1.80   | XML
----------------------+---------------------------+-----------+--------------------------------------------
relation              | table                     | ABIE      | complex type definition
attribute             | column                    | property  | XSD (local) element declaration
tuple                 | row                       | instance? | element content (in instance document)
NULLABLE              | NULLABLE                  | N/A       | XSD minOccurs='0'
NULL                  | NULL                      | N/A       | (element not present in instance document)
MVD                   | MVD                       | N/A       | XSD maxOccurs='unbounded'
 
Whew!  'nuff said.
 

 

Bill Burcham
Sr. Software Architect, Standards and Applied Technology
Sterling Commerce, Inc.
469.524.2164
bill_burcham@stercomm.com

 
Working draft of Chapter 2 of Joe Celko's SQL For Smarties.

02. Normalization

The Relational Model and the Normal Forms of the Relational Model were first 
defined by Dr. E. F. Codd (Codd 1970), then extended by other writers after 
him.  He invented the term "normalized relations" by borrowing from the 
political jargon of the day.  A branch of mathematics called relations deals 
with mappings among sets defined by predicate calculus from formal logic.  
Just as in an algebraic equation, there are many forms of the same relational 
statement, but the "normal forms" of relations are certain formally defined 
desirable constructions.  The goal of normal forms is to avoid certain data 
anomalies that can occur in unnormalized tables.  Data anomalies are easier 
to explain with an example, but first please be patient while I define some 
terms.

A predicate is a statement of the form A(X), which means that X has the 
property A.  For example, "John is from Indiana" is a predicate statement; 
here, "John" is the subject and "is from Indiana" is the predicate.  A 
relation is a predicate with two or more subjects.  "John and Bob are 
brothers" is an example of a relation.  The common way of visualizing a set 
of relational statements is as a table where the columns are attributes of 
the relation and each row is a specific relational statement.

When Dr. Codd 
defined the relational model, he gave 12 rules for the visualization of the 
relation as a table: 

0.  (Yes, there is a rule zero.) For a system to qualify as a relational 
database management system, that system must use its relational facilities 
(exclusively) to manage the database.  SQL is not so pure on this rule, since 
you can often do procedural things to the data.  

1.  The information rule: This simply requires all information in the 
database to be represented in one and only one way, namely by values in 
column positions within rows of tables.  SQL is good here.  

2.  The guaranteed access rule: This rule is essentially a restatement of 
the fundamental requirement for primary keys.  It states that every 
individual scalar value in the database must be logically addressable by 
specifying the name of the containing table, the name of the containing 
column, and the primary key value of the containing row.  SQL follows this 
rule for tables that have a primary key, but does not require a table to have 
a key at all.  

3.  Systematic treatment of NULL values: The DBMS is required to support a 
representation of missing information and inapplicable information that is 
systematic, distinct from all regular values, and independent of datatype.  
It is also implied that such representations must be manipulated by the DBMS 
in a systematic way.  SQL has a NULL that is used for both missing 
information and inapplicable information, rather than having two separate 
tokens as Dr. Codd wished.  

4.  Active online catalog based on the relational model: The system is 
required to support an online, inline, relational catalog that is accessible 
to authorized users by means of their regular query language.  SQL does 
this.  

5.  The comprehensive data sublanguage rule: The system must support at least 
one relational language that (a) has a linear syntax, (b) can be used both 
interactively and within application programs, and (c) supports data 
definition operations (including view definitions), data manipulation 
operations (update as well as retrieval), security and integrity constraints, 
and transaction management operations (begin, commit, and rollback).  SQL is 
pretty good on this point, since all of the operations Codd defined can be 
written in the DML (Data Manipulation Language).  

6.  The view updating rule: All views that are theoretically updatable must 
be updatable by the system.  SQL is weak here, and has elected to standardize 
on the safest case.  View updatability is a very complex problem.  

7.  High-level insert, update, and delete: The system must support 
set-at-a-time INSERT, UPDATE, and DELETE operators.  SQL does this.  

8.  Physical data independence: This is self-explanatory.  Any real product 
is going to have some physical dependence, but SQL is better than most 
programming languages on this point.  

9.  Logical data independence: This is self-explanatory.  SQL is quite good 
about this point.  

10.  Integrity independence: Integrity constraints must be specified 
separately from application programs and stored in the catalog.  It must be 
possible to change such constraints as and when appropriate without 
unnecessarily affecting existing applications.  SQL-92 has this.  

11.  Distribution independence: Existing applications should continue to 
operate successfully (a) when a distributed version of the DBMS is first 
introduced and (b) when existing distributed data is redistributed around the 
system.  We are just starting to get distributed versions of SQL, so it is a 
little early to say whether SQL will meet this criterion or not.  

12.  The non-subversion rule: If the system provides a low-level 
(record-at-a-time) interface, that interface cannot be used to subvert the 
system (e.g., bypassing a relational security or integrity constraint).  
SQL-92 is good about this one.  

Codd also specified 9 structural features, 3 integrity features, and 18 
manipulative features, all of which are required as well.  He later extended 
the list from 12 rules to 333 in the second version of the relational model.  
This section is getting too long and you can look them up for yourself.  

Normal forms are an attempt to make sure that you do not destroy true data or 
create false data in your database.  One of the ways of avoiding errors is to 
represent a fact only once in the database, since if a fact appears more than 
once, one of the instances of it is likely to be in error -- a man with two 
watches can never be sure what time it is.  

This process of table design is called normalization.  It is not mysterious, 
but it can get complex.  You can buy CASE tools to help you do it, but you 
should know a bit about the theory before you use such a tool.  

02.1 Functional and Multivalued Dependencies

A normal form is a way of classifying a table based on the functional 
dependencies (FDs for short) in it.  A functional dependency means that if I 
know the value of one attribute, I can always determine the value of another. 
 The notation used in relational theory is an arrow between the two 
attributes, for example A -> B, which can be read in English as "A determines 
B".  If I know your employee number, I can determine your name; if I know a 
part number, I can determine the weight and color of the part; and so forth.  

A multivalued dependency (MVD) means that if I know the value of one 
attribute, I can always determine the values of a set of another attribute.  
The notation used in relational theory is a double-headed arrow between the 
two attributes, for instance A ->> B, which can be read in English as "A 
determines many Bs".  If I know a teacher's name, I can determine a list of 
her students; if I know a part number, I can determine the part numbers of 
its components; and so forth.  

02.2 First Normal Form (1NF)

Consider a requirement to maintain data about class schedules.  We are 
required to keep the course, section, department, time, room, professor, 
student, major, and grade.  Suppose that we initially set up a Pascal file 
with records that look like this: 

Classes = RECORD 
          course: ARRAY [1:7] OF CHAR;
         section: CHAR;
          time: INTEGER;
          room: INTEGER;
        roomsize: INTEGER;
       professor: ARRAY [1:25] OF CHAR; 
      department: ARRAY [1:10] OF CHAR;   
       Students : ARRAY [1:classsize]
                  OF RECORD
                     student ARRAY [1:25] OF CHAR; 
                     major ARRAY [1:10] OF CHAR;
                     grade CHAR;
                     END;
          END;

This table is not in the most basic normal form of relational databases.  
First Normal Form (1NF) means that the table has no repeating groups.  That 
is, every column is a scalar (or atomic) value, not an array or a list or 
anything with its own structure.  In SQL, it is impossible not to be in 1NF 
unless the vendor has added array or other extensions to the language.  The 
Pascal record could be "flattened out" in SQL to look like this: 

CREATE TABLE Classes 
(course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 time INTEGER NOT NULL,
 room INTEGER NOT NULL,
 roomsize INTEGER NOT NULL,
 professor CHAR(25) NOT NULL,
 department CHAR(10) NOT NULL,   
 student CHAR (25) NOT NULL, 
 major CHAR(10) NOT NULL,
 grade CHAR(1) NOT NULL);

This table is acceptable to SQL.  In fact, we can locate a row in the table 
with a combination of (course, section, student), so we have a key.  But what 
we are doing is hiding the Students record array, which has not changed its 
nature by being flattened.  There are problems.  

If Professor Jones of the math department dies, we delete all his rows from 
the Classes table.  This also deletes the information that all his students 
were taking a math class and maybe not all of them wanted to drop out of the 
class just yet.  I am deleting more than one fact from the database.  This is 
called a deletion anomaly.  

If student Wilson decides to change one of his math classes, formerly taught 
by Professor Jones, to English, we will show Professor Jones as an instructor 
in both the math and the English departments.  I could not change a simple 
fact by itself.  This creates false information, and is called an update 
anomaly.  

If the school decides to start a new department, which has no students yet, 
we cannot put in the data about the professors we just hired until we have 
classroom and student data to fill out a row.  I cannot insert a simple fact 
by itself.  This is called an insertion anomaly.  
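
To make the three anomalies concrete, here is a sketch of statements against 
the flattened Classes table above; the specific names and values ('MATH101', 
'ENG101', and so on) are invented for illustration.

-- Deletion anomaly: removing Professor Jones also removes the fact
-- that his students were enrolled in his math classes.
DELETE FROM Classes WHERE professor = 'Jones';

-- Update anomaly: if Wilson swaps one math class for an English one and we
-- update only his row, Professor Jones now appears to teach in two departments.
UPDATE Classes
   SET course = 'ENG101', department = 'English'
 WHERE student = 'Wilson' AND course = 'MATH101';

-- Insertion anomaly: a new department with no students yet cannot be recorded,
-- because the NOT NULL student, major, and grade columns have nothing to hold.
INSERT INTO Classes (course, section, time, room, roomsize, professor, department)
VALUES ('BIO101', 'A', 900, 101, 30, 'Brown', 'Biology');  -- rejected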

There are more problems in this table, but you see the point.  Yes, there are 
some ways to get around these problems without changing the tables.  We could 
permit NULLs in the table.  We could write routines to check the table for 
false data.  These are tricks that will only get worse as the data and the 
relationships become more complex.  The solution is to break the table up 
into other tables, each of which represents one relationship or simple fact.  

02.2.1 Note on Repeated Groups 

The definition of 1NF is that the table has no repeating groups and that all 
columns are scalar.  This means no arrays, linked lists, tables within 
tables, or record structures, like those you would find in other programming 
languages.  As I have already said, this is very easy to avoid in SQL, since 
arrays and structured data are simply not supported.  

The way you "fake it" in SQL is to use a group of columns which all the 
members of the group have the same semantic value; that is, they represent 
the same attribute in the table.  Consider the table of an employee and his 
children: 

CREATE TABLE Employees
(empno INTEGER NOT NULL, 
 empname CHAR(30) NOT NULL, 
 ...
 child1 CHAR(30), birthday1 DATE, sex1 CHAR(1), 
 child2 CHAR(30), birthday2 DATE, sex2 CHAR(2), 
 child3 CHAR(30), birthday3 DATE, sex3 CHAR(1), 
 child4 CHAR(30), birthday4 DATE, sex4 CHAR(1));

This looks like the layouts of many existing file system records in Cobol and 
other 3GL languages.  The birthday and sex information for each child is part 
of a repeated group and therefore violates 1NF.  This is faking a 
four-element array in SQL!  Most books simply stop at this point and never 
bother to explain why this is good or bad; we will go into detail in chapter 
23, on arrays.  

Suppose I have a table with the quantity of a product sold in each month of a 
particular year and I originally built the table to look like this:

 CREATE TABLE Abnormal
 (product CHAR(10) NOT NULL PRIMARY KEY,
  bin_01 INTEGER,
  bin_02 INTEGER,
  ...
  bin_12 INTEGER);

and I wanted to flatten it out into a more normalized form, like this:

 CREATE TABLE Normal
 (product CHAR(10) NOT NULL,
  bin_nbr INTEGER NOT NULL,
  qty INTEGER NOT NULL,
  PRIMARY KEY (product, bin_nbr));

I can use the statement

INSERT INTO Normal 
SELECT product, 1, bin_01 
  FROM Abnormal 
 WHERE bin_01 IS NOT NULL
UNION ALL
SELECT product, 2, bin_02 
  FROM Abnormal 
 WHERE bin_02 IS NOT NULL
 ...
UNION ALL
SELECT product, 12, bin_12 
  FROM Abnormal 
 WHERE bin_12 IS NOT NULL;

While a UNION ALL query is usually slow, this has to be run only once to load 
the normalized table and then the original table can be dropped.  

02.3 Second Normal Form (2NF)

A table is in Second Normal Form (2NF) if it has no partial key dependencies. 
 That is, if X and Y are columns and X is a key, then for any Z that is a 
proper subset of X, it cannot be the case that Z -> Y.  Informally, the table 
is in 1NF and it has a key that determines all non-key attributes in the 
table.  

In the example, our users tell us that knowing the student and course is 
sufficient to determine the section (since students cannot sign up for more 
than one section of the same course) and the grade.  This is the same as 
saying that (student, course) -> (section, grade).  

After more analysis, we also discover from our users that (student -> major) 
-- students have only one major.  Since student is part of the (student, 
course) key, we have a partial key dependency!  This leads us to the 
following decomposition: 

CREATE TABLE Classes 
(course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 time INTEGER NOT NULL,
 room INTEGER NOT NULL,
 roomsize INTEGER NOT NULL,
 professor CHAR(25) NOT NULL,
PRIMARY KEY (course, section));

CREATE TABLE Enrollment 
(student CHAR (25) NOT NULL, 
 course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 grade CHAR(1) NOT NULL,
PRIMARY KEY (student, course));

CREATE TABLE Students 
(student CHAR(25) NOT NULL PRIMARY KEY, 
 major CHAR(10) NOT NULL);

At this point, we are in 2NF.  Every attribute depends on the entire key in 
its table.  Now if a student changes majors, it can be done in one place.  
Furthermore, a student cannot sign up for different sections of the same 
class, because we have changed the key of Enrollment.  Unfortunately, we 
still have problems.

Notice that while roomsize depends on the entire key of Classes, it also 
depends on room.  If the room is changed for a course and section, we may 
also have to change the roomsize, and if the room is modified (we knock down 
a wall), we may have to change roomsize in several rows in Classes for that 
room.

02.4 Third Normal Form (3NF)

Another normal form can address these problems.  A table is in Third Normal 
Form (3NF) if for all X -> Y, where X and Y are columns of a table, X is a 
key or Y is part of a candidate key.  (A candidate key is a unique set of 
columns that identify each row in a table; you cannot remove a column from 
the candidate key without destroying its uniqueness.) This implies that the 
table is in 2NF, since a partial key dependency is a type of transitive 
dependency.  Informally, all the non-key columns are determined by the key, 
the whole key, and nothing but the key.  

The usual way that 3NF is explained is that there are no transitive 
dependencies.  A transitive dependency is a situation where we have a table 
with columns (A, B, C) and (A -> B) and (B -> C), so we know that (A -> C).  
In our case, the situation is that (course, section) -> room and room -> 
roomsize.  This is not a simple transitive dependency, since only part of a 
key is involved, but the principle still holds.  To get our example into 3NF 
and fix the problem with the roomsize column, we make the following 
decomposition: 

CREATE TABLE Rooms 
(room INTEGER NOT NULL PRIMARY KEY, 
 roomsize INTEGER NOT NULL);

CREATE TABLE Classes 
(course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 time INTEGER NOT NULL,
 room INTEGER NOT NULL,
 PRIMARY KEY (course, section));

CREATE TABLE Enrollment 
(student CHAR (25) NOT NULL, 
 course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 grade CHAR(1) NOT NULL,
 PRIMARY KEY (student, course));

CREATE TABLE Students 
(student CHAR(25) NOT NULL PRIMARY KEY, 
 major CHAR(10) NOT NULL);

A common misunderstanding about relational theory is that 3NF has no 
transitive dependencies.  As indicated above, if X -> Y, X does not have to 
be a key if Y is part of a candidate key.  We still have a transitive 
dependency in the example -- (room, time) -> (course, section) -- but since 
the right side of the dependency is a key, it is technically in 3NF.  The 
unreasonable behavior that this table structure still allows is that several 
courses can be assigned to the same room at the same time.  

Another form of transitive dependency is a computed column.  For example:

CREATE TABLE Stuff
(width INTEGER NOT NULL,
 length INTEGER NOT NULL,
 height INTEGER NOT NULL,
 volume INTEGER NOT NULL 
         CHECK (width * length * height = volume),
 PRIMARY KEY (width, length, height));

The volume column is determined by the other three columns, so any change to 
one of the three columns will require a change to the volume column.  You can 
use a VIEW to create computed columns.  
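
A minimal sketch of that approach: drop the stored volume column and expose 
it through a view instead, using the same columns as the example above.

CREATE TABLE Stuff
(width INTEGER NOT NULL,
 length INTEGER NOT NULL,
 height INTEGER NOT NULL,
 PRIMARY KEY (width, length, height));

CREATE VIEW StuffWithVolume (width, length, height, volume)
AS SELECT width, length, height, (width * length * height)
     FROM Stuff;

Any query that needs the volume reads the view, and there is no stored value 
to drift out of step with the other three columns.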

02.5 Case Tools for Normalization 

Third Normal Form is very popular with CASE tools and most of them can 
generate a schema where all of the tables are in 3NF.  They obtain the FDs 
from an E-R (entity-relationship) diagram or from a statistical analysis of 
the existing data, then put them together into tables and check for normal 
forms.  It is often possible to derive more than one 3NF schema from a set of 
FDs.  A good CASE tool will find more than one of them, and ideally will find 
the highest possible normal form schemas too.  Yes, there are still more 
normal forms we have not mentioned yet.  Nobody said this would be easy.Some 
people will argue that it is all right to "denormalize" a schema for reasons 
of efficiency.  For example, to get a list of all the students and their 
majors in a given class, we must JOIN Enrollment and Students.  The case for 
leaving the solution normalized is based on reducing the programming 
requirement.  If we denormalize, either we are not enforcing some of the 
user-required constraints or we have to write additional code to enforce the 
constraints.  
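
For instance, the join in question looks something like this (the course and 
section values are just placeholders):

SELECT E.student, S.major
  FROM Enrollment AS E, Students AS S
 WHERE E.student = S.student
   AND E.course = 'CS-101'
   AND E.section = 'A';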

Not enforcing all the rules is obviously bad.  If we choose to write 
additional code, we have to be sure to duplicate it in all of the programs 
that work with the DBMS.  Normalization reduces programming.  

02.6 Boyce-Codd Normal Form (BCNF)

A table is in BCNF when for all nontrivial FDs (X -> A), X is a superkey of 
the table.  A superkey is a set of columns that uniquely identifies each row 
in a table; it may contain extra columns that could be removed and still 
leave a key.  Informally, a superkey is carrying extra weight.

BCNF is the normal form that actually removes all transitive dependencies.  A 
table is in BCNF if for all (X -> Y), X is a key -- period.  We can go to 
this normal form just by adding another key with UNIQUE (room, time) 
constraint clause to the table Classes.  There are some other interesting and 
useful "higher" normal forms, but they are outside of the scope of this 
discussion.  In our example, we have removed all of the important anomalies 
with BCNF.

Third Normal Form was concerned with the relationship between key 
and non-key columns.  However, a column can often play both roles.  Consider 
a table for computing salesmen's bonus gifts that has for each salesman his 
base salary, the number of sales points he has won in a contest, and the 
bonus gift awarded for that combination of salary range and points.  For 
example, we might give a fountain pen to a beginning salesman with a base pay 
rate somewhere between $15,000 and $20,000 and 100 sales points, but give a 
car to a master salesman, whose salary is between $30,000 and $60,000 and who 
has 200 points.  The functional dependencies are, therefore,

(paystep, points) -> gift
gift -> points

Let's start with a table that has all the data in it and normalize it.  

Gifts
salary    points     gift
==========================
$15,000    100    Pencil 
$17,000    100    Pen 
$30,000    200    Car
$31,000    200    Car
$32,000    200    Car

This schema is in 3NF, but it has problems.  You cannot insert a new gift 
into our offerings and points unless we have a salary to go with it.  If you 
remove any sales points, you lose information about the gifts and salaries 
(e.g., only people in the $30,000 range can win a car).  And, finally, a 
change in the gifts for a particular point score would have to affect all the 
rows within the same pay step.  This table needs to be broken apart into two 
tables: 

PayGifts
salary    gift
===============
$15,000    Pencil
$17,000    Pen 
$30,000    Car
$31,000    Car
$32,000    Car

GiftsPoints
gift    points 
===============
Pencil    100
Pen    100
Car     200

02.7 Fourth Normal Form (4NF)

Fourth Normal Form (4NF) makes use of multivalued dependencies.  The problem 
it solves is that the table has too many of them.  For example, consider a 
table of departments, their projects, and the parts they stock.  The MVD's in 
the table would be:

department ->> projects

department ->> parts

Assume that department d1 works on jobs j1 and j2 with parts p1 and p2; that 
department d2 works on jobs j3, j4, and j5 with parts p2 and p4; and that 
department d3 works on job j2 only with parts p5 and p6.  The table would 
look like this:

department job part
=============
d1   j1   p1
d1   j1   p2
d1   j2   p1
d1   j2   p2
d2   j3   p2
d2   j3   p4
d2   j4   p2
d2   j4   p4
d2   j5   p2
d2   j5   p4
d3   j2   p5
d3   j2   p6

If you want to add a part to a department, you must create more than one new 
row.  

Likewise, to remove a part or a job from a row can destroy information.  
Updating a part or job name will also require multiple rows to be changed.  

The solution is to split this table into two tables, one with (department, 
projects) in it and one with (department, parts) in it.  The definition of 
4NF is that we have no more than one MVD in a table.  If a table is in 4NF, 
it is also in BCNF.  
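
A sketch of that split, in the shorthand DDL used later in this chapter (the 
table names are just illustrative):

CREATE TABLE DeptJobs 
(department, job, 
 PRIMARY KEY (department, job));

CREATE TABLE DeptParts 
(department, part, 
 PRIMARY KEY (department, part));

Each table now holds only one MVD, and adding a part to a department touches 
a single row.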

02.8 Fifth Normal Form (5NF)

Fifth Normal Form (5NF), also called the Join-Projection Normal Form or the 
Projection-Join Normal Form, is based on the idea of a lossless JOIN or the 
lack of a join-projection anomaly.  This problem occurs when you have an 
n-way relationship, where n > 2.  A quick check for 5NF is to see if the 
table is in 3NF and all the candidate keys are single columns.

As an example 
of the problems solved by 5NF, consider a table of house notes that records 
the buyer, the seller, and the lender: 

HouseNotes
buyer     seller    lender
==================================
Smith     Jones     National Bank
Smith     Wilson    Home Bank
Nelson    Jones     Home Bank

This table is a three-way relationship, but because many CASE tools allow 
only binary relationships it might have to be expressed in an E-R diagram as 
three binary relationships, which would generate CREATE TABLE statements 
leading to these tables: 

BuyerLender
buyer     lender
=============================
Smith      National Bank
Smith      Home Bank
Nelson     Home Bank

SellerLender
seller    lender
=======================
Jones     National Bank
Wilson    Home Bank
Jones     Home Bank

BuyerSeller
buyer   seller
================
Smith   Jones 
Smith   Wilson 
Nelson  Jones 

The trouble is that when you try to assemble the original information by 
joining pairs of these three tables together, thus: 

SELECT BS.buyer, SL.seller, BL.lender    
  FROM BuyerLender AS BL, 
       SellerLender AS SL,
       BuyerSeller AS BS
 WHERE BL.buyer = BS.buyer
   AND BL.lender = SL.lender
   AND SL.seller = BS.seller;

you will recreate all the valid rows in the original table, such as ('Smith', 
'Jones', 'National Bank'), but there will also be false rows, such as 
('Smith', 'Jones', 'Home Bank'), which were not part of the original table.  
This is called a join-projection anomaly.  

There are also strong JPNF and overstrong JPNF, which make use of JOIN 
dependencies (JD for short).  Unfortunately, there is no systematic way to 
find a JPNF or 4NF schema, because the problem is known to be NP complete.  

02.9 Domain-Key Normal Form (DKNF)

A functional dependency has a defined system of axioms that can be used in 
normalization problems.  These six axioms, known as Armstrong's axioms, are 
given below: 

Reflexive:  X -> X
Augmentation: if X -> Y then XZ -> Y
Union: if (X -> Y and X -> Z) then X -> YZ
Decomposition: if X -> Y and Z a subset of Y, then X -> Z
Transitivity: if (X -> Y and Y -> Z) then X -> Z
Pseudotransitivity: if (X -> Y and YZ -> W) then XZ -> W

They make good sense if you just look at them, which is something we like in 
a set of axioms.  In the real world, the FDs are the business rules we are 
trying to model.

In the normalization algorithm for 3NF (developed by P. A. Bernstein, 1976) 
we use the axioms to get rid of redundant FD's.  For example, if we are given:

A -> B
A -> C
B -> C
DB -> E
DAF -> E

A -> C is redundant because it can be derived from A -> B and B -> C with 
transitivity.  Also DAF -> E is redundant because it can be derived from DB 
-> E and A -> B with transitivity (which gives us DA -> E) and augmentation 
(which then allows DAF -> E).  What we would like to find is the smallest set 
of FDs from which we can generate all of the given rules.  This is called a 
non-redundant cover.  For the FD's above, one cover would be:

A -> B
B -> C
DB -> E

Once we do this, Bernstein shows that we can just create a table for each of 
the FD's in the cover, where A, B, and DB are the respective keys (a sketch 
follows).  We have taken it easy so far, but now it's time for a challenge.  
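
In the shorthand DDL used later in this chapter (data types left out, 
attributes written in lowercase), that synthesis would look like this:

CREATE TABLE R1 (a, b, PRIMARY KEY (a));        -- A -> B
CREATE TABLE R2 (b, c, PRIMARY KEY (b));        -- B -> C
CREATE TABLE R3 (d, b, e, PRIMARY KEY (d, b));  -- DB -> E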

As an example of a problem with more than one Third Normal Form (3NF) schema, 
here is one that was used by DBStar Corporation (San Francisco, CA) as an 
example in the demonstration that comes with their CASE tool.

We are given an imaginary and simplified airline which has a database for 
scheduling flights and pilots.  Most of the relationships are obvious things. 
 Flights have only one departure time and one destination.  They can get a 
different pilot and can be assigned to a different gate each day of the week. 
 The functional dependencies for the database are given below:

1) flight -> destination
2) flight -> hour
3) (day, flight) -> gate
4) (day, flight) -> pilot
5) (day, hour, pilot) -> gate
6) (day, hour, pilot) -> flight
7) (day, hour, pilot) -> destination
8) (day, hour, gate) -> pilot
9) (day, hour, gate) -> flight
10) (day, hour, gate) -> destination

Your problem is to find 3NF database schemas in these FD's.  You have to be 
careful!   You have to have all of the columns, obviously, but your answer 
could be in 3NF and still ignore some of the FD's.  For example, this will 
not work:

CREATE TABLE PlannedSchedule
(flight, destination, hour, PRIMARY KEY (flight));

CREATE TABLE ActualSchedule
(day, flight, gate, pilot, PRIMARY KEY (day, flight));

If we apply the Union axiom to some of the FD's, we get:

(day, hour, gate) -> (destination, flight, pilot)
(day, hour, pilot) -> (destination, flight, gate)

This says that the user has required that if we are given a day, an hour, and 
a gate we should be able to determine a unique flight for that day, hour, and 
gate.  We should also be able to determine a unique flight given a day, hour, 
and pilot.

Given the PlannedSchedule and ActualSchedule  tables, you cannot produce 
views where either of the two constraints we just mentioned are enforced.  If 
the query "What flight does pilot X have on day Y and hour Z?" gives you more 
than one answer, it violates the FD's.  Here is an example of data which is 
allowable in this proposed schema but undesirable given our 
constraints:

PlannedSchedule
 flight  hour  destination
 =========================
   118  17:00 Dallas
   123  13:00 Omaha
   155  17:00 Los Angeles
   171  13:00 New York
   666  13:00 Dis

ActualSchedule
 day flight pilot   gate
========================
 Wed    118  Tom     12A
 Wed    155  Tom     13B
 Wed    171  Tom     12A
 Thu    123  John    12A
 Thu    155  John    12A
 Thu    171  John    13B

The constraints mean that we should be able to find a unique answer to each of 
the following questions and not lose any information when inserting and 
deleting data.  

1) Which flight is leaving from gate 12A on Thursdays at 13:00 Hrs?  This 
looks fine until you realize that you don't know about flight 666, which was 
not required to have anything about its day or pilot in the ActualSchedule 
table.  And likewise, I can add a flight to the ActualSchedule table that has 
no information in the PlannedSchedule table.  

2) Which pilot is assigned to the flight that leaves gate 12A on Thursdays at 
13:00 Hrs?  This has the same problem as before.  

3) What is the destination of the flight in query 1 and 2?  This has the same 
problem as before.  

4) What gate is John leaving from on Thursdays at 13:00 Hrs?

5) Where is Tom flying to on Wednesdays at 17:00 Hrs?

6) What flight is assigned to Tom on Wednesdays at 17:00 Hrs?
 

It might help if we gave an example of how one of the FD's in the problem can 
be derived using the axioms of FD calculus, just like you would do a geometry 
proof:

Given:
1) (day, hour, gate) -> pilot
2) (day, hour, pilot) -> flight

prove that:
(day, hour, gate) -> flight.

3) (day, hour) -> (day, hour);          Reflexive
4) (day, hour, gate) -> (day, hour);    Augmentation on 3
5) (day, hour, gate) -> (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) -> flight;  Transitive 2 and 5
Q.E.D

The answer is to start by attempting to derive each of the functional 
dependencies from the rest of the set.  What we get is several short proofs, 
each requiring different "given" functional dependencies in order to get to 
the derived FD.  

Here is a list of each of the proofs used to derive the ten fragmented FD's 
in the problem.  With each derivation we include every derivation step and 
the legal FD calculus operation that allows us to make that step.  An 
additional operation that we include here which was not included in the 
axioms we listed earlier is left reduction.  Left reduction says that if XX 
-> Y then X -> Y.  The reason it was not included is that this is actually a 
theorem and not one of the basic axioms (side problem: can you derive left 
reduction?).

Prove: (day, hour, pilot) -> gate
a) day -> day;               Reflexive
b) (day, hour, pilot) -> day;        Augmentation (a)
c) (day, hour, pilot) -> (day, flight);   Union (6, b)
d) (day, hour, pilot) -> gate;        Transitive (c, 3)
Q.E.D.

Prove: (day, hour, gate) -> pilot
a) day -> day;            Reflexive
b) day, hour, gate -> day;      Augmentation (a)
c) day, hour, gate -> (day, flight);  Union (9, b)
d) day, hour, gate -> pilot;      Transitive (c, 4)
Q.E.D.

Prove: (day, flight) -> gate
a) (day, flight, pilot) -> gate;   Pseudotransitivity (2, 5)
b) (day, flight, day, flight) -> gate; Pseudotransitivity (a, 4)
c) (day, flight) -> gate;      Left reduction (b)
Q.E.D.

Prove:  (day, flight) -> pilot
a) (day, flight, gate) -> pilot;   Pseudotransitivity (2, 8)
b) (day, flight, day, flight) -> pilot; Pseudotransitivity (a, 3)
c) (day, flight) -> pilot;       Left reduction (b)
Q.E.D.

Prove: (day, hour, gate) -> flight
a) (day, hour) -> (day, hour);   Reflexivity
b) (day, hour, gate) -> (day, hour);    Augmentation (a)
c) (day, hour, gate) -> (day, hour, pilot);   Union (b, 8)
d) (day, hour, gate) -> flight;        Transitivity (c, 6)
Q.E.D.

Prove: (day, hour, pilot) -> flight
a) (day, hour) -> (day, hour);  Reflexivity
b) (day, hour, pilot) -> (day, hour); Augmentation (a)
c) (day, hour, pilot) -> day, hour, gate;  Union (b, 5)
d) (day, hour, pilot) -> flight;      Transitivity (c, 9)
Q.E.D.

Prove: (day, hour, gate) -> destination
a) (day, hour, gate) -> destination; Transitivity (9, 1)
Q.E.D.

Prove: (day, hour, pilot) -> destination
a) (day, hour, pilot) -> destination;    Transitivity (6, 1)
Q.E.D.

Now that we've shown you how to derive eight of the ten FD's from other FD's, 
you can try mixing and matching the FD's into sets so that each set meets the 
following criteria:

1) Each attribute must be represented on either the left or right side of at 
least one FD in the set.  

2) If a given FD is included in the set then all the FD's needed to derive it 
cannot also be included.  

3) If a given FD is excluded from the set then the FD's used to derive it 
must be included.

This produces a set of "non-redundant covers", which can be found with trial 
and error and common sense.  For example, if we excluded (day, hour, gate) -> 
flight we must then include (day, hour, gate) -> pilot and vice versa because 
each are used in the other's derivation.  If you want to be sure your search 
was exhaustive, however, you may want to apply a more mechanical method, 
which is what the CASE tools do for you.

The algorithm for accomplishing this task is basically to generate all the 
combinations of sets of the FD's.  (flight -> destination) and (flight -> 
hour) are excluded in the combination generation because they cannot be 
derived.  This gives us (2^8) or 256 combinations of FD's.  Each combination 
is then tested against the criteria.

Fortunately, a simple spreadsheet does all the tedious work.  In this problem 
criterion one eliminates only 15 sets.  Criterion two eliminates 152 sets 
and criterion three drops another 67.  This leaves us with 22 possible covers, 
five of which are the answers we are looking for (we will explain the other 
17 later).  These five non-redundant covers are:

Set I:
flight -> destination
flight -> hour
(day, hour, gate) -> flight
(day, hour, gate) -> pilot
(day, hour, pilot) -> gate

Set II:
flight -> destination
flight -> hour
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
(day, hour, pilot) -> gate

Set III:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, flight) -> pilot
(day, hour, gate) -> flight

Set IV:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight

Set V:
flight -> destination
flight -> hour
(day, flight) -> pilot
(day, hour, gate) -> flight
(day, hour, pilot) -> gate
(day, hour, pilot) -> flight

At this point we perform unions on FD's with the same left hand side and make 
tables for each grouping with the left hand side as a key.  We can also 
eliminate symmetrical FD's (defined as X -> Y and Y -> X, and written with a 
two headed arrow, X <-> Y) by collapsing them into the same table.

These five non-redundant covers convert into the following five sets of 3NF 
relational schemas.  They are given in a shorthand SQL DDL (Data Declaration 
Language) without data type declarations.

Solution 1:
CREATE TABLE R1 (flight, destination, hour, 
 PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot, 
  PRIMARY KEY (day, hour, gate));

Solution 2:
CREATE TABLE R1 (flight, destination, hour, 
  PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot, 
  PRIMARY KEY (day, hour, pilot));

Solution 3:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, pilot, 
  PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight, 
  PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, flight, 
  PRIMARY KEY (day, hour, pilot));

Solution 4:
CREATE TABLE R1 (flight, destination, hour, 
  PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, pilot, 
  PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, flight, 
  PRIMARY KEY (day, hour, pilot));

Solution 5:
CREATE TABLE R1 (flight, destination, hour, 
  PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight, 
  PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, gate, 
  PRIMARY KEY (day, hour, pilot));

Once you match up these solutions with the minimal covers that generated 
them, you will probably notice that the first two solutions have transitive 
dependencies.  But they are still 3NF!   This is a point not well understood 
by most analysts.  A relation is in 3NF if, for each FD X -> Y, X is a 
superkey OR Y is part of a candidate key.  The first two solutions meet this 
criterion.

You may also notice that there are no additional candidate keys defined in 
any of the tables.  Adding them would have made sense in the first two 
solutions, but this was not done (which is why they are in 3NF and not BCNF).  
You will find this 
algorithm used in  CASE tool software because SQL-89 only allowed you to 
define PRIMARY KEY constraints.  SQL-92 allows you to define a UNIQUE 
constraint on one or more columns in addition.  Most implementations of SQL 
also allow the user to define unique indexes on a subset of the columns.

All of the five solutions are 3NF, but the first two solutions leave 
out two FD's.  It appears that solutions without all the constraints are 
considered valid by this particular automated normalizer.  These tables could 
have defined the required candidate keys with UNIQUE constraints, however.  
The normalizer used to get these solutions may leave out some of the 
constraints, but still generate 3NF schemas.  Watch out!   It is assuming 
that you can handle this outside the schema or are willing to convert the 
FD's to some sort of constraints.
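
For instance, the second table of Solution 1 could carry the dropped candidate 
key explicitly, in the same shorthand DDL:

CREATE TABLE R2 (day, hour, gate, flight, pilot, 
  PRIMARY KEY (day, hour, gate),
  UNIQUE (day, hour, pilot));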

If we are allowed to drop FD's (as this particular normalizer does) then 
there are actually 22 solutions (most of which are not generated by this 
normalizer).  These solutions can be found by dropping attributes or whole 
tables from the solutions above (note that each attribute must still be 
represented in at least one table).  Some of the other 17 solutions can be 
generated by:

1) Dropping either or both of the last two tables in the last three solutions

2) Dropping combinations of gate, flight, and pilot where they are not keys 
(remember to keep at least one non-key in each table and make sure if an 
attribute is dropped from one table it is still represented somewhere else).

3) Add UNIQUE constraints or indexes to the first two solutions.

Did you notice that even the last three of the five given solutions to 
the problem still allow some anomalous states? Consider this: In solution 
three the last two tables could have day and flight combinations that are not 
part of the valid day and flight list as defined in the second table.  The 
other solutions also have integrity problems like this.

There is a normal form that fixes this for us: Domain/Key Normal Form (DKNF) 
defined by Ronald Fagin in 1981.  There is not yet a general algorithm that 
will always generate the DKNF solution given a set of constraints.  We can, 
however, determine DKNF in many special cases.  Here is our DKNF solution to 
the problem:

Solution 6:

CREATE TABLE R1
(flight, hour, destination, 
 UNIQUE (flight, hour, destination),
 UNIQUE (flight, hour), 
 UNIQUE (flight)); 

CREATE TABLE R2
(day, flight, hour, gate, pilot,  
 UNIQUE (day, flight, gate), 
 UNIQUE (day, flight, pilot),
 UNIQUE (day, flight), 
 FOREIGN KEY (flight, hour) REFERENCES R1(flight, hour));

Notice that this is a case of normalization by dropping a table and adding 
UNIQUE constraints.  The candidate key (flight, hour) may not seem necessary 
with flight also defined as a candidate key in R1.  This is done so that the 
foreign key in R2 carries the (flight, hour) combination and not just flight. 
 This way, the second relation cannot contain an invalid (flight, hour) 
combination.

Once we add in the foreign keys to solutions 3, 4  and  5, are all of the 
anomalies removed? No, not entirely.  The only solution that removes all 
anomalies is still the DKNF solution.  The best way to enforce these 
constraints is to collapse all but the first table into one.  This way 
inconsistent gate, pilot, day, hour, flight combinations cannot exist.  This 
is because with only one table to hold such a combination we cannot have the 
problem of two such tables with many overlapping attributes disagreeing.  
This is what the DKNF solution accomplishes.

02.10 Practical Hints for Normalization

CASE tools implement formal methods for doing normalization.  In particular, 
E-R (Entity-Relationship) diagrams are very useful for this.  However, a few 
informal hints can help speed up the process and give you a good start.

Broadly speaking, tables represent either entities or relationships, which is 
why E-R diagrams work so well as a design tool.  The tables that represent 
entities should have a simple, immediate name suggested by their contents -- 
a table named Students has student data in it, not student data and bowling 
scores.  It is also a good idea to use plural or collective nouns as the 
names of such tables to remind you that a table is a set of entities; the 
rows are the single instances of them.

Tables which represent many-to-many relationships should be named by their 
contents and should be as minimal as possible.  For example, Students are 
related to Classes by a third (relationship) table for their attendance.  
These tables might represent a pure relationship or they might contain 
attributes which exist within the relationship, such as a grade for the class 
attended.  Since the only way to get a grade is to attend the class, the 
relationship is going to have a column for it, and will be named ReportCards 
or Grades.  

Avoid NULLs whenever possible.  If a table has too many NULLable columns, it 
is probably not normalized properly.  Try to use a NULL only for a value 
which is missing now, but which will be resolved later.  Even better, put 
missing values into the encoding schemes for that column, as discussed in the 
section of this book on encoding schemes.  
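
A small sketch of that last idea; the table, column, and code values here are 
invented for illustration:

CREATE TABLE Shipments
(shipment_nbr INTEGER NOT NULL PRIMARY KEY,
 carrier_code CHAR(2) NOT NULL
       CHECK (carrier_code IN ('UP', 'FX', 'US', 'NA', 'UK')));
-- 'NA' = not applicable and 'UK' = unknown at this time, instead of a NULL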

A normalized database will tend to have a lot of tables with a small number 
of columns per table.  Don't panic when you see that happen.  People who 
first worked with file systems (particularly on computers which used magnetic 
tape) tend to design one monster file for an application and do all the work 
against those records.  This made sense in the old days, since there was no 
reasonable way to JOIN a number of small files together without having the 
computer operator mount and dismount lots of different tapes.  The habit of 
designing this way carried over to disk systems, since the procedural 
programming languages were still the same.  

The same non-key attribute in more than one table is probably a normalization 
problem.  This is not a certainty, just a guideline.  The key that determines 
that attribute should be in only one table, and therefore the attribute 
should be with it.  

As a practical matter, you are apt to see the same attribute under different 
names and need to make the names uniform in the entire database.  The columns 
"date_of_birth", "birthdate", "birthday", and "dob" are very likely the same 
attribute of an employee.

02.11 Practical Hints for Denormalization

The subject of denormalization is a great way to get into religious wars.  At 
one extreme, you will find relational purists who think that the idea of not 
carrying a database design to at least 3NF is a crime against nature.  At the 
other extreme, you will find people who simply add and move columns all over 
the database with ALTER statements, never keeping the schema stable.  

The reason for denormalization is performance.  A fully normalized database 
requires a lot of JOINs to construct common VIEWs of data from its 
components.  JOINs are very costly in terms of time and computer resources, 
so "pre-constructing" the JOIN in a denormalized table can save quite a 
bit.  

Consider this actual problem which appeared on CompuServe's ORACLE forum.  A 
pharmaceutical company has an inventory table, and a table of the price 
changes which look like this:

CREATE TABLE Drugs
(drug_nbr INTEGER PRIMARY KEY, 
 drug_name CHAR(30) NOT NULL, 
 quantity INTEGER NOT NULL 
          CONSTRAINT positive_quantity
          CHECK(quantity > 0), 
 ...);

CREATE TABLE Prices
(drug_nbr INTEGER NOT NULL, 
 start_date DATE NOT NULL, 
 end_date DATE NOT NULL 
         CONSTRAINT started_before_ended
         CHECK(start_date <= end_date), 
 price DECIMAL(8, 2) NOT NULL, 
 PRIMARY KEY (drug_nbr, start_date));

Every order has to use the order date to find what the selling price was when 
the order was placed.  The current price will have a value of "eternity" (a 
dummy date set so high that it will not be reached like '9999-12-31').  The 
(end_date + 1 DAY) of one price will be equal to the start_date of the next 
price for the same drug.  
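
For example, every price lookup needs a join along these lines (the Orders 
table and its order_date and drug_nbr columns are assumptions for 
illustration):

SELECT O.order_nbr, D.drug_name, P.price
  FROM Orders AS O, Drugs AS D, Prices AS P
 WHERE O.drug_nbr = D.drug_nbr
   AND P.drug_nbr = D.drug_nbr
   AND O.order_date BETWEEN P.start_date AND P.end_date;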

While this is normalized, performance will stink.  Every report, invoice or 
query will have a JOIN between Drugs and Prices.  The trick here is to add 
more columns to the Drugs, like this:

CREATE TABLE Drugs
(drug_nbr INTEGER PRIMARY KEY, 
 drug_name CHAR(30) NOT NULL, 
 quantity INTEGER NOT NULL 
          CONSTRAINT positive_quantity
          CHECK(quantity > 0), 
 currentstartdate DATE NOT NULL, 
 currentenddate DATE NOT NULL, 
 CONSTRAINT current_start_before_ended 
 CHECK(currentstartdate <= currentenddate), 
 currentprice DECIMAL(8, 2) NOT NULL, 
 priorstartdate DATE NOT NULL, 
 priorenddate DATE NOT NULL,
 CONSTRAINT prior_start_before_ended 
 CHECK((priorstartdate <= priorenddate)
   AND (currentstartdate = priorenddate + INTERVAL 1 DAY)), 
 priorprice DECIMAL(8, 2) NOT NULL, 
 ...);

This covered over 95% of the orders in the actual company because very few 
orders are held up more than two price changes.  The odd exception was 
trapped by a procedural routine.  
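
With the widened Drugs table, the usual case becomes a single-table lookup, 
roughly like this (again assuming the hypothetical Orders table from the 
sketch above):

SELECT O.order_nbr, D.drug_name,
       CASE WHEN O.order_date BETWEEN D.currentstartdate AND D.currentenddate
            THEN D.currentprice
            WHEN O.order_date BETWEEN D.priorstartdate AND D.priorenddate
            THEN D.priorprice
            ELSE NULL  -- older than the last two prices; handled procedurally
       END AS price
  FROM Orders AS O, Drugs AS D
 WHERE O.drug_nbr = D.drug_nbr;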

Another good trick is to preserve the constraints which were implied by the 
original tables and move them into CHECK() clauses.  Consider two of the 
tables from the examples used at the start of this chapter:

CREATE TABLE Enrollment (student, course, section, grade, 
PRIMARY KEY (student, course));

CREATE TABLE Students (student, major, PRIMARY KEY (student));

If we want to combine them into a single denormalized table, we might start 
with:

CREATE TABLE StudentEnrollment 
(student, major, course, section, grade);

then add a constraint to assure that each student has exactly one major:

 CHECK (NOT EXISTS (SELECT student, COUNT(DISTINCT major)
                      FROM StudentEnrollment 
                     GROUP BY student
                    HAVING COUNT(DISTINCT major) <> 1));

and then more constraints to assure that a student is not signed up for the 
same course twice and so forth.  Yes, insertions will take time because of 
the extra checking that will be done, but this is the price you pay for speed in 
doing queries.  
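
One way to write that second constraint, in the same style as the CHECK() 
above (a sketch; a UNIQUE (student, course) constraint would do the same job 
more directly):

 CHECK (NOT EXISTS (SELECT student, course
                      FROM StudentEnrollment 
                     GROUP BY student, course
                    HAVING COUNT(*) > 1));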

02.11.1 Row sorting


Back on 2001 May 27, Fred Block posted a problem on the SQL Server newsgroup. 
 I will change the problem slightly, but the idea was that he had a table 
with five character string columns that had to be sorted alphabetically 
within each row.  This "flattened table" is a very common denormalization, 
which might involve months of the year as columns, or other things which are 
acting as repeating groups in violation of First Normal Form.  

Let's declare the table to look like this and dive into the problem.  

CREATE TABLE Foobar
(keycol INTEGER NOT NULL PRIMARY KEY,
 c1 VARCHAR(20) NOT NULL,
 c2 VARCHAR(20) NOT NULL,
 c3 VARCHAR(20) NOT NULL,
 c4 VARCHAR(20) NOT NULL,
 c5 VARCHAR(20) NOT NULL);

This means that we want this condition to hold:

CHECK ((c1 <= c2) AND  (c2 <= c3) 
        AND (c3 <= c4) AND (c4 <= c5))

Obviously, if he had added this constraint to the table in the first place, 
we would be fine.  Of course, that would have pushed the problem to the front 
end and I would not have a topic for this section.  
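
Declared up front, it would have looked something like this (the constraint 
name is illustrative):

ALTER TABLE Foobar
  ADD CONSTRAINT row_is_sorted
      CHECK ((c1 <= c2) AND (c2 <= c3)
             AND (c3 <= c4) AND (c4 <= c5));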

What was interesting was how everyone who read this newsgroup posting 
immediately envisioned a stored procedure that would take the five values, 
sort them and return them to their original row in the table.  The only way 
to make this approach work for the whole table was to write an update cursor 
and loop through all the rows of the table.  Itzik Ben-Gan posted a simple 
procedure that loaded the values into a temporary table, then pulled them out 
in sorted order, starting with the minimum value, using a loop.

Another trick is the Bose-Nelson sort ("A Sorting Problem" by R.C. Bose and 
R.J. Nelson; Journal of the ACM, vol 9 pages 282-296), which I had written 
about in DR. DOBB'S JOURNAL back in 1985.  This is a recursive procedure that 
takes an integer and then generates swap pairs for a vector of that size.  A 
swap pair is a pair of position numbers from 1 to (n) in the vector which 
need to be exchanged if they are out of order.  These swap pairs are also 
related to Sorting Networks in the literature (see THE ART OF COMPUTER 
PROGRAMMING by Donald Knuth, vol 3).  

You are probably thinking that this method is a bit weak because the results 
are only good for sorting a fixed number of items.  But a table only has a 
fixed number of columns, so that is not a problem in denormalized SQL.  

You can set up a sorting network that will sort five items, with the minimal 
number of exchanges, nine swaps, like this:

swap (c1, c2);
swap (c4, c5);
swap (c3, c5);
swap (c3, c4);
swap (c1, c4);
swap (c1, c3);
swap (c2, c5);
swap (c2, c4);
swap (c2, c3);

You might want to deal yourself a hand of five playing cards in one suit to 
see how it works.  Put the cards face down on the table and pick up the 
pairs, swapping them if required, then turn over the row to see that it is in 
sorted order when you are done.  

In theory, the minimum number of compare/exchange steps needed to sort (n) 
items is CEILING(log2(n!)), and as (n) increases, this approaches 
O(n*log2(n)).  The Computer Science majors will remember that "Big O" 
expression as the expected performance of the best sorting algorithms, such 
as Quicksort.  The Bose-Nelson method is very good for small values of (n).  
If (n < 9) then it is perfect, actually.  But as things get bigger, 
Bose-Nelson approaches O(n ^ 1.585).  In English, this method is good for a 
fixed-size list of 16 or fewer items and goes to hell after that.  
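
For example, with (n = 5) the floor is CEILING(log2(5!)) = CEILING(log2(120)) 
= 7 compare/exchange steps, so the nine swaps in the network above are not far 
from that theoretical minimum.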

You can write a version of the Bose-Nelson procedure which will output the 
SQL code for a given value of (n).  The obvious direct way to do a swap() is 
to write a chain of UPDATE statements.  Remember that in SQL, the SET clause 
assignments happen in parallel, so you can easily write a SET clause that 
exchanges the two items when they are out of order.  Using the above swap chain, 
we get this block of code:

BEGIN ATOMIC
-- swap (c1, c2);
UPDATE Foobar 
   SET c1 = c2, c2 = c1
 WHERE c1 > c2;

-- swap (c4, c5);
UPDATE Foobar
   SET c4 = c5, c5 = c4
 WHERE c4 > c5;

-- swap (c3, c5);
UPDATE Foobar
   SET c3 = c5, c5 = c3
 WHERE c3 > c5;

-- swap (c3, c4);
UPDATE Foobar
   SET c3 = c4, c4 = c3
 WHERE c3 > c4;

-- swap (c1, c4);
UPDATE Foobar
   SET c1 = c4, c4 = c1
 WHERE c1 > c4;

-- swap (c1, c3);
UPDATE Foobar
   SET c1 = c3, c3 = c1
 WHERE c1 > c3;

-- swap (c2, c5);
UPDATE Foobar
   SET c2 = c5, c5 = c2
 WHERE c2 > c5;

-- swap (c2, c4);
UPDATE Foobar
   SET c2 = c4, c4 = c2
 WHERE c2 > c4;

-- swap (c2, c3);
UPDATE Foobar
   SET c2 = c3, c3 = c2
 WHERE c2 > c3;
    
END;

This is fully portable, standard SQL code, and it can be machine generated.  
That parallelism in the SET clause is useful, though; it is worthwhile to 
combine some of the UPDATE statements.  You just have to be careful not to 
change the effective sequence of the swap operations. 

If you look at the first two UPDATE statements, you can see that they do not 
overlap.  This means you could roll them into one statement like this:

-- swap (c1, c2) AND swap (c4, c5);
UPDATE Foobar 
   SET c1 = CASE WHEN c1 <= c2 THEN c1 ELSE c2 END, 
       c2 = CASE WHEN c1 <= c2 THEN c2 ELSE c1 END, 
       c4 = CASE WHEN c4 <= c5 THEN c4 ELSE c5 END, 
       c5 = CASE WHEN c4 <= c5 THEN c5 ELSE c4 END
 WHERE c4 > c5 OR c1 > c2;

The advantage of doing this is that you have to execute only one UPDATE 
statement and not two.  Updating a table, even on non-key columns, usually 
locks the table and prevents other users from getting to the data.  If you 
could roll the statements into one single UPDATE, you would have the best of 
all possible worlds, but I doubt that the code would be easy to read.  

We can see this same non-overlapping pattern in this pair of statements:  

swap (c1, c3); 
swap (c2, c5);
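
Since c1/c3 and c2/c5 do not overlap either, substituting them into the same 
template gives this combined statement:

-- swap (c1, c3) AND swap (c2, c5);
UPDATE Foobar 
   SET c1 = CASE WHEN c1 <= c3 THEN c1 ELSE c3 END, 
       c3 = CASE WHEN c1 <= c3 THEN c3 ELSE c1 END, 
       c2 = CASE WHEN c2 <= c5 THEN c2 ELSE c5 END, 
       c5 = CASE WHEN c2 <= c5 THEN c5 ELSE c2 END
 WHERE c1 > c3 OR c2 > c5;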

But there are other patterns, so you can write general templates for them.  
Consider this one:

swap (x, z);
swap (x, y);

If you write out all possible triplets and apply these two operations to 
them, we get:

(x, y, z) => (x, y, z)
(x, z, y) => (x, z, y)
(y, x, z) => (x, y, z)
(y, z, x) => (x, z, y)
(z, x, y) => (x, y, z) 
(z, y, x) => (x, y, z) 

The result of this pattern is that x gets the lowest of the three values, and 
y and z either stay in the same relative position to each other or get sorted 
properly.  Getting them properly sorted has the advantage of saving exchanges 
later and of reducing the size of the subset of rows operated upon by each 
UPDATE statement.  With a little thought, we can write this symmetric piece 
of code. 

-- swap (x, z) AND swap (x, y);
UPDATE Foobar 
   SET x = CASE WHEN x BETWEEN y AND z THEN y
                WHEN z BETWEEN y AND x THEN y
                WHEN y BETWEEN z AND x THEN z
                WHEN x BETWEEN z AND y THEN z
                ELSE x END, 
       y = CASE WHEN x BETWEEN y AND z THEN x
                WHEN x BETWEEN z AND y THEN x
                WHEN z BETWEEN x AND y THEN z
                WHEN z BETWEEN y AND x THEN z
                ELSE y END, 
       z = CASE WHEN x BETWEEN z AND y THEN y
                WHEN z BETWEEN x AND y THEN y
                WHEN y BETWEEN z AND x THEN x
                WHEN z BETWEEN y AND x THEN x
                ELSE z END
 WHERE x > z OR x > y;
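
Working through the CASE branches with a concrete row helps: a row holding 
(x, y, z) = (30, 10, 20) satisfies the WHERE clause and comes out as 
(10, 20, 30), while a row holding (10, 30, 20) already has its smallest value 
first, fails the WHERE clause, and is left alone.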

While it is very tempting to write more and more of these pattern templates, 
it might be more trouble than it is worth because of the cost in maintenance 
and readability.  

Here is a 'C' program for the Bose-Nelson sort as given in THE C/C++ USERS 
JOURNAL (1993 February issue, "Sorting Networks" by Frederick Hegeman).

#include <stdio.h>   /* printf() is used to emit the swap pairs */

Bose(n) int n;
/* Calling Bose(n) generates a network
   to sort n items. See R. C. Bose & R. J. Nelson,
   "A Sorting Problem", JACM Vol. 9, pp. 282-296. */

{ Pstar(1, n); /* sort the sequence (X1,...,Xn) */
}

P(i, j) int i, j;
{ printf("swap(%d, %d);\n", i, j); }

Pstar(i, m)
int i;  /* value of first element in sequence */
int m;  /* length of sequence */
{ int a;
  if(m > 1)
    { /* Partition into 2 shorter sequences,
       * generate a sorting method for each,
       * and merge the two sub-networks. */
    a = m/2;
    Pstar(i, a);
    Pstar((i + a), (m - a));
    Pbracket(i, a, (i + a), (m - a));
    }
}

Pbracket(i, x, j, y)
int i;  /* value of first element in sequence 1 */
int x;  /* length of sequence 1 */
int j;  /* value of first element in sequence 2 */
int y;  /* length of sequence 2 */
{ int a, b;

  if(x == 1 && y == 1)
    P(i, j); /* 1 comparison sorts 2 items */
  else if(x == 1 && y == 2)
       { /* 2 comparisons inserts an item into an
          * already sorted sequence of length 2. */
        P(i, (j + 1));
        P(i, j);
        }
       else if(x == 2 && y == 1)
              { /* As above, but inserting j */
               P(i, j);
               P((i + 1), j);
               }
            else  { 
/* Recurse on shorter sequences, attempting
 to make the length of one subsequence odd
 and the length of the other even. If we
 can do this, we eventually merge the two. */
                   a = x/2;
                   b = (x & 1) ? (y/2) : ((y + 1)/2);
                   Pbracket(i, a, j, b);
                   Pbracket((i + a), (x - a), (j + b), (y - b));
                   Pbracket((i + a), (x - a), j, b);
                   }
} /* End of Bose */
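
As a check, compiling this listing and calling Bose(5) from a small driver 
emits exactly the nine swap pairs shown at the start of this section, in the 
same order, with the columns identified by position number 1 through 5.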

