OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

ubl-ndrsc message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [Elist Home]


Subject: Re: [ubl-ndrsc] Functional Dependency and Normalization


whew indeed! a truly awesome effort bill (and thank you joe for your input). 

I am currently writing a briefing paper for the next LC/NDR meeting on 'container types of grouped elements based on functional dependency' (or some snappier title if i can think of one). in particular, i am trying to provide a pragmatic view for those of us who want to experience salvation without the full religous experience of conversion to believe in the one true Codd.  I dont want people to think that until they have gone through all this material they will not be able to apply the concepts.  As you say, when we scratch the surface we see how modeling data structures is neither new nor obscure.

If you don't mind, I will incorporate some of your thinking into my briefing paper.   I specifically like the way you categorize the similarities and differences between database design and document schema design.



Burcham, Bill wrote:
40AC2C8FB855D411AE0200D0B7458B2B073455F4@scidalmsg01.csg.stercomm.com">
The theory of functional dependency and normalization has been around a long time and is widely and deeply covered in the literature.  The Library Content Methodology position paper proposes that we apply this theory to the modeling effort in UBL.  Previously, there seemed to be little consensus that this traditional data modeling theory applied to UBL. 
 
Near the end of today's joint call , however, there seemed to be a shift in sentiment.  What I am now hearing is that a) there is a feeling that we need some rules to help us decide when a new Class or ABIE is needed (or when a couple Classes/ABIE's should be collapsed into one) and b) since functional dependency and normalizatoin theory is the only proposal on the table, that folks might be interested in hearing more about what it actually is and how to apply it.
 
To that end I want to augment the presentation in the Library Content Methodology position paper with some deeper coverage from other sources.  Feedback from many early reviewers of the paper said that the normalization discussion didn't make sense to them, or that they failed to see the relevance to our work.  My hope is that by providing a couple more perspectives on the same topic we can together generate a shared understanding of it, and use that understanding to clearly formulate a "bridging theory" -- tying normalizaton theory to the task at hand in a way we all perceive as valuable.
 
Joe Celko , author of SQL For Smarties has graciously agreed to provide the attached material on functional dependency and normalization theory.  It takes some work to get through it, but it's worth the effort.  I chose Joe's work because it strikes a nice balance between formalism and approachability.  Thank you Joe Celko!
 
Ullman and Widom's  A First Course in Database Systems  also has an excellent, understandable, but more formal presentation on the subject.  Also, Chapter 4 Other Data Models describes pre-relational data models.  Looking at that I'm struck by the impression that the old hierarchical database model is a lot like that of an XML document instance, and that the old network database model is similar to an XML document instance with ID and IDREF (or XLink).  In the 2nd ed of the book, they actually discuss XML.  Unfortunately, I've only got the 1st ed.
 
While the authors of A First Course... are unable to provide us with electronic copy of their work, there are slides and course notes available online.  If you've read the book, then the online stuff makes sense.  Unfortunately, if you're new to the subject the online material isn't all that useful.  On the upside, Jeff Ullman has expressed willingness of his group to provide us tutorial assistance (live).  Thank you Jeff Ullman!  Depending upon our group's response and needs I can coordinate such a tutorial.
 
Here is an outline of the pertinent sections of A First Course... (included here since it isn't included on the Amazon page):
 
3.4 Functional Dependencies
3.4.1 Definition of Functional Dependency
3.4.2 Keys of Relations
3.4.3 Superkeys
3.4.4 Discovering Keys for Relations
3.4.5 Exercises for Section 3.4
3.5 Rules About Functional Dependencies
3.5.1 The Splitting/Combining Rule
3.5.2 Trivial Functional Dependencies
3.5.3 Computing the Closure of Attributes
3.5.4 Why the Closure Algorithm Works
3.5.5 The Transitive Rule
3.5.6 Closing Sets of Functional Dependencies
3.5.7 Projecting Functional Dependencies
3.5.8 Exercises for Section 3.5
3.6 Design of Relational Database Schemas
3.6.1 Anomalies
3.6.2 Decomposing Relations
3.6.3 Boyce-Codd Normal Form
3.6.4 Decomposition into BCNF
3.6.5 Recovering Information from a Decomposition
3.6.6 Third Normal Form
3.6.7 Exercises for Section 3.6
3.7 Multivalued Dependencies
3.7.1 Attribute Independence and Its Consequent Redundancy
3.7.2 Definition of Multivalued Dependencies
3.7.3 Reasoning About Multivalued Dependencies
3.7.4 Fourth Normal Form
3.7.5 Decomposition into Fourth Normal Form
3.7.6 Relationships Among Normal Forms
 
You'll find that Celko covers much of the same ground as Ullman and Widom though with less depth and rigor.  Each has value.
 
I just read the attachment.  There are a number of areas where a mapping is required from the relational model to our hierarchical, single-document one. 
 
One example is the discusson around Multivalued Dependencies or MVD's.  This is a big deal in the relational model, whereas in the XML one we (almost) always just automatically use an element with numOccurs='unbounded'. 
 
Another example is that the arguments around "data anomalies" (insert, update, delete anomalies) don't  seem pertinent to our situation where no one "inserts" into a business document.  Rather, we form the business document all at once.  Same issue for updates -- we don't update a business document: we form it and we send it.  I believe that we can motivate the applicability of the theory for our situation, but I believe that to do so we will have to rely on arguments not about data anomalies but about maintainability of processing code (e.g. stylesheets) and about specialization of the vocabulary (specializations, ability to apply context methodology )
 
Celko perhaps alludes to this sort of justification when in the section on 1st Normal Form he says:

"Yes, there are some ways to get around these (normalization) problems without changing the tables. We could permit NULLs in the table. We could write routines to check the table for false data. These are tricks that will only get worse as the data and the relationships become more complex. The solution is to break the table up into other tables, each of which represents one relationship or simple fact. "

The implication here is that while it's true that normalization neutralizes data anomalies, and the value of that to UBL may be difficult to see, normalization also contributes to a situation where the model and not the processing logic enforces to a great extent, the consistent structure of the data.  And I think the value of that is easier to see.
 
It's also interesting that in a number of places in the document, Celko points out that the presence of NULL's in data (analogous to optionality in the XML schema) is a red flag -- it usually denotes poor normalization.  It was precisely this analogue (NULL's in data imply poor normalization and that's analogous to optionality in the XML schema) that struck me in Barcelona when someone commented in the plenary that a particular UBL model seemed awfully "flat" and that "everything was optional".

"Avoid NULLs whenever possible. If a table has too many NULLable columns, it is probably not normalized properly. Try to use a NULL only for a value which is missing now, but which will be resolved later. Even better, put missing values into the encoding schemes for that column, as discussed in the section of this book on encoding schemes."

Also, I like this blurb that echoes Jon Bosak's RosettaNet user group annecdote today (they demanded smaller content models):

"A normalized database will tend to have a lot of tables with a small number of columns per table. Don't panic when you see that happen. People who first worked with file systems (particularly on computers which used magnetic tape) tend to design one monster file for an application and do all the work against those records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same.  "

So let those ABIE's/Classes/Relations proliferate!  Let them be small and cohesive -- and let the users go forth and be happy.
 
Finally, here's a mapping of terms between the two, er, four really, worlds.  It ain't perfect, but it's pretty darn good.
 
Pure Relational Model Relational Database Model CC 1.80 XML
relation table ABIE complex type definition
attribute column property XSD (local) element declaration
tuple row instance? element content (in instance document)
NULLABLE NULLABLE N/A XSD minOccurs='0'
NULL NULL N/A (element not present in instance document)
MVD MVD N/A XSD maxOccurs='unbounded'
 
Whew!  'nuff said.
 

 

Bill Burcham
Sr. Software Architect, Standards and Applied Technology
Sterling Commerce , Inc.
469.524. 2164
bill_burcham@stercomm.com

 






Working draft of Chapter 2 of Joe Celko's SQL For Smarties.

02. Normalization

The Relational Model and the Normal Forms of the Relational Model were first
defined by Dr. E. F. Codd (Codd 1970), then extended by other writers after
him. He invented the term "normalized relations" by borrowing from the
political jargon of the day. A branch of mathematics called relations deals
with mappings among sets defined by predicate calculus from formal logic.
Just as in an algebraic equation, there are many forms of the same relational
statement, but the "normal forms" of relations are certain formally defined
desirable constructions. The goal of normal forms is to avoid certain data
anomalies that can occur in unnormalized tables. Data anomalies are easier
to explain with an example, but first please be patient while I define some
terms.A predicate is a statement of the form A(X), which means that X has the
property A. For example, "John is from Indiana" is a predicate statement;
here, "John" is the subject and "is from Indiana" is the predicate. A
relation is a predicate with two or more subjects. "John and Bob are
brothers" is an example of a relation. The common way of visualizing a set
of relational statements is as a table where the columns are attributes of
the relation and each row is a specific relational statement.When Dr. Codd
defined the relational model, he gave 12 rules for the visualization of the
relation as a table:

0. (Yes, there is a rule zero.) For a system to qualify as a relational
database management system, that system must use its relational facilities
(exclusively) to manage the database. SQL is not so pure on this rule, since
you can often do procedural things to the data.

1. The information rule: This simply requires all information in the
database to be represented in one and only one way, namely by values in
column positions within rows of tables. SQL is good here.

02. The guaranteed access rule: This rule is essentially a restatement of
the fundamental requirement for primary keys. It states that every
individual scalar value in the database must be logically addressable by
specifying the name of the containing table, the name of the containing
column, and the primary key value of the containing row. SQL follows this
rule for tables that have a primary key, but does not require a table to have
a key at all.

3. Systematic treatment of NULL values: The DBMS is required to support a
representation of missing information and inapplicable information that is
systematic, distinct from all regular values, and independent of datatype.
It is also implied that such representations must be manipulated by the DBMS
in a systematic way. SQL has a NULL that is us ed for both missing
information and inapplicable information, rather than having two separate
tokens as Dr. Codd wished.

4. Active online catalog based on the relational model: The system is
required to support an online, inline, relational catalog that is accessible
to authorized users by means of their regular query language. SQL does
this.

5. The comprehensive data sublanguage rule: The system must support at least
one relational language that (a) has a linear syntax, (b) can be used both
interactively and within application programs, and (c) supports data
definition operations (including view definitions), data manipulation
operations (update as well as retrieval), security and integrity constraints,
and transaction management operations (begin, commit, and rollback). SQL is
pretty good on this point, since all of the operations Codd defined can be
written in the DML (Data Manipulation Language).
< br>6. The view updating rule: All views that are theoretically updatable must
be updatable by the system. SQL is weak here, and has elected to standardize
on the safest case. View updatability is a very complex problem.

7. High-level insert, update, and delete: The system must support
set-at-a-time INSERT, UPDATE, and DELETE operators. SQL does this.

8. Physical data independence: This is self-explanatory. Any real product
is going to have some physical dependence, but SQL is better than most
programming languages on this point.

9. Logical data independence: This is self-explanatory. SQL is quite good
about this point.

10. Integrity independence: Integrity constraints must be specified
separately from application programs and stored in the catalog. It must be
possible to change such constraints as and when appropriate without
unnecessarily affecting existing applications. SQL-92 has this.

11. Distribution independence: Existing applications should continue to
operate successfully (a) when a distributed version of the DBMS is first
introduced and (b) when existing distributed data is redistributed around the
system. We are just starting to get distributed versions of SQL, so it is a
little early to say whether SQL will meet this criterion or not.

12. The non-subversion rule: If the system provides a low-level
(record-at-a-time) interface, that interface cannot be used to subvert the
system (e.g., bypassing a relational security or integrity constraint).
SQL-92 is good about this one.

Codd also specified 9 structural features, 3 integrity features, and 18
manipulative features, all of which are required as well. He later extended
the list from 12 rules to 333 in the second version of the relational model.
This section is getting too long and you can look them up for yourself.

Normal fo rms are an attempt to make sure that you do not destroy true data or
create false data in your database. One of the ways of avoiding errors is to
represent a fact only once in the database, since if a fact appears more than
once, one of the instances of it is likely to be in error -- a man with two
watches can never be sure what time it is.

This process of table design is called normalization. It is not mysterious,
but it can get complex. You can buy CASE tools to help you do it, but you
should know a bit about the theory before you use such a tool.

02.1 Functional and Multivalued Dependencies

A normal form is a way of classifying a table based on the functional
dependencies (FDs for short) in it. A functional dependency means that if I
know the value of one attribute, I can always determine the value of another.
The notation used in relational theory is an arrow between the two
attributes, for example A -& gt; B, which can be read in English as "A determines
B". If I know your employee number, I can determine your name; if I know a
part number, I can determine the weight and color of the part; and so forth.

A multivalued dependency (MVD) means that if I know the value of one
attribute, I can always determine the values of a set of another attribute.
The notation used in relational theory is a double-headed arrow between the
two attributes, for instance A .. B, which can be read in English as "A
determines many Bs". If I know a teacher's name, I can determine a list of
her students; if I know a part number, I can determine the part numbers of
its components; and so forth.

02.2 First Normal Form (1NF)

Consider a requirement to maintain data about class schedules. We are
required to keep the course, section, department, time, room, professor,
student, major, and grade. Suppose that we initially set up a Pascal file
with records that look like this:

Classes = RECORD
course: ARRAY [1:7] OF CHAR;
section: CHAR;
time: INTEGER;
room: INTEGER;
roomsize: INTEGER;
professor: ARRAY [1:25] OF CHAR;
department: ARRAY [1:10] OF CHAR;
Students : ARRAY [1:classsize]
OF RECORD
student ARRAY [1:25] OF CHAR;
major ARRAY [1:10] OF CHAR;
grade CHAR;
END;
END;

This table is not in the most basic normal form of relational databases.
First Normal Form (1NF) means that the table has no repeating groups. That
is, every column is a scalar (or atomic) value, not an array or a list or
anything with its own structure. In SQL, it is impossible not to be in 1NF
unless the vendor has added array or other extensions to the language. The
Pascal reco rd could be "flattened out" in SQL to look like this:

CREATE TABLE Classes
(course CHAR(7) NOT NULL,
section CHAR(1) NOT NULL,
time INTEGER NOT NULL,
room INTEGER NOT NULL,
roomsize INTEGER NOT NULL,
professor CHAR(25) NOT NULL,
department CHAR(10) NOT NULL,
student CHAR (25) NOT NULL,
major CHAR(10) NOT NULL,
grade CHAR(1) NOT NULL);

This table is acceptable to SQL. In fact, we can locate a row in the table
with a combination of (course, section, student), so we have a key. But what
we are doing is hiding the Students record array, which has not changed its
nature by being flattened. There are problems.

If Professor Jones of the math department dies, we delete all his rows from
the Classes table. This also deletes the information that all his students
were taking a math class and maybe not all of them wanted to drop out of the
class just yet. I am deleting more than one fact from the database. This is
called a deletion anomaly.

If student Wilson decides to change one of his math classes, formerly taught
by Professor Jones, to English, we will show Professor Jones as an instructor
in both the math and the English departments. I could not change a simple
fact by itself. This creates false information, and is called an update
anomaly.

If the school decides to start a new department, which has no students yet,
we cannot put in the data about the professors we just hired until we have
classroom and student data to fill out a row. I cannot insert a simple fact
by itself. This is called an insertion anomaly.

There are more problems in this table, but you see the point. Yes, there are
some ways to get around these problems without changing the tables. We could
permit NULLs in the table. We could write routines to check the table for
false data. These are tricks that will only get worse as the data and the
relationships become more complex. The solution is to break the table up
into other tables, each of which represents one relationship or simple fact.

02.2.1 Note on Repeated Groups

The definition of 1NF is that the table has no repeating groups and that all
columns are scalar. This means no arrays, linked lists, tables within
tables, or record structures, like those you would find in other programming
languages. As I have already said, this is very easy to avoid in SQL, since
arrays and structured data are simply not supported.

The way you "fake it" in SQL is to use a group of columns which all the
members of the group have the same semantic value; that is, they represent
the same attribute in the table. Consider the table of an employee and his
children:

CREATE TABLE Employees
(empno INTEGER NOT NULL,
empname CHAR(30) NOT NULL,
...
child1 CHAR(30), birthday1 DATE, sex1 CHAR(1),
child2 CHAR(30), birthday2 DATE, sex2 CHAR(2),
child3 CHAR(30), birthday3 DATE, sex3 CHAR(1),
child4 CHAR(30), birthday4 DATE, sex4 CHAR(1));

This looks like the layouts of many existing file system records in Cobol and
other 3GL languages. The birthday and sex information for each child is part
of a repeated group and therefore violates 1NF. This is faking a
four-element array in SQL! Most books simply stop at this point and never
bother to explain why this is good or bad; we will go into detail in chapter
23, on arrays.

Suppose I have a table with the quantity of a product sold in each month of a
particular year and I originally built the table to look like this:

CREATE TABLE Abnormal
(product CHAR(10) NOT NULL PRIMARY KEY,
bin_01 INTEGER,
bin_02 INTEGER,
...
bin_12 INTEGER);

and I wanted to flatten it out into a more normalized form, like this:

CREAT E TABLE Normal
(product CHAR(10) NOT NULL,
bin_nbr INTEGER NOT NULL,
qty INTEGER NOT NULL,
PRIMARY KEY (product, bin_nbr));

I can use the statement

INSERT INTO Normal
SELECT product, 1, bin_01
FROM Abnormal
WHERE bin_01 IS NOT NULL
UNION ALL
SELECT product, 2, bin_02
FROM Abnormal
WHERE bin_02 IS NOT NULL

UNION ALL
SELECT product, 12, bin_12
FROM Abnormal
WHERE bin_12 IS NOT NULL;

While a UNION ALL query is usually slow, this has to be run only once to load
the normalized table and then the original table can be dropped.

02.3 Second Normal Form (2NF)

A table is in Second Normal Form (2NF) if it has no partial key dependencies.
That is, if X and Y are columns and X is a key, then for any Z that is a
proper subset of X, it cannot be the case that Z -> Y. Informally, the table
is in 1NF and it has a key that determines all non-key attributes in the table.

In the example, our users tell us that knowing the student and course is
sufficient to determine the section (since students cannot sign up for more
than one section of the same course) and the grade. This is the same as
saying that (student, course) -> (section, grade).

After more analysis, we also discover from our users that (student -> major)
-- students have only one major. Since student is part of the (student,
course) key, we have a partial key dependency! This leads us to the
following decomposition:

CREATE TABLE Classes
(course CHAR(7) NOT NULL,
section CHAR(1) NOT NULL,
time INTEGER NOT NULL,
room INTEGER NOT NULL,
roomsize INTEGER NOT NULL,
professor CHAR(25) NOT NULL,
PRIMARY KEY (course, section));

CREATE TABLE Enrollment
(student CHAR (25) NOT NULL,
course CHAR(7) NOT NULL,
section CHAR(1) NOT NULL,
grade CHAR(1) NOT NULL,
PRIMARY KEY (studen t, course));

CREATE TABLE Students
student CHAR (25) NOT NULL PRIMARY KEY,
major CHAR(10) NOT NULL);

At this point, we are in 2NF. Every attribute depends on the entire key in
its table. Now if a student changes majors, it can be done in one place.
Furthermore, a student cannot sign up for different sections of the same
class, because we have changed the key of Enrollment. Unfortunately, we
still have problems.

Notice that while roomsize depends on the entire key of Classes, it also
depends on room. If the room is changed for a course and section, we may
also have to change the roomsize, and if the room is modified (we knock down
a wall), we may have to change roomsize in several rows in Classes for that
room.

02.4 Third Normal Form (3NF)

Another normal form can address these problems. A table is in Third Normal
Form (3NF) if for all X -> Y, where X and Y are columns of a table, X is a < br>key or Y is part of a candidate key. (A candidate key is a unique set of
columns that identify each row in a table; you cannot remove a column from
the candidate key without destroying its uniqueness.) This implies that the
table is in 2NF, since a partial key dependency is a type of transitive
dependency. Informally, all the non-key columns are determined by the key,
the whole key, and nothing but the key.

The usual way that 3NF is explained is that there are no transitive
dependencies. A transitive dependency is a situation where we have a table
with columns (A, B, C) and (A -> B) and (B -> C), so we know that (A -> C).
In our case, the situation is that (course, section) -> room and room ->
roomsize. This is not a simple transitive dependency, since only part of a
key is involved, but the principle still holds. To get our example into 3NF
and fix the problem with the roomsize column, we make the f ollowing
decomposition:

CREATE TABLE Rooms
(room INTEGER NOT NULL PRIMARY KEY,
roomsize INTEGER NOT NULL);

CREATE TABLE Classes
(course CHAR(7) NOT NULL,
section CHAR(1) NOT NULL,
time INTEGER NOT NULL,
room INTEGER NOT NULL,
PRIMARY KEY (course, section));

CREATE TABLE Enrollment
(student CHAR (25) NOT NULL,
course CHAR(7) NOT NULL,
section CHAR(1) NOT NULL,
grade CHAR(1) NOT NULL,
PRIMARY KEY (student, course));

CREATE TABLE Students
student CHAR (25) NOT NULL PRIMARY KEY,
major CHAR(10) NOT NULL);

A common misunderstanding about relational theory is that 3NF has no
transitive dependencies. As indicated above, if X -> Y, X does not have to
be a key if Y is part of a candidate key. We still have a transitive
dependency in the example -- (room, time) -> (course, section) -- but since
the right side of the dependency is a key, it is technically in 3NF. T he
unreasonable behavior that this table structure still has is that several
courses can be assigned to the same room at the same time.

Another form of transitive dependency is a computed column. For example:

CREATE TABLE Stuff
(width INTEGER NOT NULL,
length INTEGER NOT NULL,
height INTEGER NOT NULL,
volume INTEGER NOT NULL
CHECK (width * length * height = volume),
PRIMARY KEY (width, length, height));

The volume column is determined by the other three columns, so any change to
one of the three columns will require a change to the volume column. You can
use a VIEW to create computed columns.

02.5 Case Tools for Normalization

Third Normal Form is very popular with CASE tools and most of them can
generate a schema where all of the tables are in 3NF. They obtain the FDs
from an E-R (entity-relationship) diagram or from a statistical analysis of
the existing data, then put them together into tables and check for normal
forms. It is often possible to derive more than one 3NF schema from a set of
FDs. A good CASE tool will find more than one of them, and ideally will find
the highest possible normal form schemas too. Yes, there are still more
normal forms we have not mentioned yet. Nobody said this would be easy.Some
people will argue that it is all right to "denormalize" a schema for reasons
of efficiency. For example, to get a list of all the students and their
majors in a given class, we must JOIN Enrollment and Students. The case for
leaving the solution normalized is based on reducing the programming
requirement. If we denormalize, either we are not enforcing some of the
user-required constraints or we have to write additional code to enforce the
constraints.

Not enforcing all the rules is obviously bad. If we choose to write
additional code, we have to be sure to duplicate it in a ll of the programs
that work with the DBMS. Normalization reduces programming.

02.6 Boyce-Codd Normal Form (BCNF)

A table is in BCNF when for all nontrivial FDs (X -> A), X is a superkey for
the whole schema. A superkey is a unique set of columns that identify each
row in a table, but you can remove some columns from it and it will still be
a key. Informally, a superkey is carrying extra weight.

BCNF is the normal form that actually removes all transitive dependencies. A
table is in BCNF if for all (X -> Y), X is a key -- period. We can go to
this normal form just by adding another key with UNIQUE (room, time)
constraint clause to the table Classes. There are some other interesting and
useful "higher" normal forms, but they are outside of the scope of this
discussion. In our example, we have removed all of the important anomalies
with BCNF.Third Normal Form was concerned with the relationship between key
and non-key columns. However, a column can often play both roles. Consider
a table for computing salesmen's bonus gifts that has for each salesman his
base salary, the number of sales points he has won in a contest, and the
bonus gift awarded for that combination of salary range and points. For
example, we might give a fountain pen to a beginning salesman with a base pay
rate somewhere between $15,000 and $20,000 and 100 sales points, but give a
car to a master salesman, whose salary is between $30,000 and $60,000 and who
has 200 points. The functional dependencies are, therefore,

(paystep, points) -> gift
gift -> points

Let's start with a table that has all the data in it and normalize it.

Gifts
salary points gift
==========================
$15,000 100 Pencil
$17,000 100 Pen
$30,000 200 Car
$31,000 200 Car
$32,000 200 Car

This schema is i n 3NF, but it has problems. You cannot insert a new gift
into our offerings and points unless we have a salary to go with it. If you
remove any sales points, you lose information about the gifts and salaries
(e.g., only people in the $30,000 range can win a car). And, finally, a
change in the gifts for a particular point score would have to affect all the
rows within the same pay step. This table needs to be broken apart into two
tables:

PayGifts
salary gift
===============
$15,000 Pencil
$17,000 Pen
$30,000 Car
$31,000 Car
$32,000 Car

GiftsPoints
gift points
===============
Pencil 100
Pen 100
Car 200

02.7 Fourth Normal Form (4NF)

Fourth Normal Form (4NF) makes use of multivalued dependencies. The problem
it solves is that the table has too many of them. For example, consider a
table of departments, their projects, and the parts they stock. The MVD's in
the table would be:

department ->> projects

department ->> parts

Assume that department d1 works on jobs j1, and j2 with parts p1 and p2; that
department d2 works on jobs j3, j4, and j5 with parts p2 and p4; and that
department d3 works on job j2 only with parts p5 and p6. The table would
look like this:

department job part
=============
d1 j1 p1
d1 j1 p2
d1 j2 p1
d1 j2 p2
d2 j3 p2
d2 j3 p4
d2 j4 p2
d2 j4 p4
d2 j5 p2
d2 j5 p4
d3 j2 p5
d3 j2 p6

If you want to add a part to a department, you must create more than one new
row.

Likewise, to remove a part or a job from a row can destroy information.
Updating a part or job name will also require multiple rows to be changed.

The solution is to split this table into two tables, one with (department,
projects) in it and one with (department, p arts) in it. The definition of
4NF is that we have no more than one MVD in a table. If a table is in 4NF,
it is also in BCNF.

02.8 Fifth Normal Form (5NF)

Fifth Normal Form (5NF), also called the Join-Projection Normal Form or the
Projection-Join Normal Form, is based on the idea of a lossless JOIN or the
lack of a join-projection anomaly. This problem occurs when you have an
n-way relationship, where n > 2. A quick check for 5NF is to see if the
table is in 3NF and all the candidate keys are single columns.As an example
of the problems solved by 5NF, consider a table of house notes that records
the buyer, the seller, and the lender:

HouseNotes
buyer seller lender
==================================
Smith Jones National Bank
Smith Wilson Home Bank
Nelson Jones Home Bank

This table is a three-way relationship, but because many CASE tools allow
only binary relati onships it might have to be expressed in an E-R diagram as
three binary relationships, which would generate CREATE TABLE statements
leading to these tables:

BuyerLender
buyer lender
=============================
Smith National Bank
Smith Home Bank
Nelson Home Bank

SellerLender
seller lender
=======================
Jones National Bank
Wilson Home Bank
Jones Home Bank

BuyerSeller
buyer seller
================
Smith Jones
Smith Wilson
Nelson Jones

The trouble is that when you try to assemble the original information by
joining pairs of these three tables together, thus:

SELECT BS.buyer, SL.seller, BL.lender
FROM BuyerLender AS BL,
SellerLender AS SL,
BuyerSeller AS BS
WHERE BL.buyer = BS.buyer
AND BL.lender = SL.lender
AND SL.seller = BS.seller;

you will recreate all the valid rows in the orig inal table, such as ('Smith',
'Jones', 'National Bank'), but there will also be false rows, such as
('Smith', 'Jones', 'Home Bank'), which were not part of the original table.
This is called a join-projection anomaly.

There are also strong JPNF and overstrong JPNF, which make use of JOIN
dependencies (JD for short). Unfortunately, there is no systematic way to
find a JPNF or 4NF schema, because the problem is known to be NP complete.

02.9 Domain-Key Normal Form (DKNF)

A functional dependency has a defined system of axioms that can be used in
normalization problems. These six axioms, known as Armstrong's axioms, are
given below:

Reflexive: X -> X
Augmentation: if X -> Y then XZ -> Y
Union: if (X -> Y and X -> Z) then X -> YZ
Decomposition: if X -> Y and Z a subset of Y, then X -> Z
Transitivity: if (X -> Y and Y -> Z) then X -> Z
Pseudotransitivity: if (X -> Y an d YZ -> W) then XZ -> W

They make good sense if you just look at them, which is something we like in
a set of axioms. In the real world, the FDs are the business rules we are
trying to model.

In the normalization algorithm for 3NF (developed by P. A. Berstein, 1976)
we use the axioms to get rid of redundant FD's. For example, if we are given:

A -> B
A -> C
B -> C
DB -> E
DAF -> E

A -> C is redundant because it can be derived from A -> B and B -> C with
transitivity. Also DAF -> E is redundant because it can be derived from DB
-> E and A -> B with transitivity (which gives us DA -> E) and augmentation
(which then allows DAF -> E). What we would like to find is the smallest set
of FDs from which we can generate all of the given rules. This is called a
non-redundant cover. For the FD's above, one cover would be:

A -> B
B -> C
DB -> E< br>
Once we do this Berstein shows that we can just create a table for each of
the FD's where A, B, and DB are the respective keys. We have taken it easy
so far but now it's time for a challenge.

As an example of a schema with more than one Third Normal Form (3NF) schema,
here is a problem that was used in a demonstration by DBStar Corporation (San
Francisco, CA). The company uses it as an example in a demonstration which
comes with their CASE tool.

We are given an imaginary and simplified airline which has a database for
scheduling flights and pilots. Most of the relationships are obvious things.
Flights have only one departure time and one destination. They can get a
different pilot and can be assigned to a different gate each day of the week.
The functional dependencies for the database are given below:

1) flight -> destination
2) flight -> hour
3) (day, flight) -> gate
4) (day, flight) - > pilot
5) (day, hour, pilot) -> gate
6) (day, hour, pilot) -> flight
7) (day, hour, pilot) -> destination
8) (day, hour, gate) -> pilot
9) (day, hour, gate) -> flight
10) (day, hour, gate) -> destination

Your problem is to find 3NF database schemas in these FD's. You have to be
careful! You have to have all of the columns, obviously, but your answer
could be in 3NF and still ignore some of the FD's. For example, this will
not work:

CREATE TABLE PlannedSchedule
(flight, destination, hour, PRIMARY KEY (flight));

CREATE TABLE ActualSchedule
(day, flight, gate, pilot, PRIMARY KEY (day, flight));

If we apply the Union axiom to some of the FD's, we get:

(day, hour, gate) -> (destination, flight, pilot)
(day, hour, pilot) -> (destination, flight, gate)

This says that the user has required that if we are given a day, an hour, and
a gate we should be able to determine a unique flight for that day, hour, and
gate. We should also be able to determine a unique flight given a day, hour,
and pilot.

Given the PlannedSchedule and ActualSchedule tables, you cannot produce
views where either of the two constraints we just mentioned are enforced. If
the query "What flight does pilot X have on day Y and hour Z?" gives you more
than one answer, it violates the FD's. Here is an example of a schema which
is allowable in this proposed schema which is undesirable given our
constraints:

PlannedSchedule
flight hour destination
=========================
118 17:00 Dallas
123 13:00 Omaha
155 17:00 Los Angeles
171 13:00 New York
666 13:00 Dis

ActualSchedule
day flight pilot gate
========================
Wed 118 Tom 12A
Wed 155 Tom 13B
Wed 171 Tom 12A
Thu 123 John 12A
Thu 155 John 12A
Thu 171 John 13B

The constraints mean that we should be able to find a unique answer to each
the following questions and not lose any information when inserting and
deleting data.

1) Which flight is leaving form gate 12A on Thursdays at 13:00 Hrs? This
looks fine until you realize that you don't know about flight 666, which was
not required to have anything about its day or pilot in the ActualSchedule
table. And likewise, I can add a flight to the ActualSchedule table that has
no information in the PlannedSchedule table.

2) Which pilot is assigned to the flight that leaves gate 12A on Thursdays at
13:00 Hrs? This has the same problem as before.

3) What is the destination of the flight in query 1 and 2? This has the same
problem as before.

4) What gate is John leaving from on Thursdays at 13:00 Hrs?

5) Where is Tom flying to on Wednesdays at 17:00 Hrs?

6) What flight is assigned to Tom on Wednesdays at 17:00 Hrs?


It might help if we gave an example of how one of the FD's in the problem can
be derived using the axioms of FD calculus, just like you would do a geometry
proof:

Given:
1) (day, hour, gate) -> pilot
2) (day, hour, pilot) -> flight

prove that:
(day, hour, gate) -> flight.

3) (day, hour) -> (day, hour); Reflexive
4) (day, hour, gate) -> (day, hour); Augmentation on 3
5) (day, hour, gate) -> (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) -> flight; Transitive 2 and 5
Q.E.D

The answer is to start by attempting to derive each of the functional
dependencies from the rest of the set. What we get is several short proofs,
each requiring different "given" functional dependencies in order to get to
the derived FD.

Here is a list of each of the proofs used to derive the ten fragmented FD's
in the problem. With each derivatio n we include every derivation step and
the legal FD calculus operation that allows me to make that step. An
additional operation that we include here which was not included in the
axioms we listed earlier is left reduction. Left reduction says that if XX
-> Y then X -> Y. The reason it was not included is that this is actually a
theorem and not one of the basic axioms (side problem: can you derive left
reduction?).

Prove: (day, hour, pilot) -> gate
a) day -> day; Reflexive
b) (day, hour, pilot) -> day; Augmentation (a)
c) (day, hour, pilot) -> (day, flight); Union (6, b)
d) (day, hour, pilot) -> gate; Transitive (c, 3)
Q.E.D.

Prove: (day, hour, gate) -> pilot
a) day -> day; Reflexive
b) day, hour, gate -> day; Augmentation (a)
c) day, hour, gate -> (day, flight); Union (9, b)
d) day, hour, gate -> pilot; Transitive (c, 4)< br>Q.E.D.

Prove: (day, flight) -> gate
a) (day, flight, pilot) -> gate; Pseudotransitivity (2, 5)
b) (day, flight, day, flight) -> gate; Pseudotransitivity (a, 4)
c) (day, flight) -> gate; Left reduction (b)
Q.E.D.

Prove: (day, flight) -> pilot
a) (day, flight, gate) -> pilot; Pseudotransitivity (2, 8)
b) (day, flight, day, flight) -> pilot; Pseudotransitivity (a, 3)
c) (day, flight) -> pilot; Left reduction (b)
Q.E.D.

Prove: (day, hour, gate) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, gate) -> (day, hour); Augmentation (a)
c) (day, hour, gate) -> (day, hour, pilot); Union (b, 8)
d) (day, hour, gate) -> flight; Transitivity (c, 6)
Q.E.D.

Prove: (day, hour, pilot) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, pilot) -> (day, hour); Augmentation (a)
c) (day, hour, pilot) -> da y, hour, gate; Union (b, 5)
d) (day, hour, pilot) -> flight; Transitivity (c, 9)
Q.E.D.

Prove: (day, hour, gate) -> destination
a) (day, hour, gate) -> destination; Transitivity (9, 1)
Q.E.D.

Prove: (day, hour, pilot) -> destination
a) (day, hour, pilot) -> destination; Transitivity (6, 1)
Q.E.D.

Now that we've shown you how to derive eight of the ten FD's from other FD's,
you can try mixing and matching the FD's into sets so that each set meets the
following criteria:

1) Each attribute must be represented on either the left or right side of at
least one FD in the set.

2) If a given FD is included in the set then all the FD's needed to derive it
cannot also be included.

3) If a given FD is excluded from the set then the FD's used to derive it
must be included.

This produces a set of "non-redundant covers", which can be found with trial
and error and common se nse. For example, if we excluded (day, hour, gate) ->
flight we must then include (day, hour, gate) -> pilot and vice versa because
each are used in the other's derivation. If you want to be sure your search
was exhaustive, however, you may want to apply a more mechanical method,
which is what the CASE tools do for you.

The algorithm for accomplishing this task is basically to generate all the
combinations of sets of the FD's. (flight -> destination) and (flight ->
hour) are excluded in the combination generation because they cannot be
derived. This gives us (2^8) or 256 combinations of FD's. Each combination
is then tested against the criteria.

Fortunately, a simple spreadsheet does all the tedious work. In this problem
the criteria one eliminates only 15 sets. Criteria two eliminates 152 sets
and criteria three drops another 67. This leaves us with 22 possible covers,
five of which are the answers we are looking for (we will explain the other
17 later). These five non-redundant covers are:

Set I:
flight -> destination
flight -> hour
(day, hour, gate) -> flight
(day, hour, gate) -> pilot
(day, hour, pilot) -> gate

Set II:
flight -> destination
flight -> hour
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
(day, hour, pilot) -> gate

Set III:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, flight) -> pilot
(day, hour, gate) -> flight

Set IV:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight

Set V:
flight -> destination
flight -> hour
(day, flight) -> pilot
(day, hour, gate) -> flight
(day, hour, pilot) -> gate
(day, hour, pilot) -> flight

At this point we perform unions on FD's with the same left hand side and make
tables for each grouping with the left hand side as a key. We can also
eliminate symmetrical FD's (defined as X -> Y and Y -> X, and written with a
two headed arrow, X <-> Y) by collapsing them into the same table.

These five non-redundant covers convert into the following five sets of 3NF
relational schemas. They are given in a shorthand SQL DDL (Data Declaration
Language) without data type declarations.

Solution 1:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot,
PRIMARY KEY (day, hour, gate));

Solution 2:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot,
PRIMARY KEY (day, hour, pilot));

Solution 3:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight,
PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, flight,
PRIMARY KEY (day, hour, pilot));

Solution 4:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, pilot,
PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, flight,
PRIMARY KEY (day, hour, pilot));

Solution 5:
CREATE TABLE R1 (flight, destination, hour,
PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight,
PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, gate,
PRIMARY KEY (day, hour, pilot));

Once you match up these solutions with the minimal covers that generated
them, you will probably notice that the first tw o solutions have transitive
dependencies. But they are still 3NF! This is a point not well understood
by most analysts. A relation is in 3NF if for each FD X -> Y then X is a
superkey OR Y is part of a candidate key. The first two solutions meet this
criteria.

You may also notice that there are no additional candidate keys defined in
any of the tables. This would make sense in the first two solutions but this
was not done (this is why they are in 3NF and not BCNF). You will find this
algorithm used in CASE tool software because SQL-89 only allowed you to
define PRIMARY KEY constraints. SQL-92 allows you to define a UNIQUE
constraint on one or more columns in addition. Most implementations of SQL
also allow the user to define unique indexes on a subset of the columns.

All of the five solutions are 3NF, but since the first two solutions leave
out two FD's. It appears that solutions without all the constrain ts are
considered valid by this particular automated normalizer. These tables could
have defined the required candidate keys with UNIQUE constraints, however.
The normalizer used to get these solutions may leave out some of the
constraints, but still generate 3NF schemas. Watch out! It is assuming
that you can handle this outside the schema or are willing to convert the
FD's to some sort of constraints.

If we are allowed to drop FD's (as this particular normalizer does) then
there are actually 22 solutions (most of which are not generated by this
normalizer). These solutions can be found by dropping attributes or whole
tables from the solutions above (note that each attribute must still be
represented in at least one table). Some of the other 17 solutions can be
generated by:

1) Dropping either or both of the last two tables in the last three solutions

2) Dropping combinations of gate, flight, and pilot w here they are not keys
(remember to keep at least one non-key in each table and make sure if an
attribute is dropped from one table it is still represented somewhere else).

3) Add UNIQUE constraints or indexes to the first two solutions.

Did you notice that even though the last three of the five given solutions to
the problem still allow some anomalous states? Consider this: In solution
three the last two tables could have day and flight combinations that are not
part of the valid day and flight list as defined in the second table. The
other solutions also have integrity problems like this.

There is a normal form that fixes this for us: Domain/Key Normal Form (DKNF)
defined by Ronald Fagin in 1981. There is not yet a general algorithm that
will always generate the DKNF solution given a set of constraints. We can,
however, determine DKNF in many special cases. Here is our DKNF solution to
the problem:

Solu tion 6:

CREATE TABLE R1
(flight, hour, destination,
UNIQUE (flight, hour, destination),
UNIQUE (flight, hour),
UNIQUE (flight));

CREATE TABLE R2
(day, flight, hour, gate, pilot,
UNIQUE (day, flight, gate),
UNIQUE (day, flight, pilot),
UNIQUE (day, flight),
FOREIGN KEY (flight, hour) REFERENCES R1(flight, hour));

Notice that this is a case of normalization by dropping a table and adding
UNIQUE constraints. The candidate key (flight, hour) may not seem necessary
with flight also defined as a candidate key in R1. This is done so that the
foreign key in R2 carries the (flight, hour) combination and not just flight.
This way, the second relation cannot contain an invalid (flight, hour)
combination.

Once we add in the foreign keys to solutions 3, 4 and 5, are all of the
anomalies removed? No, not entirely. The only solution that removes all
anomalies is still the DKNF solution. T he best way to enforce these
constraints is to collapse all but the first table into one. This way
inconsistent gate, pilot, day, hour, flight combinations cannot exist. This
is because with only one table to hold such a combination we cannot have the
problem of two such tables with many overlapping attributes disagreeing.
This is what the DKNF solution accomplishes.

02.10 Practical Hints for Normalization

CASE tools implement formal methods for doing normalization. In particular,
E-R (Entity-Relationship) diagrams are very useful for this. However, a few
informal hints can help speed up the process and give you a good start.

Broadly speaking, tables represent either entities or relationships, which is
why E-R diagrams work so well as a design tool. The tables that represent
entities should have a simple, immediate name suggested by their contents --
a table named Students has student data in it, not student d ata and bowling
scores. It is also a good idea to use plural or collective nouns as the
names of such tables to remind you that a table is a set of entities; the
rows are the single instances of them.

Tables which represent many to many relationships should named by their
contents and should be as minimal as possible. For example, Students are
related to Classes by a third (relationship) table for their attendance.
These tables might represent a pure relationship or they might contain
attributes which exist within the relationship, such as a grade for the class
attended. Since the only way to get a grade is to attend the class, the
relationship is going to have a column for it, and will be names ReportCards
or Grades.

Avoid NULLs whenever possible. If a table has too many NULLable columns, it
is probably not normalized properly. Try to use a NULL only for a value
which is missing now, but which will be reso lved later. Even better, put
missing values into the encoding schemes for that column, as discussed in the
section of this book on encoding schemes.

A normalized database will tend to have a lot of tables with a small number
of columns per table. Don't panic when you see that happen. People who
first worked with file systems (particularly on computers which used magnetic
tape) tend to design one monster file for an application and do all the work
against those records. This made sense in the old days, since there was no
reasonable way to JOIN a number of small files together without having the
computer operator mount and dismount lots of different tapes. The habit of
designing this way carried over to disk systems, since the procedural
programming languages were still the same.

The same non-key attribute in more than one table is probably a normalization
problem. This is not a certainty, just a guideline. Th e key that determines
that attribute should be in only one table, and therefore the attribute
should be with it.

As a practical matter, you are apt to see the same attribute under different
names and need to make the names uniform in the entire database. The columns
"date_of_birth", "birthdate", "birthday", and "dob" are very likely the same
attribute of an employee.

02.11 Practical Hints for Denormalization

The subject of denormalization is a great way to get into religious wars. At
one extreme, you will find relational purist who think that the idea of not
carrying a database design to at least 3NF is a crime against nature. At the
other extreme, you will find people who simply add and move columns all over
the database with ALTER statements, never keeping the schema stable.

The reason for denormalization is performance. A fully normalized database
requires a lot of JOINs to construct common VIEWs o f data from its
components. JOINs are very costly in terms of time and computer resources,
so by "pre-constructing" the JOIN in a denormalized table can save quite a
bit.

Consider this actual problem which appeared on CompuServe's ORACLE forum. A
pharmaceutical company has an inventory table, and a table of the price
changes which look like this:

CREATE TABLE Drugs
(drug_nbr INTEGER PRIMARY KEY,
drug_name CHAR(30) NOT NULL,
quantity INTEGER NOT NULL
CONSTRAINT positive_quantity
CHECK(quantity > 0),
...);

CREATE TABLE Prices
(drug_nbr INTEGER NOT NULL,
start_date DATE NOT NULL,
end_date DATE NOT NULL
CONSTRAINT started_before_endded
CHECK(start_date <= end_date),
price DECIMAL(8, 2) NOT NULL,
PRIMARY KEY (drug_nbr, start_date));

Every order has to use the order date to find what the selling price was when
the order was place d. The current price will have a value of "eternity" (a
dummy date set so high that it will not be reached like '9999-12-31'). The
(end_date + 1 DAY) of one price will be equal to the start_date of the next
price for the same drug.

While this is normalized, performance will stink. Every report, invoice or
query will have a JOIN between Drugs and Prices. The trick here is to add
more columns to the Drugs, like this:

CREATE TABLE Drugs
(drug_nbr INTEGER PRIMARY KEY,
drug_name CHAR(30) NOT NULL,
quantity INTEGER NOT NULL
CONSTRAINT posirive quantity
CHECK(quantity > 0),
currentstartdate DATE NOT NULL,
currentenddate DATE NOT NULL,
CONSTRAINT current_start_before_endded
CHECK(currentstartdate <= currentenddate),
currentprice DECIMAL(8, 2) NOT NULL,
priorstartdate DATE NOT NULL,
priorenddate DATE NOT NULL,
CONSTRAINT prior_start_before_endded
CHECK(prio rstartdate <= priorenddate),
AND (currentstartdate = priorenddate + INTERVAL 1 DAY),
priorprice DECIMAL(8, 2) NOT NULL,
...);

This covered over 95% of the orders in the actual company because very few
orders are held up more than two price changes. The odd exception was
trapped by a procedural routine.

Another good trick is to preserve the constraints which were implied by the
original tables and move them into CHECK() clauses. Consider two of the
tables from the examples used at the start of this chapter:

CREATE TABLE Enrollment (student, course, section, grade,
PRIMARY KEY (student, course));

CREATE TABLE Students (student, major, PRIMARY KEY (student));

If we want to combine them into a single denormalized table, we might start
with:

CREATE TABLE StudentEnrollment
(student, major, course, section, grade);

then add a constraint to assure that no student has exactly one major:

CHECK (NOT EXISTS (SELECT student, COUNT(major)
FROM StudentEnrollment
GROUP BY student
HAVING COUNT(major) <> 1);

and then more constraints to assure that a student is not signed up for the
same course twice and so forth. Yes, insertions will take time because of
the extra checking that will done, but this is the price you pay for speed in
doing queries.

02.11.1 Row sorting

1925 Words

Back on 2001 May 27, Fred Block posted a problem on the SQL Server newsgroup.
I will change the problem slightly, but the idea was that he had a table
with five character string columns that had to be sorted alphabetically
within each row. This "flatten table" is a very common denormalization,
which might invovle months of the year as columns, or other things which are
actign as repeating groups in violation of First Normal Form.

Let's declare the table to look like this and dive into the problem.

CREATE TABLE Foobar
(keycol INTEGER NOT NULL PRIMARY KEY,
c1 VARCHAR(20) NOT NULL,
c2 VARCHAR(20) NOT NULL,
c3 VARCHAR(20) NOT NULL,
c4 VARCHAR(20) NOT NULL,
c5 VARCHAR(20) NOT NULL);

This means that we want this condition to hold:

CHECK ((c1 <= c2) AND (c2 <= c3)
AND (c3 <= c4) AND (c4 <= c5))

Obviously, if he had added this constraint to the table in the first place,
we would be fine. Of course, that would have pushed the problem to the front
end and I would not have a topic for this section.

What was interesting was how everyone that read this newsgroup posting
immediately envisioned a stored procedure that would take the five values,
sort them and return them to their original row in the table. The only way
to make this approach work for the whole table was to write an update cursor
and loop thru all the rows of the table. Itzik Ben-Gan posted a simple
procedure that loaded the values into a temporary table, then pulled them out
in sorted order, starting with the minimum value, using a loop.

Another trick is the Bose-Nelson sort ("A Sorting Problem" by R.C. Bose and
R.J. Nelson; Journal of the ACM, vol 9 pages 282-296), which I had written
about in DR. DOBB'S JOURNAL back in 1985. This is a recursive procedure that
takes an integer and then generates swap pairs for a vector of that size. A
swap pair is a pair of position numbers from 1 to (n) in the vector which
need to be exchanged if they are out of order. This swap pairs are also
related to Sorting Networks in the literature (see THE ART OF COMPUTER
PROGRAMMING by Donald Knuth, vol 3).

You are probably thinking that this method is a bit weak because the results
are only good for sorting a fixed number of items. But a table only has a
fixed number of columns, so that is not a problem in denormalized SQL.

You can set up a sorting network that will sort five items, with the minimal
number of exchanges, nine swaps, like this:

swap (c1, c2);
swap (c4, c5);
swap (c3, c5);
swap (c3, c4);
swap (c1, c4);
swap (c1, c3);
swap (c2, c5);
swap (c2, c4);
swap (c2, c3);

You might want to deal yourself a hand of five playing cards in one suit to
see how it works. Put the cards face down on the table and pick up the
pairs, swapping them if required, then turn over the row to see that it is in
sorted order when you are done.

In theory, the minimum number of swaps needed to sort (n) items is CEILING
(log2 (n!)) and as (n) increases, this approaches O(n*log2(n)). The Computer
Science majors will remember that "Big O" expression as the expected
performance of the best sorting algorithms, such as Quicksort. The
Bose-Nelson method is very good for sma ll values of (n). If (n < 9) then it
is perfect, actually. But as things get bigger, Bose-Nelson approaches O(n ^
1.585). In English, this method is good for a fixed size list of 16 or fewer
items and goes to hell after that.

You can write a version of the Bose-Nelson procedure which will output the
SQL code for a given value of (n). The obvious direct way to do a swap() is
to write a chain of UPDATE statements. Remember that in SQL, the SET clause
assignments happen in parallel, so you can easily write a SET clause that
exchanges the two items when are out of order. Using the above swap chain,
we get this block of code:

BEGIN ATOMIC
-- swap (c1, c2);
UPDATE Foobar
SET c1 = c2, c2 = c1
WHERE c1 > c2;

-- swap (c4, c5);
UPDATE Foobar
SET c4 = c5, c5 = c4
WHERE c4 > c5;

-- swap (c3, c5);
UPDATE Foobar
SET c3 = c5, c5 = c3
WHERE c3 > c5;

-- swap (c3, c4);
UPDATE Foobar
SET c3 = c4, c4 = c3
WHERE c3 > c4;

-- swap (c1, c4);
UPDATE Foobar
SET c1 = c4, c4 = c1
WHERE c1 > c4;

-- swap (c1, c3);
UPDATE Foobar
SET c1 = c3, c3 = c1
WHERE c1 > c3;

-- swap (c2, c5);
UPDATE Foobar
SET c2 = c5, c5 = c2
WHERE c2 > c5;

-- swap (c2, c4);
UPDATE Foobar
SET c2 = c4, c4 = c2
WHERE c2 > c4;

-- swap (c2, c3);
UPDATE Foobar
SET c2 = c3, c3 = c2
WHERE c2 > c3;

END;

This fully portable, standard SQL code and it can be machine generated. But
that parallelism is useful. It is worthwhile to combine some of the UPDATE
statements. But you have to be careful not to change the effective sequence
of the swap operations.

If you look at the first two UPDATE statements, you can see that they do not
overlap. This means you could roll them into one statement like this:

-- swap (c1, c2) AND swap (c4, c5);
UPDATE Foobar
SET c1 = CASE WHEN c1 <= c2 THEN c1 ELSE c2 END,
c2 = CASE WHEN c1 <= c2 THEN c2 ELSE c1 END,
c4 = CASE WHEN c4 <= c5 THEN c4 ELSE c5 END,
c5 = CASE WHEN c4 <= c5 THEN c5 ELSE c4 END
WHERE c4 > c5 OR c1 > c2;

The advantage of doing this is that you have to execute only one UPDATE
statement and not two. Updating a table, even on non-key columns, usually
locks the table and prevents other users from getting to the data. If you
could roll the statements into one single UPDATE, you would have the best of
all possible worlds, but I doubt that the code would be easy to read.

We can see this same pattern in the pair of statements.

swap (c1, c3);
swap (c2, c5);

But there are other patterns, so you can write general templates for them.
Consider this one:

swap (x, y);
swap (x, z);

If you write out all possible triplets and apply these two operations on
them, thus.

(x, y, z) => (x, y, z)
(x, z, y) => (x, z, y)
(y, x, z) => (x, y, z)
(y, z, x) => (x, z, y)
(z, x, y) => (x, y, z)
(z, y, x) => (x, y, z)

The result of this pattern is that x is lowest value of the three values, and
y and z either stay in the same relative position to each other or get sorted
properly. Getting them properly sorted would have the advantage of saving
exchanges later and also reducing the set of the subset being operated upon
by each UPDATE statement. With a little thought, we can write this symmetric
piece of code.

-- swap (x, y) AND swap (x, z);
UPDATE Foobar
SET x = CASE WHEN x BETWEEN y AND z THEN y
WHEN z BETWEEN y AND x THEN y
WHEN y BETWEEN z AND x THEN z
WHEN x BETWEEN z AND y THEN z
ELSE x END,
y = CASE WHEN x BETWEEN y A ND z THEN x
WHEN x BETWEEN z AND y THEN x
WHEN z BETWEEN x AND y THEN z
WHEN z BETWEEN y AND x THEN z
ELSE y END,
z = CASE WHEN x BETWEEN z AND y THEN y
WHEN z BETWEEN x AND y THEN y
WHEN y BETWEEN z AND x THEN x
WHEN z BETWEEN y AND x THEN x
ELSE z END
WHERE x > z OR x > y;

While it is very tempting to write more and more of these pattern templates,
it might be more trouble than it is worth because of increased maintenance
and readability.

Here is a 'C' program for the Bose-Nelson sort as given in THE C/C++ USER'S
JOURNAL (1993 February issue, "sorting Networks" by Frederick Hegeman).

Bose(n) int n;
/ * Calling Bose(n) generates a network
to sort n items. See R. C. Bose & R. J. Nelson,
"A Sorting Problem", JACM Vol. 9, Pp. 282-296. */

{ Pstar(1, n) ; /* sort the sequence (X1,...,Xn) */
}

P(i, j) int i, j;
{ printf("swap(%d, %d);\n", i, j); }

Pstar(i, m)
int i; /* value of first element in sequence */
int m; /* length of sequence */
{ int a;
if(m > 1)
{ /* Partition into 2 shorter sequences,
* generate a sorting method for each,
* and merge the two sub-networks. */
a = m/2;
Pstar(i, a);
Pstar((i + a), (m - a));
Pbracket(i, a, (i + a), (m - a));
}
}

Pbracket(i, x, j, y)
int i; /* value of first element in sequence 1 */
int x; /* length of sequence 1 */
int j; /* value of first element in sequence 2 */
int y; /* length of sequence 2 */
{ int a, b;

if(x == 1 && y == 1)
P(i, j); /* 1 comparison sorts 2 items */
else if(x == 1 && y == 2)
{ /* 2 comparisons inserts an item into an
* already sorted sequence of length 2. */
P(i, (j + 1));
P(i, j);
}
else if(x == 2 && y == 1)
{ /* As above, but inserting j */
P(i, j);
P((i + 1), j);
}
else {
/* Recurse on shorter sequences, attempting
to make the length of one subsequence odd
and the length of the other even. If we
can do this, we eventually merge the two. */
a = x/2;
b = (x & 1) ? (y/2) : ((y + 1)/2);
Pbracket(i, a, j, b);
Pbracket((i + a), (x - a), (j + b), (y - b));
Pbracket((i + a), (x - a), j, b);
}
} /* End of Bose */

-- 
regards
tim mcgrath
fremantle  western australia 6160
phone: +618 93352228  fax: +618 93352142 



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [Elist Home]


Powered by eList eXpress LLC