Introduction to Schema Refinement

CHAPTER 5

SCHEMA REFINEMENT

Prepared By :- Vijaykumar Mantri, BVRIT, NSP

• Conceptual database design gives us a set
of relation schemas and integrity constraints
(ICs) that can be regarded as a good starting
point for the final database design.
• This initial design must be refined by taking
the ICs into account more fully than is
possible with just the ER model constructs
and also by considering performance criteria
and typical workloads.

• We now present an overview of the
problems that schema refinement is
intended to address and a refinement
approach based on decompositions.

• Redundant storage of information is the
root cause of these problems.
• Although decomposition can eliminate
redundancy, it can lead to problems of
its own and should be used with
caution.

1) Problems caused by Redundancy
• Redundant Storage
• Update Anomalies
• Insertion Anomalies
• Deletion Anomalies

Hourly_Emps (SSN, Name, Lot,
Rating, Hourly_wages, Hours_worked)

SSN  Name    Lot  Rating  Hourly_wages  Hours_worked
123  Rajesh  48   8       10            40
456  Ajay    22   8       10            30
326  Arun    35   5       7             30
434  Kamal   35   5       7             32
612  Nitin   35   8       10            40

2. Decompositions
• The problems arising from redundancy can be
solved by replacing a relation with a collection of
smaller relations.
• A Decomposition of a relation schema R
consists of replacing the relation schema by two
(or more) relation schemas that each contain a
subset of attributes of R and together include all
attributes of R.
• Hourly_Emps2 (SSN, Name, Lot, Rating,
Hours_worked)
• Wages( Rating, Hourly_wages)
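
The decomposition above can be tried out on the Hourly_Emps instance shown earlier. The following is a minimal sketch in Python, assuming relations are modelled as sets of tuples; the variable names are illustrative only. It projects the instance onto the two smaller schemas and then joins them back on Rating.

```python
# Hourly_Emps instance:
# (SSN, Name, Lot, Rating, Hourly_wages, Hours_worked)
hourly_emps = {
    (123, "Rajesh", 48, 8, 10, 40),
    (456, "Ajay", 22, 8, 10, 30),
    (326, "Arun", 35, 5, 7, 30),
    (434, "Kamal", 35, 5, 7, 32),
    (612, "Nitin", 35, 8, 10, 40),
}

# Project onto Hourly_Emps2 (SSN, Name, Lot, Rating, Hours_worked).
hourly_emps2 = {(ssn, name, lot, rating, hours)
                for (ssn, name, lot, rating, _wages, hours) in hourly_emps}

# Project onto Wages (Rating, Hourly_wages); duplicates collapse in a set.
wages = {(rating, wage) for (_, _, _, rating, wage, _) in hourly_emps}

# Natural join on Rating rebuilds the original tuples.
rejoined = {(ssn, name, lot, rating, wage, hours)
            for (ssn, name, lot, rating, hours) in hourly_emps2
            for (r, wage) in wages
            if r == rating}

print(sorted(wages))            # [(5, 7), (8, 10)]
print(rejoined == hourly_emps)  # True
```

Each (Rating, Hourly_wages) pair is now stored once in Wages instead of once per employee, and the join recovers exactly the original instance.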

Problems related to Decomposition
• Unless we are careful, decomposing a relation
schema can create more problems than it
solves.

We need to ask two questions repeatedly
1) Is there reason to decompose a relation?
• To answer this question, several normal forms
have been proposed for relations.
• If a relation schema is in one of these normal
forms, we know that certain kinds of problems
cannot arise.

2) What problems (if any) does the decomposition
cause?
• With respect to the second question, two properties
of decompositions are of particular interest. The
lossless-join property enables us to recover any
instance of the decomposed relation from
corresponding instances of the smaller relations.
• The dependency-preservation property enables us to
enforce any constraint on the original relation by
simply enforcing some constraints on each of the
smaller relations. That is, we need not perform joins
of the smaller relations to check whether a constraint
on the original relation is violated.

Functional Dependencies
• A Functional Dependency (FD) is a kind of
IC that generalizes the concept of a key.
• Let R be a relation schema & let X & Y be
nonempty sets of attributes in R. Then an
instance r of R satisfies the FD X → Y if the
following holds for every pair of tuples t1 & t2
in r:
If t1.X = t2.X then t1.Y = t2.Y

A   B   C   D
a1  b1  c1  d1
a1  b1  c1  d2
a1  b2  c2  d1
a2  b1  c3  d1

This instance satisfies AB → C.
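
The definition can be checked mechanically on a small instance. The sketch below is one possible way to do this in Python, assuming tuples are represented as dictionaries; the function name satisfies_fd is hypothetical and chosen only for illustration.

```python
from itertools import combinations

def satisfies_fd(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on lhs also agrees on rhs."""
    for t1, t2 in combinations(rows, 2):
        agree_lhs = all(t1[a] == t2[a] for a in lhs)
        agree_rhs = all(t1[a] == t2[a] for a in rhs)
        if agree_lhs and not agree_rhs:
            return False
    return True

# The four-tuple instance over attributes A, B, C, D.
instance = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
    {"A": "a1", "B": "b1", "C": "c1", "D": "d2"},
    {"A": "a1", "B": "b2", "C": "c2", "D": "d1"},
    {"A": "a2", "B": "b1", "C": "c3", "D": "d1"},
]

print(satisfies_fd(instance, ["A", "B"], ["C"]))  # True
print(satisfies_fd(instance, ["A"], ["B"]))       # False
print(satisfies_fd(instance, ["A", "B"], ["D"]))  # False
```

Note that a check like this can only show that one instance satisfies or violates an FD. An FD is a statement about all legal instances, so it cannot be deduced from a single instance.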


Closure of a Set of FDs
• We say that an FD f is implied by a given set F
of FDs if f holds on every relation instance that
satisfies all dependencies in F; that is, f holds
whenever all FDs in F hold.

• The set of all FDs implied by a given set F of
FDs is called the closure of F, denoted by F+.
• The three rules called Armstrong’s Axioms, can
be applied repeatedly to infer all FDs implied by
a set F of FDs.

Armstrong’s Axioms
Here X, Y & Z denote sets of attributes of relation
R:
• Reflexivity : If X ⊇ Y, then X → Y.
• Augmentation :
If X → Y, then XZ → YZ for any Z.
• Transitivity :
If X → Y and Y → Z, then X → Z.
Two additional rules, which follow from the three
axioms, are often convenient:
• Union : If X → Y & X → Z, then X → YZ.
• Decomposition :
If X → YZ, then X → Y & X → Z.





Contracts ( contractid, supplierid, projectid,
deptid, partid, qty, value)
This can be denoted as CSJDPQV.
The meaning of tuple is that the contract with
contractid C is an agreement that supplier S
will supply Q items of part P to project J
associated with department D, the value V of
this contract is equal to value.

• The ICs known to hold are:
1. The contract id C is a key: C → CSJDPQV
2. A project purchases a given part using a single
contract: JP → C
3. A department purchases at most one part from
a supplier: SD → P









Some additional FDs hold in the
closure of the set of given FDs:
From JP → C, C → CSJDPQV & transitivity:
JP → CSJDPQV
From SD → P & augmentation:
SDJ → JP
From SDJ → JP & JP → CSJDPQV &
transitivity:
SDJ → CSJDPQV
From C → CSJDPQV using decomposition:
C → C, C → S, C → J, etc.
We also obtain a number of trivial FDs from
reflexivity.

Attribute Closure
• If we just want to check whether a given
dependency, say, X → Y, is in the closure of a
set F of FDs, we can do so efficiently without
computing F+.
• We first compute the attribute closure X+ with
respect to F, which is the set of attributes A such
that X → A can be inferred using the Armstrong
Axioms. X → Y is in F+ exactly when Y ⊆ X+.
We can find the attribute closure using this
algorithm.

closure = X
Repeat until there is no change: {
    If there is an FD V → W in F such that
    V ⊆ closure,
    then set closure = closure ∪ W
}
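
The algorithm above can be sketched in Python as follows. This is a minimal illustration, assuming each attribute is written as a single letter and each FD is a (left side, right side) pair of strings; the function name attribute_closure is hypothetical. It is applied to the FDs of the Contracts relation CSJDPQV.

```python
def attribute_closure(attrs, fds):
    """Compute X+ for the attribute set attrs under fds, a list of (lhs, rhs) pairs."""
    closure = set(attrs)
    changed = True
    while changed:  # repeat until there is no change
        changed = False
        for lhs, rhs in fds:
            # An FD V -> W fires when V is contained in the closure so far.
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# FDs given for Contracts: C -> CSJDPQV, JP -> C, SD -> P.
contracts_fds = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]

print("".join(sorted(attribute_closure("SDJ", contracts_fds))))  # CDJPQSV
print("".join(sorted(attribute_closure("SD", contracts_fds))))   # DPS

# X -> Y is in F+ exactly when Y is a subset of X+.
print(set("CSJDPQV") <= attribute_closure("SDJ", contracts_fds))  # True
```

The closure of SDJ contains every attribute, which agrees with the derived FD SDJ → CSJDPQV; the closure of SD alone stops at SDP.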

Definitions
• Already we know definition of Key, Candidate Key
& Primary Key.
• Superkey – A superkey of a relation schema
R = {A1, A2, …, An} is a set of attributes S ⊆ R with
the property that no two distinct tuples t1 & t2 in any
legal relation state r of R will have t1[S] = t2[S].
• Prime Attribute – An attribute of relation schema
R is called a prime attribute of R if it is a member of
some candidate key of R.

In the above example, Marks is fully functionally
dependent on STUDENT# COURSE# and not on any
proper subset of STUDENT# COURSE#. This means Marks
cannot be determined by either STUDENT# or
COURSE# alone; it can be determined only by using
STUDENT# and COURSE# together.
CourseName is not fully functionally dependent on
STUDENT# COURSE#, because a subset of
STUDENT# COURSE#, namely COURSE# alone,
determines CourseName, and STUDENT# does
not have any role in deciding CourseName.

In the above relationship, CourseName,
IName and Room# are partially dependent on
the composite attribute STUDENT# COURSE#,
because COURSE# alone determines
CourseName, IName and Room#.

In the above example, Room# depends on IName,
and IName in turn depends on COURSE#.
Hence Room# transitively depends on
COURSE#.
Similarly, Grade depends on Marks, and
Marks in turn depends on STUDENT# COURSE#;
hence Grade transitively depends on
STUDENT# COURSE#.
Transitive means indirect.

Normal Forms

• First Normal Form (1NF)
– Atomic values
• Second Normal Form (2NF), Third Normal
Form (3NF) & Boyce-Codd Normal Form
(BCNF)
– based on primary keys
• Fourth Normal Form (4NF)
– based on keys, multi-valued
dependencies
• Fifth Normal Form (5NF )
– based on keys, join dependencies
• Domain-Key Normal Form

Levels of Normalization
1NF ⊃ 2NF ⊃ 3NF ⊃ 4NF ⊃ 5NF ⊃ DKNF

Each higher level is a subset of the lower level.

Normalization
• 1NF: functional dependency of nonkey attributes on
the primary key; atomic values only
• 2NF: full functional dependency of nonkey
attributes on the primary key
• 3NF: no transitive dependency between nonkey
attributes
• Boyce-Codd and higher: all determinants are
candidate keys; single multivalued dependency

Most databases should be 3NF or BCNF in
order to avoid the database anomalies.

First Normal Form (1NF)
• Historically, it is designed to disallow
– Composite attributes
– Multivalued attributes
– Or the combination of both

• All the values need to be atomic

In relational database design it is not practically
possible to have a table which is not in 1NF.

ISBN           Title         AuName   AuPhone        PubName      PubPhone      Price
0-321-32132-1  Balloon       Sleepy,  321-321-1111,  Small House  714-000-0000  $34.00
                             Snoopy,  232-234-1234,
                             Grumpy   665-235-6532
0-55-123456-9  Main Street   Jones,   123-333-3333,  Small House  714-000-0000  $22.95
                             Smith    654-223-3455
0-123-45678-0  Ulysses       Joyce    666-666-6666   Alpha Press  999-999-9999  $34.00
1-22-233700-0  Visual Basic  Roman    444-444-4444   Big House    123-456-7890  $25.00

The AuName and AuPhone columns are multivalued, so the
table is split into two tables with atomic values:

ISBN           AuName  AuPhone
0-321-32132-1  Sleepy  321-321-1111
0-321-32132-1  Snoopy  232-234-1234
0-321-32132-1  Grumpy  665-235-6532
0-55-123456-9  Jones   123-333-3333
0-55-123456-9  Smith   654-223-3455
0-123-45678-0  Joyce   666-666-6666
1-22-233700-0  Roman   444-444-4444

ISBN           Title         PubName      PubPhone      Price
0-321-32132-1  Balloon       Small House  714-000-0000  $34.00
0-55-123456-9  Main Street   Small House  714-000-0000  $22.95
0-123-45678-0  Ulysses       Alpha Press  999-999-9999  $34.00
1-22-233700-0  Visual Basic  Big House    123-456-7890  $25.00
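
The conversion to 1NF shown above can be sketched in Python. This is only an illustration, assuming the multivalued cells are held as lists and that author names and phone numbers pair up by position; the variable names are hypothetical.

```python
# Non-1NF rows: (ISBN, Title, [AuName], [AuPhone], PubName, PubPhone, Price)
books = [
    ("0-321-32132-1", "Balloon", ["Sleepy", "Snoopy", "Grumpy"],
     ["321-321-1111", "232-234-1234", "665-235-6532"],
     "Small House", "714-000-0000", "$34.00"),
    ("0-55-123456-9", "Main Street", ["Jones", "Smith"],
     ["123-333-3333", "654-223-3455"],
     "Small House", "714-000-0000", "$22.95"),
    ("0-123-45678-0", "Ulysses", ["Joyce"], ["666-666-6666"],
     "Alpha Press", "999-999-9999", "$34.00"),
    ("1-22-233700-0", "Visual Basic", ["Roman"], ["444-444-4444"],
     "Big House", "123-456-7890", "$25.00"),
]

# One atomic row per (ISBN, author, phone).
book_authors = [(isbn, name, phone)
                for isbn, _title, names, phones, *_rest in books
                for name, phone in zip(names, phones)]

# The single-valued attributes stay with the book.
book_info = [(isbn, title, pub, pub_phone, price)
             for isbn, title, _names, _phones, pub, pub_phone, price in books]

print(len(book_authors))  # 7
print(book_authors[1])    # ('0-321-32132-1', 'Snoopy', '232-234-1234')
print(len(book_info))     # 4
```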

Result Table

Second Normal Form (2NF)

• fd1 and fd4 are partial functional
dependencies. Normalize to:
– Emp (eno, ename, title, bdate, salary, supereno,
dno)
– WorksOn (eno, pno, resp, hours)
– Proj (pno, pname, budget)

Old Scheme → {Studio, Movie, Budget, Studio_City}
1. Key → {Studio, Movie}
2. {Studio, Movie} → {Budget}
3. {Studio} → {Studio_City}
4. Studio_City is not a part of a key
5. Studio_City functionally depends on Studio, which is a
proper subset of the key

New Scheme → {Studio, Movie, Budget}
New Scheme → {Studio, Studio_City}

Scheme → {City, Street, HouseNumber,
HouseColor, CityPopulation}
1. Key → {City, Street, HouseNumber}
2. {City, Street, HouseNumber} → {HouseColor}
3. {City} → {CityPopulation}
4. CityPopulation does not belong to any key
5. CityPopulation is functionally dependent on City,
which is a proper subset of the key

New Scheme → {City, Street, HouseNumber,
HouseColor}
New Scheme → {City, CityPopulation}
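
Partial dependencies like the ones in the two examples above can be found by computing the attribute closure of each proper subset of the key. The following Python sketch assumes the given key is the only candidate key and that FDs are (left side, right side) tuples of attribute names; the helper names closure and partial_dependencies are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) tuples of names."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def partial_dependencies(key, fds):
    """Map each proper, nonempty subset of key to the non-key attributes it determines."""
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            extra = closure(subset, fds) - set(key)
            if extra:
                found[subset] = extra
    return found

key = {"City", "Street", "HouseNumber"}
fds = [
    (("City", "Street", "HouseNumber"), ("HouseColor",)),
    (("City",), ("CityPopulation",)),
]

violations = partial_dependencies(key, fds)
print(violations[("City",)])      # {'CityPopulation'}
print(("Street",) in violations)  # False
```

Larger subsets that contain City, such as (City, Street), are reported as well, since they also determine CityPopulation; the smallest reported subset identifies the attributes to move into the new scheme.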

Third Normal Form (3NF)
• Third normal form (3NF) is based on the
concept of transitive dependency.
• A functional dependency X → Y in a
relation schema R is a transitive dependency
if there is a set of attributes Z that is neither
a candidate key nor a subset of any key of
R, and both X → Z and Z → Y hold.
• Definition : A relation schema R is in 3NF if
it satisfies 2NF and no nonprime attribute of
R is transitively dependent on the primary
key.

Let R be a relation schema, F be the
set of FDs given to hold over R, X be a
subset of the attributes of R and A be an
attribute of R.
R is in third normal form if, for every FD
X → A in F, one of the following statements is
true:

• A ∈ X, that is, it is a trivial FD, or
• X is a superkey, or
• A is part of some key for R.

Result Table

RESULTMARKS TABLE


fd2 results in a transitive dependency eno →
salary. Remove it.

Scheme → {Title, PubID, PageCount, Price}
1. Key → {Title, PubID}
2. {Title, PubID} → {PageCount}
3. {PageCount} → {Price}
4. Both Price and PageCount depend on a key hence 2NF
5. Transitively {Title, PubID} → {Price} hence not in 3NF

New Scheme → {PubID, PageCount, Price}
New Scheme → {Title, PubID, PageCount}

Scheme → {BuildingID, Contractor, Fee}
1. Primary Key → {BuildingID}
2. {BuildingID} → {Contractor}
3. {Contractor} → {Fee}
4. {BuildingID} → {Fee}
5. Fee transitively depends on the BuildingID
6. Both Contractor and Fee depend on the entire key hence 2NF

New Scheme → {BuildingID, Contractor}
New Scheme → {Contractor, Fee}
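
The three conditions in the 3NF definition (trivial FD, superkey on the left side, or prime attribute on the right side) can be tested mechanically. The sketch below is one possible Python illustration, applied to the BuildingID example; it finds candidate keys by brute force, which is only practical for small schemas, and all function names are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) tuples of names."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Brute-force search for the minimal attribute sets whose closure is the schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = set(combo)
            is_superkey = closure(candidate, fds) == set(schema)
            if is_superkey and not any(k <= candidate for k in keys):
                keys.append(candidate)
    return keys

def violations_3nf(schema, fds):
    """Return (X, A) pairs for FDs X -> A that meet none of the three 3NF conditions."""
    prime = set().union(*candidate_keys(schema, fds))
    bad = []
    for lhs, rhs in fds:
        for a in rhs:
            trivial = a in lhs
            superkey = closure(lhs, fds) == set(schema)
            if not (trivial or superkey or a in prime):
                bad.append((lhs, a))
    return bad

schema = {"BuildingID", "Contractor", "Fee"}
fds = [(("BuildingID",), ("Contractor",)), (("Contractor",), ("Fee",))]
print(violations_3nf(schema, fds))  # [(('Contractor',), 'Fee')]

# After the decomposition, each smaller scheme passes the same test.
print(violations_3nf({"BuildingID", "Contractor"},
                     [(("BuildingID",), ("Contractor",))]))  # []
print(violations_3nf({"Contractor", "Fee"},
                     [(("Contractor",), ("Fee",))]))          # []
```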

Boyce-Codd Normal Form (BCNF)
• Most 3NF relations are also BCNF
relations.
• A 3NF relation can fail to be in BCNF only if:
– Candidate keys in the relation are composite
keys (they are not single attributes),
– There is more than one candidate key in the
relation, and
– The keys are not disjoint, that is, some
attributes in the keys are common.

• Let R be a relation schema, F be the set of FDs
given to hold over R, X be a subset of the
attributes of R and A be an attribute of R. R is in
Boyce-Codd normal form if, for every FD X → A in
F, one of the following statements is true:
– A ∈ X, that is, it is a trivial FD, or
– X is a superkey.
• The difference between 3NF and BCNF is that 3NF
allows an FD X → Y to remain in the relation if X is a
superkey or Y is a prime attribute. BCNF only
allows this FD if X is a superkey.
• Thus, BCNF is more restrictive than 3NF.
However, in practice most relations in 3NF are also
in BCNF.

BCNF versus 3NF
• We can always decompose to BCNF, but sometimes we
do not want to because the decomposition loses an FD.
• The decision to use 3NF or BCNF depends on the
amount of redundancy we are willing to accept and
the willingness to lose a functional dependency.
• Note that we can always preserve the lossless-join
property (recovery) with a BCNF decomposition,
but we do not always get dependency preservation.
• In contrast, we can get both recovery and dependency
preservation with a 3NF decomposition.

An example of not having dependency preservation with
BCNF:
Scheme → {City, Street, ZipCode}
1. Key1 → {City, Street}
2. Key2 → {ZipCode, Street}
3. No non-key attribute hence 3NF
4. {City, Street} → {ZipCode}
5. {ZipCode} → {City}
6. Dependency between attributes belonging to a key
New Scheme1 → {ZipCode, Street}
New Scheme2 → {ZipCode, City}
The FD {City, Street} → {ZipCode} can no longer be checked
within either of the new schemes.
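
The BCNF condition for this example can be checked with a short Python sketch. It assumes FDs are (left side, right side) tuples of attribute names; the helper names closure and bcnf_violations are hypothetical. An FD violates BCNF when it is nontrivial and its left side is not a superkey.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) tuples of names."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Return the nontrivial FDs whose left side is not a superkey of schema."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not set(rhs) <= set(lhs) and closure(lhs, fds) != set(schema)]

schema = {"City", "Street", "ZipCode"}
fds = [(("City", "Street"), ("ZipCode",)), (("ZipCode",), ("City",))]

print(bcnf_violations(schema, fds))  # [(('ZipCode',), ('City',))]

# The scheme {ZipCode, City} with ZipCode -> City has no violation.
print(bcnf_violations({"ZipCode", "City"}, [(("ZipCode",), ("City",))]))  # []
```

Only {ZipCode} → {City} is reported: {City, Street} is a key, so the first FD satisfies BCNF, while ZipCode alone does not determine Street.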

• Consider the relation schema LOTS1A
shown in Figure, which describes land for sale
in various countries. Suppose that there are
two candidate keys:
PROPERTY_ID#
and {COUNTY_NAME, LOT#};
that is, lot numbers are unique only within
each country, but PROPERTY_ID numbers
are unique across all countries.

• Suppose that we have thousands of lots in
the relation but the lots are from only two
countries: Nepal & Sri Lanka.
• Suppose also that lot sizes in Nepal are only
0.5, 0.6, 0.7, 0.8, 0.9, and 1.0 acres,
whereas lot sizes in Sri Lanka are restricted to
1.1, 1.2, ... , 1.9, and 2.0 acres.
• In such a situation we would have the
additional functional dependency FD3: AREA
→ COUNTY_NAME.


• If we add this to the other dependencies, the
relation schema LOTS1A still is in 3NF
because COUNTY_NAME is a prime attribute.

• The fact that the area of a lot determines the
country, as specified by FD3, can be represented by
16 tuples in a separate relation R(AREA,
COUNTY_NAME), since there are only 16
possible AREA values. This representation
reduces the redundancy of repeating the same
information in the thousands of LOTS1A tuples.
• We can decompose LOTS1A into two BCNF
relations LOTS1AX and LOTS1AY.


This decomposition loses the functional dependency
FD2 because its attributes no longer coexist in the same
relation after decomposition.

• Consider a relation R with attributes ABC that is
decomposed into AB and BC, with F = {A → B, B → C,
C → A}.
• The closure of F contains all dependencies in F plus
A → C, B → A & C → B.
• Consequently FAB also contains B → A & FBC
contains C → B. Therefore FAB ∪ FBC contains
A → B, B → C, B → A & C → B.
• The closure of the dependencies in FAB & FBC now
includes C → A (from C → B, B → A & transitivity).
• Thus the decomposition preserves the dependency
C → A.
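
Dependency preservation can also be tested without computing the projections FAB and FBC explicitly. The Python sketch below uses a standard closure-based test, which is a technique swapped in here rather than the derivation used in the text. It takes F = {A → B, B → C, C → A} and the decomposition of ABC into AB and BC as the setup of the example; the function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_preserved(fd, fds, decomposition):
    """Test whether fd follows from the projections of fds onto the decomposition."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            # Attributes of ri derivable from the part of result that lies in ri.
            gained = closure(result & set(ri), fds) & set(ri)
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result

abc_fds = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_preserved(("C", "A"), abc_fds, ["AB", "BC"]))  # True

# The City/Street/ZipCode decomposition loses {City, Street} -> {ZipCode}.
zip_fds = [(("City", "Street"), ("ZipCode",)), (("ZipCode",), ("City",))]
zip_parts = [("ZipCode", "Street"), ("ZipCode", "City")]
print(is_preserved(zip_fds[0], zip_fds, zip_parts))  # False
```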

Multivalued Dependencies
• Suppose that we have a relation with
attributes course, teacher, and book, which we
denote as CTB.
• The meaning of a tuple is that teacher T can
teach course C, and book B is a
recommended text for the course.
• There are no FDs; the key is CTB.
• However, the recommended texts for a course
are independent of the instructor.
• The instance shown in Figure illustrates this
situation.

Course      Teacher  Book
Physics101  Green    Mechanics
Physics101  Green    Optics
Physics101  Brown    Mechanics
Physics101  Brown    Optics
Math301     Green    Mechanics
Math301     Green    Vectors
Math301     Green    Geometry

Figure: Instance of CTB

• The schema is in BCNF.
• There is redundancy in the schema.
• The fact that Green can teach Physics101 is recorded
once per recommended text for the course.
• Similarly, the fact that Optics is a text for
Physics101 is recorded once per potential teacher.
• The redundancy can be eliminated by
decomposing CTB into CT & CB.
• The redundancy in this example is due to the
constraint that the texts for a course are independent
of the instructors, which cannot be expressed in
terms of FDs.
• This constraint is an example of a Multivalued
Dependency or MVD.

• Let R be a relation schema and let X and Y
be subsets of the attributes of R. Intuitively,
the Multivalued Dependency X →→ Y is
said to hold over R if, in every legal instance
r of R, each X value is associated with a set
of Y values and this set is independent of the
values in the other attributes.
• Formally, if the MVD X →→ Y holds over R
and Z = R - XY, the following must be true
for every legal instance r of R:
If t1 ∈ r, t2 ∈ r and t1.X = t2.X,
then there must be some t3 ∈ r such that
t1.XY = t3.XY and t2.Z = t3.Z.

• If we are given the first two tuples and told that
the MVD X →→ Y holds over this relation,
we can infer that the relation instance must
also contain the third tuple. Interchanging the
roles of the first two tuples shows that it must
contain the fourth tuple as well.

X  Y   Z
A  B1  C1
A  B2  C2
A  B1  C2
A  B2  C1
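
The formal MVD condition can be checked on an instance with a short Python sketch. It assumes tuples are represented as dictionaries; the function name satisfies_mvd is hypothetical.

```python
def satisfies_mvd(rows, x, y, attrs):
    """Check X ->> Y on rows (a list of dicts) over the attribute list attrs."""
    xy = list(x) + [a for a in y if a not in x]
    z = [a for a in attrs if a not in xy]

    def proj(t, cols):
        return tuple(t[c] for c in cols)

    present = {(proj(t, xy), proj(t, z)) for t in rows}
    # For every t1, t2 agreeing on X, some t3 with
    # t3.XY = t1.XY and t3.Z = t2.Z must be in the instance.
    return all((proj(t1, xy), proj(t2, z)) in present
               for t1 in rows for t2 in rows
               if proj(t1, x) == proj(t2, x))

attrs = ["X", "Y", "Z"]
first_two = [
    {"X": "A", "Y": "B1", "Z": "C1"},
    {"X": "A", "Y": "B2", "Z": "C2"},
]
all_four = first_two + [
    {"X": "A", "Y": "B1", "Z": "C2"},
    {"X": "A", "Y": "B2", "Z": "C1"},
]

print(satisfies_mvd(first_two, ["X"], ["Y"], attrs))  # False
print(satisfies_mvd(all_four, ["X"], ["Y"], attrs))   # True
```

With only the first two tuples the check fails, because the tuple combining B1 with C2 is missing; with all four tuples it succeeds.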

Fourth Normal Form
• Fourth Normal Form (4NF) is a direct
generalization of BCNF. Let R be a relation
schema, X and Y be nonempty subsets of
the attributes of R, and F be a set of
dependencies that includes both FDs and
MVDs. R is said to be in Fourth Normal Form
(4NF) if, for every MVD X →→ Y that holds
over R, one of the following statements is
true:
• Y ⊆ X or XY = R, or
• X is a superkey.

• The relation CTB is not in 4NF because
C →→ T is a nontrivial MVD and C is not a
key.
• We can eliminate the resulting redundancy
by decomposing CTB into CT and CB; each
of these relations is then in 4NF.
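
The decomposition of CTB into CT and CB can be tried out on the instance of CTB given earlier. This is a minimal Python sketch, assuming relations are modelled as sets of tuples; the variable names are illustrative only.

```python
# Instance of CTB: (Course, Teacher, Book)
ctb = {
    ("Physics101", "Green", "Mechanics"),
    ("Physics101", "Green", "Optics"),
    ("Physics101", "Brown", "Mechanics"),
    ("Physics101", "Brown", "Optics"),
    ("Math301", "Green", "Mechanics"),
    ("Math301", "Green", "Vectors"),
    ("Math301", "Green", "Geometry"),
}

# Project onto CT and CB; duplicates collapse in a set.
ct = {(course, teacher) for course, teacher, _book in ctb}
cb = {(course, book) for course, _teacher, book in ctb}

# Natural join on Course rebuilds the original instance.
rejoined = {(course, teacher, book)
            for course, teacher in ct
            for course2, book in cb
            if course == course2}

print(len(ct), len(cb))  # 3 5
print(rejoined == ctb)   # True
```

Each teacher and each text is now recorded once per course, and the join on Course recovers the original seven tuples.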