
Decision Support Systems 42 (2006) 859 – 878
www.elsevier.com/locate/dsw

Comprehensive data warehouse exploration with qualified
association-rule mining
Nenad Jukić a,*, Svetlozar Nestorov b,1
a School of Business Administration, Loyola University Chicago, 820 N. Michigan Avenue, Chicago, IL 60640, USA
b Department of Computer Science, The University of Chicago, Chicago, IL, USA
Received 10 May 2004; received in revised form 9 July 2005; accepted 11 July 2005
Available online 11 August 2005

Abstract
Data warehouses store data that explicitly and implicitly reflect customer patterns and trends, financial and business
practices, strategies, know-how, and other valuable managerial information. In this paper, we suggest a novel way of
acquiring more knowledge from corporate data warehouses. Association-rule mining, which captures co-occurrence patterns
within data, has attracted considerable efforts from data warehousing researchers and practitioners alike. In this paper, we
present a new data-mining method called qualified association rules. Qualified association rules capture correlations across the
entire data warehouse, not just over an extracted and transformed portion of the data that is required when a standard data-mining tool is used.

© 2005 Elsevier B.V. All rights reserved.
Keywords: Data warehouse; Data mining; Association rules; Dimensional model; Database systems; Knowledge discovery

1. Introduction
Data mining is defined as a process whose objective is to identify valid, novel, potentially useful and
understandable correlations and patterns in existing
data, using a broad spectrum of formalisms and techniques [9,23].

* Corresponding author. Tel.: +1 312 915 6662. E-mail addresses: njukic@luc.edu (N. Jukić), evtimov@cs.uchicago.edu (S. Nestorov).
1 Tel.: +1 773 702 3497.
0167-9236/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2005.07.009

Mining transactional (operational) databases, containing data related to current day-to-day organizational activities, could be of limited use in certain situations. However, the most appropriate and fertile source of data for meaningful and effective

data mining is the corporate data warehouse, which
contains all the information from the operational data
sources that has analytical value. This information is
integrated from multiple operational (and external)
sources, it usually reflects substantially longer history
than the data in operational sources, and it is structured specifically for analytical purposes.
The data stored in the data warehouse captures
many different aspects of the business process across


various functional areas such as manufacturing, distribution, sales, and marketing. This data explicitly
and implicitly reflects customer patterns and trends,
business practices, organizational strategies, financial
conditions, know-how, and other knowledge of potentially great value to the organization.
Unfortunately, many organizations often underutilize their already constructed data warehouses [12,13].
While some information and facts can be gleaned
from the data warehouse directly, through the utilization of standard on-line analytical processing (OLAP),

much more remains hidden as implicit patterns and
trends. The standard OLAP tools have been performing their primary reporting function well, where the criteria for aggregating and presenting data are specified explicitly and ahead of time. However, it is the
discovery of information based on implicit and previously unknown patterns that often yields important
insights into the business and its customers, and may
lead to unlocking hidden potential of already collected
information. Such discoveries require utilization of
data mining methods.
One of the most important and successful data
mining methods for finding new patterns and correlations is association-rule mining. Typically, if an organization wants to employ association-rule mining on its data warehouse data, it has to use a separate data-mining tool. Before the analysis can be performed,
the data must be retrieved from the database repository that stores the data warehouse, transformed to fit
the requirements of the data-mining tool, and then
stored into a separate repository. This is often a cumbersome and time-consuming process. In this paper
we describe a direct approach to association-rule data
mining within data warehouses that utilizes the query
processing power of the data warehouse itself without
using a separate data mining tool.
In addition, our new approach is designed to answer
a variety of questions based on the entire set of data

stored in the data warehouse, in contrast to the regular
association-rule methods which are more suited for
mining selected portions of the data warehouse. As
we will show, the answers facilitated by our approach
have a potential to greatly improve the insight and the
actionability of the discovered knowledge.
This paper is organized as follows: in Section 2 we
describe the concept of association-rule data mining
and give an overview of the current limitations of

association-rule data mining practices for data warehouses. Section 3 is the focal point of the paper. In it
we first introduce and define the concept of qualified
association rules. In Section 3.1 we discuss how qualified association rules broaden the scope and actionability of the discovered knowledge. In Section 3.2 we describe why existing methods cannot be feasibly used to find qualified association rules. In Sections 3.3, 3.4 and 3.5 we offer the details of our own method for finding qualified association rules. In Section 4 we describe an illustrative experimental performance study of mining real-world data that uses the new method we introduced.

And finally, in Section 5 we offer conclusions.

2. Association-rule data mining in data warehouses
The standard association-rule mining [1,2] discovers correlations among items within transactions.
The prototypical example of utilizing association-rule
mining is determining what products are found
together in a basket at a checkout line at the supermarket; hence the often-used term: market basket
analysis [4]. The correlations are expressed in the
following form:
Transactions that contain X are likely to contain Y as well, denoted as X → Y, where X and Y represent sets
of transaction items. There are two important quantities measured for every association rule: support and
confidence. The support is the fraction of transactions
that contain both X and Y items. The confidence is the
fraction of transactions containing items X, which also
contain items Y. Intuitively, the support measures the
significance of the rule, so we are interested in rules
with relatively high support. The confidence measures
the strength of the correlation, so rules with low
confidence are not meaningful, even if their support

is high. A rule in this context is the relationship
among transaction items with enough support and
confidence.
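To make the two measures concrete, the following is a minimal sketch (not from the paper) that computes the support and confidence of a candidate rule X → Y directly in SQL, assuming a simple Transactions(tid, item) table and placeholder item names 'X' and 'Y':

-- Support: fraction of all transactions containing both X and Y.
-- Confidence: fraction of the transactions containing X that also contain Y.
SELECT
  CAST(both_cnt AS FLOAT) / total_cnt AS support,
  CAST(both_cnt AS FLOAT) / x_cnt     AS confidence
FROM
  (SELECT COUNT(DISTINCT tid) AS total_cnt FROM Transactions) t,
  (SELECT COUNT(DISTINCT tid) AS x_cnt
     FROM Transactions WHERE item = 'X') x,
  (SELECT COUNT(*) AS both_cnt
     FROM (SELECT tid FROM Transactions WHERE item = 'X'
           INTERSECT
           SELECT tid FROM Transactions WHERE item = 'Y') b) bt;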
The standard association rule mining process
employs the basic a-priori algorithm [1,3] for finding
sets of items with high support (often called frequent
itemsets). The crux of the a-priori approach is the
observation that a set of items I can be frequent
only if all proper subsets sub(I) are also frequent in
recorded transactions. Based on this observation, a-priori applies step-wise pruning to the sets of items


that need to be counted, starting by eliminating single
items that are infrequent. For example, if we are
looking for all sets of items (pairs, triples, etc.)
that appear together in at least s transactions, we
can start by finding those items that by themselves

appear in at least s transactions. All items that do not
appear on their own in at least s transactions are
pruned out.
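As a sketch of this first pruning step only (again assuming the hypothetical Transactions(tid, item) table, with :s as the support-count threshold), the single items that survive the first a-priori pass can be found with one aggregate query:

-- First a-priori pass: keep only items that appear in at least :s transactions on their own.
SELECT item, COUNT(DISTINCT tid) AS item_count
FROM Transactions
GROUP BY item
HAVING COUNT(DISTINCT tid) >= :s;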
In practice, the discovery of association rules has
two phases. In the first phase, all frequent itemsets are
discovered using the a-priori algorithm. In the second
phase, the association rules among the frequent itemsets with high confidence are constructed. Since the
computational cost of the first phase dominates the
total computational cost [1], association-rule mining is
often defined as the following question pertaining to
transactional data, where each transaction contains a
set of items:
What items appear frequently together in transactions?
Association-rule data mining has drawn a considerable amount of attention from researchers in the last
decade. Much of the new work focuses on expanding
the extent of association-rule mining, such as mining
generalized rules (i.e. when transaction items belong
to certain types or groups, generalized rules find
associations among such types or groups) [14,24];
mining correlations and causal structures, which

finds generalized rules based on implicit correlations
among items, while taking into consideration both the
presence and the absence of an item in a transaction
[6,22]; finding association rules for numeric attributes, where association rules, in addition to Boolean conditions (i.e. an item being present in the transaction), may consider a numeric condition (e.g. an item that has a numeric value must have a value within a certain range) [10,11]; finding associations among items
occurring in separate transactions [17,19,27]; etc.
In this paper we introduce qualified association
rules as a way of extending the scope of association
rule mining. Our method extends the scope of association rule mining to include multiple database relations (tables), in a way that was not previously
feasible. This approach is designed to provide a real
addition to the value of collected organizational information and is applicable in a host of real world
situations, as we will illustrate throughout this paper.
Many of the previously proposed association-rule
extensions are either applicable to a relatively narrow

set of problems, or represent a purely theoretical
advance. In addition, they also often require computational resources that may be unrealistic for most of the

potential organizational users. In contrast, our proposed method is both broadly applicable and highly
practical from the implementation and utilization
points of view.
Dimensional modeling [16], which is the most
prevalent technique for modeling data warehouses,
organizes tables into fact tables, containing basic
quantitative measurements of an organizational activity under consideration, and dimension tables that
provide descriptions of the facts being stored. Dimension tables and their attributes are chosen for their
ability to contribute to the analysis of the facts being
stored in the fact tables.
The data model that is produced by the dimensional modeling method is known as a star-schema
[7] (or a star-schema extension such as snowflake or
constellation). Fig. 1 shows a simple star-schema
model of a data warehouse for a retail company.
The fact table contains the sale figures for each sale
transaction and the foreign keys that connect it to the
four dimensions: Product, Customer, Location, and
Calendar.
Consider the data warehouse depicted by Fig. 1.
The standard association-rule mining question for this

environment would be:
What products are frequently bought together?
This question examines the fact table as it relates to
the product dimension only. A typical data warehouse,
however, has multiple dimensions, which are ignored

Fig. 1. Example retail company star-schema: the SALES FACT TABLE (ProductID, LocationID, CalendarID, CustomerID, TransactID) connected to the PRODUCT dimension (ProductID, Product, Group, Subcategory, Category), the CUSTOMER dimension (CustomerID, Name, Gender, Zip, City, State), the LOCATION dimension (LocationID, Store, Region, Territory), and the CALENDAR dimension (CalendarID, Day, Week, Month, Season, Year).


by the above single-dimension question, as illustrated
by Fig. 2.
For example, an analyst at the corporate headquarters may ask the following question:
What products are frequently bought together in a
particular region and during a particular month?
This question requires examination of multiple
dimensions of the fact table. The standard association-rule mining approach would not find the answer to
this question directly, because it only explores the fact
table with its relationship to the dimension table that
contains transaction-items (in this case Product),
while the other dimension tables (the non-item dimensions) are not considered. In fact, by applying standard algorithms directly, we may not discover any
association rules in situations when there are several
meaningful associations that involve multiple dimensions. The following example illustrates an extreme
case scenario of this situation.
Example 1. A major retail chain operates stores in
two regions: South and North; and divides the year
into two seasons: Winter and Summer. The retail
chain sells hundreds of different products including
sleeping bags and tents. Table 1 shows the number of transactions for each region in each season, as well as the number of transactions (in thousands) that involved a sleeping bag, a tent, or both.
Suppose we are looking for association rules with
0.5% support and 50% confidence. Let’s first consider
the transactions in all regions and all seasons. There
are 500 thousand transactions and only 10 thousand of
them involve both a sleeping bag and a tent, so the
Fig. 2. Association-rule data mining scope for the example retail company star schema.

Table 1
Transaction statistics (in thousands) for Example 1

              Total   Bags   Tents   Both
North
  Winter        150      8       2      1
  Summer        100      4       7      3
South
  Winter        100      4       8      3
  Summer        150      9       6      3

support of the pair of items is 2% (greater than 0.5%). There are 25 thousand transactions that involve sleeping bags and 23 thousand transactions that involve tents. Therefore the confidence of sleeping bag → tent is 40% and the confidence of tent → sleeping bag is 43%. Thus, no association rule involving sleeping bags and tents will be discovered. Let's consider the transactions in the North region separately. In the North, sleeping bags appear in 12 thousand transactions, tents appear in 9 thousand transactions, and both sleeping bags and tents appear together in 4 thousand transactions. The support (see footnote 2) for both items is 0.8% (greater than 0.5%), but the confidences for the rules involving sleeping bags and tents are 33% and 44% (both less than 50%). Thus, no association rule involving bags and tents will be discovered. Similarly, no rules involving both sleeping bag and tent will be discovered in the South, and no associations will be discovered between sleeping bags and tents when we consider all the transactions in each season (Summer and Winter) separately.
These conclusions (no association rules) are rather
surprising because if each combination of region and
season were considered separately the following
association rules would be discovered:
- sleeping bag → tent; in the North during the Summer (sup = 0.60%, conf = 75%)
- sleeping bag → tent; in the South during the Winter (sup = 0.60%, conf = 75%)
- tent → sleeping bag; in the South during the Summer (sup = 0.60%, conf = 50%)

2 In dimension-specific computations, the support within dimensions (e.g. region or season or region and season) is calculated relative to the entire set of transactions, in order to eliminate rules of little overall significance.
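The region- and season-specific counts used in this example can be derived directly from the star-schema of Fig. 1. The query below is only an illustrative sketch: the column names follow Fig. 1, the fact table is assumed to be named SALES_FACT, and the product names 'Sleeping bag' and 'Tent' are placeholders.

-- Sketch: for each region and season, count the transactions containing both items.
SELECT l.Region, c.Season, COUNT(DISTINCT f1.TransactID) AS both_count
FROM SALES_FACT f1
JOIN SALES_FACT f2 ON f1.TransactID = f2.TransactID
JOIN PRODUCT p1  ON f1.ProductID  = p1.ProductID AND p1.Product = 'Sleeping bag'
JOIN PRODUCT p2  ON f2.ProductID  = p2.ProductID AND p2.Product = 'Tent'
JOIN LOCATION l  ON f1.LocationID = l.LocationID
JOIN CALENDAR c  ON f1.CalendarID = c.CalendarID
GROUP BY l.Region, c.Season;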


Standard association rule mining cannot discover
these association rules directly, because it does not
account for non-item dimensions.
In addition to the possibility of suppressing some
of the existing relevant rules (as illustrated by Example 1), another type of problem can arise when multiple dimensions are not considered: an overabundance
of discovered rules may hide a smaller number of
meaningful rules. Example 2 illustrates an extreme
case scenario of this situation.
Example 2. Another major retail chain also operates
stores in two regions: South and North; and divides
the year into two seasons: Winter and Summer. This
retail chain sells hundreds of different products
including Product A, Product B, and Product C.
Table 2 shows the number of transactions for each
region in each season; the number of transactions
that involved Product A, Product B, Product C; as
well as the number of transactions that involved
Product A and Product B together, Product A and
Product C together, and Product B and Product C
together.
Suppose we are looking for association rules with
0.50% support and 50% confidence. If we consider
the transactions in all regions and all seasons we will
discover that every possible association rule involving two items satisfies the threshold, as shown
below:
- Product A → Product B (sup = 1.00%, conf = 55%)
- Product B → Product A (sup = 1.00%, conf = 67%)
- Product A → Product C (sup = 1.00%, conf = 55%)
- Product C → Product A (sup = 1.00%, conf = 50%)
- Product B → Product C (sup = 1.00%, conf = 67%)
- Product C → Product B (sup = 1.00%, conf = 50%).

Table 2
Transaction statistics (in thousands) for Example 2

              Total    A    B    C    A and B    A and C    B and C
North
  Winter        200    2    3    3          2          2          2
  Summer        100    1    2    4          1          0          2
South
  Winter        100    1    1    1          1          1          1
  Summer        200    7    3    4          2          3          1

In this example, any combination of two of the
three products is correlated. Thus, it is difficult to
make any conclusions, reach any decisions, and take
any actions. This problem occurs when a mining
process produces a very large number of rules, thus
proving to be of little value to the user. However, if we consider each region and season separately we will find only one rule within the given support and confidence limits:

- Product C → Product A; in the South during the Summer (sup = 0.50% (see footnote 3), conf = 75%).

Isolating this rule may provide valuable additional information to the user. The above two examples illustrate two types of situations in which association rules that consider both the item-related dimension (e.g. Product) and non-item-related dimensions (e.g. Region, Season) add new insight into the nature of the data for the user. Section 3 describes the method that
enables such discoveries.
2.1. Related work
Recent work on mining the data warehouse has
explored several approaches from different perspectives. Here we give a brief overview.
Both Refs. [15,21] extend OLAP techniques by
finding patterns in the form of summaries or trends
among certain cells of the data cube. While both
papers examine multiple dimensions of the data warehouse, neither discovers correlations among multiple
values of the same dimension. For example, the techniques detailed in Ref. [15] can discover that bthe
average number of sales for all camping products
combined (tents and bags) is the same during the
Winter in the South and during the Summer in the
North.Q However, their framework does not handle
association rules involving multiple products bought
in the same transactions, e.g. bnumber of transactions
3

Recall again that in dimension-specific computations, the support within dimensions is calculated relative to the entire set of
transactions, in order to eliminate rules of little overall significance.


involving both tents and bags is the same during the Summer in the North and for all seasons in the South". The framework presented in Ref. [21] can find summarized rules associated with a single item. For example, it can reduce the fact that "during the Winter the number of tents sold in the South is equal or greater than the number sold in the North" to the rule that "during the Winter the number of sales for all camping products in the South is equal or greater than the number of sales in the North", if such a reduction holds. However, rules about multiple products sold in the same transaction lie beyond this framework.
The focus of Ref. [25] is on finding partitions of
attribute values that result in interesting correlations.
In contrast to our work, quantitative rules involve at
most one value or one value range for each attribute.
For example, "transactions involving a tent in the North are likely to occur during the Summer" is a quantified rule, whereas "transactions involving a tent in the North are likely to involve a bag" is a qualified rule that falls outside of the scope of quantified rules because it involves two values of the
Product dimension.
Generalized association rules [24] and multiple-level association rules [8] capture correlations among attribute values, possibly at different levels, within the same dimensional hierarchy but not across multiple dimensions. For example, "transactions involving sporting goods tend to involve sunscreen" is within the framework of generalized rules and multi-level rules, while the qualified rule "transactions involving sporting goods tend to involve sunscreen in the summer" is beyond their scope.
A precursor to qualified association rules is discussed in Ref. [18] with the focus on dealing with
aggregate rather than transactional data.

3. Qualified association rules
Standard association rules express correlations
between values of a single dimension of the star
schema. However, as illustrated by Examples 1 and
2, values of the other dimensions may also hold
important correlations. For example, some associations become evident only when multiple dimensions
are involved. In Example 1, sleeping bags and tents
appear uncorrelated in the sales data as a whole, or

even when the data is considered separately by region
or season. Yet, several association rules are discovered
if the focus is on the data for a particular region during
a particular season. One such rule is "sleeping bag → tent" for the transactions that occurred in the North
region during the Summer season. This association
rule involves more than just a pair of products; it also
involves the location as well as the time period of the
transaction. In order to capture this additional information, we can augment the association rule notation
as illustrated by the following qualified association
rule:
sleeping bag → tent [region = North, season = Summer].
We define a qualified association rule as follows: Let T be a data warehouse where X and Y are sets of values of the item dimension. Let Q be a set of equalities assigning values to attributes of other (non-item) dimensions of T. Then, X → Y [Q] is a qualified association rule with the following interpretation: Transactions that satisfy Q and contain X are likely to contain Y.
For a more detailed definition consider a star schema with d dimensions. Let the fact table be F and the dimensions be D_0, …, D_d. The schema of dimension table D_i consists of a key k_i and other attributes. The schema of the fact table F contains the keys for all dimensions, k_0, …, k_d, and a transaction id attribute tid.
Without loss of generality, we can assume that the item dimension is D_0, so the set of all items is domain(k_0). For all other dimensions D_j we can assert that the functional dependency tid → k_j holds for F (i.e. each value of tid in the fact table determines one and only one value of k_j in the non-item dimension table D_j).
We define a qualifier to be a set of equalities of the
form: attr = value where attr is a non-key attribute of
any of the non-item dimensions and value is a constant. For example, {region = North} is a qualifier and
so is {season = Summer; region = South}.
We define a valid qualifier to be a qualifier that
contains at most one equality for each non-item
dimension. Examples of valid qualifiers are {season = Summer; region = South} and {region = South;
gender = Male}; while examples of invalid qualifiers
are: {region = North; region = South} and {season =
Summer; month = July}. In this paper, we only consider valid qualifiers, and for brevity, refer to them as qualifiers.
Now, we define a qualified association rule:

X → Y [Q]

where X and Y are itemsets (X ⊆ domain(k_0); Y ⊆ domain(k_0)) and Q is a qualifier. For example,

sleeping bag → tent [region = North, season = Summer]

is a qualified association rule. The intuitive interpretation of this rule is that if a transaction includes a sleeping bag in the North region during the Summer season,
then the transaction is likely to also include a tent.
Before defining the support and confidence of a
qualified association rule, we need to define the support of a qualified itemset. Let I be an itemset and Q a
qualifier. Then, I[Q] is a qualified itemset. For example, {sleeping bag, tent} [region = North, season =
Summer] is a qualified itemset. The s _count of a
qualified itemset I[Q], denoted as s_count(I[Q]), is
the number of transactions that contain the itemset and
satisfy all equalities in Q.
Formally, support is defined as follows. Let I = {e_1, …, e_m} and Q = {a_1 = v_1, …, a_p = v_p}. Without loss of generality we can assume that a_i is an attribute of D_i. Then, s_count(I[Q]) is equal to the size of the following set:

S = { t | ∃ f_i ∈ F, i = 1..m, and ∃ d_j ∈ D_j, j = 1..p, such that f_i[tid] = t, f_i[k_0] = e_i, f_i[k_j] = d_j[k_j], and d_j[a_j] = v_j }.

The set S consists of all transaction IDs t, for which there are m tuples (i.e. records or table rows) in F that satisfy the following conditions:

- all m tuples have t as their tid;
- all m tuples agree on all dimension key attributes except k_0;
- for every attribute k_j, its single value in all m tuples is the value of attribute k_j of a tuple in dimension table D_j whose value of attribute a_j is v_j.

These conditions are illustrated in Fig. 3. For example, for the qualified itemset I = {bag, tent}[Q], the set S would consist of all transaction IDs representing transactions that satisfy the qualifying condition Q, for which there are two tuples (one for bag, one for tent) in the fact table that agree on all attributes except for k_0 (attribute k_0 points to bag for one tuple and to tent for the other). In other words, set S would capture the transaction IDs of all transactions (1) where bag and tent were sold together and (2) which satisfy the qualifying condition (e.g. within a particular region and season).

Fig. 3. Definition of support for qualified itemsets.

Now, we can define the support and confidence of a qualified association rule as follows:

support(X → Y [Q]) = s_count((X ∪ Y)[Q]) / s_count(∅[∅])

confidence(X → Y [Q]) = s_count((X ∪ Y)[Q]) / s_count(X[Q])

Note that s_count(∅[∅]) is simply the number of transactions in the fact table F. The support and confidence of the rules given in the preceding examples were computed using these definitions.
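As an illustrative sketch of these definitions (not the method the paper develops later), the support and confidence of the rule sleeping bag → tent [region = North, season = Summer] could be computed over the Fig. 1 schema as follows; the fact-table name SALES_FACT, the product names, and the use of a WITH clause are assumptions.

-- s_count of X[Q]: transactions in the North during the Summer that contain a sleeping bag.
WITH qual_x AS (
  SELECT DISTINCT f.TransactID
  FROM SALES_FACT f
  JOIN PRODUCT  p ON f.ProductID  = p.ProductID  AND p.Product = 'Sleeping bag'
  JOIN LOCATION l ON f.LocationID = l.LocationID AND l.Region  = 'North'
  JOIN CALENDAR c ON f.CalendarID = c.CalendarID AND c.Season  = 'Summer'
),
-- s_count of (X union Y)[Q]: those transactions that also contain a tent.
qual_xy AS (
  SELECT DISTINCT f.TransactID
  FROM SALES_FACT f
  JOIN PRODUCT p ON f.ProductID = p.ProductID AND p.Product = 'Tent'
  WHERE f.TransactID IN (SELECT TransactID FROM qual_x)
)
SELECT
  CAST((SELECT COUNT(*) FROM qual_xy) AS FLOAT)
      / (SELECT COUNT(DISTINCT TransactID) FROM SALES_FACT) AS support,
  CAST((SELECT COUNT(*) FROM qual_xy) AS FLOAT)
      / (SELECT COUNT(*) FROM qual_x)                       AS confidence;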

3.1. Discussion

While qualified association rules are natural extensions of standard association rules, they contain more
detailed and diverse information, which offers several distinct advantages.
First, the phenomenon that brings a qualified rule into existence can be explained more easily. In Example 2,
standard association rule mining found the correlations between products A, B, and C within the specified confidence and support threshold for all sales.


However, only the qualified rule would give the
explanation that, once regions and seasons are considered separately, the correlation of sales between
products A, B, and C holds exclusively for products
C and A in warm weather conditions (because the
found qualified rule contains South region and Summer season).
Second, qualified rules tend to be local, in the sense of location or time, and thus more actionable. For example, it is much simpler to devise a marketing strategy for
sleeping bags and tents that applies to a particular
region during a particular season than a general strategy
that applies to all regions during all seasons.
Third, for the same support threshold, the number
of discovered qualified association rules is likely to be
much smaller than the number of standard association
rules. The reason is that qualified rules involve more
attributes and thus transactions that support them fit a
more detailed pattern than standard rules. Even though
the overall number of qualified rules is likely to be
much less than the number of standard rules, there may
be some items that are involved only in qualified rules
(as shown in Example 1). Intuitively, standard rules
show associations of a coarser granularity while qualified rules show finer, more subtle associations.
Before we continue the discussion of qualified
association rules, let us recall the process of standard
association-rule mining. Given a set of transactions,
and support and confidence thresholds (based on
initial beliefs and intuition of the analyst), the process
discovers association rules among items that have
support and confidence higher than the chosen thresholds. With this setting, the only parameters that the
analyst controls directly are the support and confidence thresholds. Very often the result of this mining
process is one of two extremes. The number of discovered rules is either overwhelmingly large (tens of
thousands) or very small (dozens) [28]. In both cases,
the user has no other recourse but to revise the support
and confidence and run the mining process again. As
a post-processing step, the analyst may decide to
focus on certain items, but that goes against the notion
of mining and will be very inefficient; in fact, given
an unlimited amount of time, the user may very well
compute the support and confidence of the items
involved in association rules directly using OLAP.
Since the mining process often takes hours [20],
most users are not likely to continually refine the

thresholds. Thus, the mining process is often seen as
not particularly useful to non-technical users.
In contrast, the process of mining qualified association rules requires more input from the user and
thus affords more control and lends itself to customization. Since qualified association rules involve attributes of non-item dimensions, the user needs to
specify these attributes. For example, the user can
ask for association rules that involve a region and a
season in addition to products. Note that the user does
not need to specify the particular values of the attributes. The actual values are discovered as part of the
mining. This mining process differs from the one-size-fits-all approach for standard association rules. Different users can ask mining questions involving different
attributes of the dimensional hierarchies. For example,
a marketing executive who is interested in setting up a
targeted marketing campaign can ask the following:
Find products bought frequently together by customers of a particular gender in a particular region.
A sample association rule discovered by this question may be: ice cream Y cereal (region = California,
gender = female). Discovering that ice cream and cereal sell well together in California with female customers can facilitate creating a more focused and
effective marketing campaign that is at the same
time less costly (e.g. ads for ice cream and cereal
can appear on consecutive pages of a local femaletargeted magazine).
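As an illustrative sketch of how such a question maps onto the Fig. 1 star-schema (this is not the paper's algorithm; the fact table is assumed to be named SALES_FACT and :minsup is an assumed minimum support count), the executive's question corresponds to counting product pairs per gender and region:

-- Sketch: pairs of products bought together, qualified by customer gender and region.
SELECT p1.Product AS product_1, p2.Product AS product_2, cu.Gender, l.Region,
       COUNT(DISTINCT f1.TransactID) AS support_count
FROM SALES_FACT f1
JOIN SALES_FACT f2 ON f1.TransactID = f2.TransactID AND f1.ProductID < f2.ProductID
JOIN PRODUCT p1  ON f1.ProductID  = p1.ProductID
JOIN PRODUCT p2  ON f2.ProductID  = p2.ProductID
JOIN CUSTOMER cu ON f1.CustomerID = cu.CustomerID
JOIN LOCATION l  ON f1.LocationID = l.LocationID
GROUP BY p1.Product, p2.Product, cu.Gender, l.Region
HAVING COUNT(DISTINCT f1.TransactID) >= :minsup;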
On the other hand, a vice president in charge of
distribution may ask the following:
Find products bought frequently together in a particular city during a particular month.
A sample association rule discovered by this question may be: sleeping bag → tent (city = Oakland, month = May). Such a rule is of more value for managing and planning the supply chain than the equivalent standard association rule, since it clearly identifies the location and the time period (e.g. it can indicate that the shipping of sleeping bags and tents to certain stores should be synchronized, especially during certain time periods).
3.2. Mappings to standard association rules
Let us consider several ways in which we can try to
map qualified rules to standard rules and attempt to
mine for them in data warehouses by applying stan-


dard association-rule mining algorithms. In other
words, let us examine if there is a feasible way to
find qualified association rules in data warehouses by
using currently existing methods (i.e. standard association rules mining) included with the standard
mining tools.
The first mapping is based on considering each
value of a qualifier as just another item and adding
this item to the transaction. Thus, the problem of
finding association rules with qualified attributes a i
of dimension D i for i = 1. . .k can be converted to the
problem of finding standard association rules by adding to each transaction k new items: the values of a i .
The new set of all items becomes domain(k_0) ∪ (∪_{i=1..k} domain(a_i)). For a concrete example, consider a transaction that involved a sleeping bag and a tent and happened in the North region during the Summer season. After the mapping, the "new" transaction will contain four items: sleeping bag, tent, North, and Summer.
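A minimal sketch of this mapping on the Fig. 1 schema (fact table assumed to be named SALES_FACT) simply unions the real items with the Region and Season values of each transaction, producing (tid, item) pairs that a standard association-rule miner could consume:

-- Sketch of the first mapping: region and season values become extra "items".
SELECT f.TransactID AS tid, p.Product AS item
FROM SALES_FACT f JOIN PRODUCT p ON f.ProductID = p.ProductID
UNION
SELECT f.TransactID, l.Region
FROM SALES_FACT f JOIN LOCATION l ON f.LocationID = l.LocationID
UNION
SELECT f.TransactID, c.Season
FROM SALES_FACT f JOIN CALENDAR c ON f.CalendarID = c.CalendarID;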
There are several problems with this approach.
Essentially, the process will compute all standard
association rules for the given minimum support, all
qualified rules, and other rules that are qualified by
only a subset of the original qualifier. Thus, the
majority of the found rules will not involve all qualified dimensions and must be filtered out at the end of
the process. Furthermore, when mining qualified rules, the range of "appropriate" minimum support values will be lower than the range for mining standard rules over the same data. By appropriate, we mean that the number of discovered rules is not too large and not too small. Consequently, a minimum support value that is "appropriate" for qualified rules may not be "appropriate" for standard rules. For such values, this approach will be very inefficient, since it generates all standard rules along with the qualified rules, while using the same support values. Using the
same support threshold will make the number of
discovered standard rules many times more than the
number of qualified rules (as will be illustrated later in
Section 4 that shows the results of experiments).
Another shortcoming of this approach is that a mining
algorithm will make no distinction between attribute values of the item dimension and those of the non-item dimensions.
Thus, the algorithm may try to combine incompatible
items, e.g. North and South, as part of the candidate
generation process. Recall the basic a-priori algorithm


for finding frequent itemsets, employed by standard
association rule mining. The first step of the a-priori
algorithm, when applied to the first mapping, will find
all frequent itemsets of size 1. It is likely that most
dimension values will turn up frequent. The second
step of a-priori, which is usually the most computationally intensive step, will generate a candidate set of pairs, C2 = L1 × L1, whose support will be counted (where L1 represents all frequent itemsets of size 1). This candidate set C2 will include many impossible pairs formed by matching two values of the same dimension attribute (e.g. {Summer; Winter}).
These shortcomings can be addressed by handling
items and dimension attributes differently, which is
in fact the very idea that motivates qualified association rules.
The second mapping of qualified to standard rules
is based on expanding the definition of an item to be a
combination of an item and qualified attributes. For
example, (tent, South, Winter) will be an item different than the item (tent, South, Summer). The main
problem with this approach is that the number of items
is increased dramatically. The number of items will
grow to:

|domain(k_0)| · ∏_{i=1..k} |domain(a_i)|.

The growth can be many orders of magnitude
greater than the original number of items, which can
make the process unmanageable. For example, suppose we consider the day_of_year and zip_code attributes of non-item dimensions. Then the number of items will increase about 365 × 42,000 ≈ 15 million times. Thus, if
the number of items was 100,000 the new number of
items will be 1.5 trillion. In addition to having to
consider an enormously inflated number of items,
mining algorithms will run into the same problem of
matching up incompatible items as described above.
The third mapping is based on partitioning the data
according to the values of the qualified attributes, and
finding standard association rules for each partition.
For example, the data from Example 1 would be
partitioned into 4 different sets, one for each of the
4 combinations of values for region and season. This
approach is in fact feasible if fact tables can be prejoined with all of their dimensions and then loaded
into a separate data mining tool, where the data can be
re-partitioned for each query. While ad-hoc re-parti-


tioning of the data on-demand (e.g. every time another
set of qualified association rules is mined), may in fact
be a feasible strategy for single fact table data-marts
that contain relatively small amounts of data, this
approach is simply not an option for most corporate
data warehouses where fact tables can contain hundreds of millions or billions of records occupying
terabytes of disk space. Even if we store separately
the results of joining each fact table with all their
related dimensions as a single table, the number of
partitions in each such table may become too large
and the startup cost of setting up each partition and
running an algorithm on it would dominate the total
cost. Furthermore, when this approach is applied
within the data warehouse (i.e. directly on the fact
and dimension tables), the run on each partition will
access the entire data (reading only the blocks that
contain data from the current partition will result in
many random access disk reads that are slower than
reading the entire data sequentially) unless the data
happens to be partitioned physically along the qualified dimensions.
As we described here, none of the discussed approaches that involve applying standard association-rule mining directly provides a practical and
efficient solution for discovering qualified association
rules in data warehouses. However, all three mappings emphasize the need for a new method that
treats items and non-item dimension attributes differently. The principle of treating items and non-item
dimension attributes differently is what underlies
our definition of qualified association rules. In the
remainder of this paper we present a framework and
methodology for finding qualified association rules
efficiently.
3.3. Framework—a-priori for qualified itemsets
For standard itemsets, the a-priori technique is
straightforward. For any frequent itemset I, all of its
subsets must also be frequent. If the size of I is n (i.e.
I consists of n items) then I has 2^n - 1 proper subsets.
Each of these subsets must be frequent. Conversely, if
any of these subsets is not frequent, then I cannot be
frequent. The situation is more complicated for qualified itemsets. Consider a qualified itemset I[ Q]. Let
the size of I be n and let the size of Q, which is the
number of qualified dimension attributes, be q. If

I[Q] is frequent then it follows directly from the definition of support for qualified itemsets that any qualified itemset I′[Q′] must also be frequent, where I′ ⊆ I and Q′ ⊆ Q. The number of such qualified itemsets is 2^{n+q} - 1. For example, suppose that, for a given data set, {bag, tent}[season = Summer, region = North] is frequent (frequent means more than w transactions, where w is the minimum support, e.g., w = 10,000). Then, there are 15 other qualified itemsets that must also be frequent. Let us examine four of them along with their interpretations:
! ∅[region = North] is frequent means that there
must be more than w transactions that occurred
in the North.
! {bag}[∅] is frequent means that there must be
more than w transactions involving a bag.
! {tent}[season = Summer, region = North] is frequent means that there must be more than w transactions involving a tent that occurred in the North
during the Summer.
! {bag, tent}[season = Summer] is frequent means
that there must be more than w transactions involving a bag and a tent that occurred during the
Summer.
Intuitively, this shows that a-priori leads to a more diverse set of statements for qualified itemsets than for standard itemsets. Next we formalize this intuition.
Let the set of all frequent standard itemsets of size k be L_k. According to a-priori, any subset of any itemset in L_k must also be frequent. Thus, any itemset of size l, l < k, that is a subset of an itemset in L_k must be in L_l. We shall say that L_k succeeds L_l, for l < k, and write L_k ≽ L_l. Note that ≽ is a total ordering of all L_i. This ordering dictates the order in which the L_i should be computed.
Consider qualified itemsets. Let L_k[A] be the set of all frequent itemsets of size k qualified in all dimension attributes in A. According to a-priori, any qualified itemset I′[Q′] that is a subset of a qualified itemset in L_k[A] must also be frequent. If the size of I′ is l, and the set of attributes in Q′ is A′, then I′[Q′] must belong to L_l[A′]. While in the case of standard itemsets there are only k such sets, for qualified itemsets there are (k + 1)·2^a - 1 such sets, where a is the size of A. We shall say that L_k[A] succeeds L_l[A′], and write L_k[A] ≽ L_l[A′], iff k ≥ l, A ⊇ A′, and (k, A) ≠ (l, A′).


The following example illustrates qualified itemsets.

Example 3. This example shows the values of all L_k[Q] ≼ L_2[season,region] for the data from Example 2 within the specified support threshold (0.5%, which is 3000 transactions). For brevity we denote qualified itemsets by concatenating items, e.g. AB instead of {A, B}, and replacing equalities by just the abbreviated attribute values, e.g. [Sm] instead of [season = Summer].

- L_0[∅] = L_0 = {∅}
- L_0[season] = {∅[Wi], ∅[Sm]}
- L_0[region] = {∅[N], ∅[S]}
- L_0[season,region] = {∅[Wi,N], ∅[Sm,N], ∅[Wi,S], ∅[Sm,S]}
- L_1[∅] = L_1 = {A, B, C}
- L_1[season] = {A[Wi], A[Sm], B[Wi], B[Sm], C[Wi], C[Sm]}
- L_1[region] = {A[N], A[S], B[N], B[S], C[N], C[S]}
- L_1[season,region] = {B[Wi,N], C[Wi,N], C[Sm,N], A[Sm,S], B[Sm,S], C[Sm,S]}
- L_2[∅] = L_2 = {AB, AC, BC}
- L_2[season] = {AB[Wi], AB[Sm], AC[Wi], AC[Sm], BC[Wi], BC[Sm]}
- L_2[region] = {AB[N], AB[S], AC[S], BC[N]}
- L_2[season,region] = {AC[Sm,S]}

Note that there is no total ordering of these sets; for example, L_1[season,region] and L_2[season] are incomparable (i.e. neither L_1[season,region] ≽ L_2[season] nor L_2[season] ≽ L_1[season,region] holds). All L_k[A] succeeded by L_2[season,region] are organized in the hierarchy shown in Fig. 4.

Since ≽ defines only a partial order over the L_k[A], there are many possible sequences in which they may be computed. Consider our running example of finding pairs of products bought frequently together in the same region and during the same season. One possibility is to first find all frequent products (L_1[∅]), then frequent products for a particular season (L_1[season]), then frequent products for a particular season and region (L_1[season,region]), and finally pairs of products for a particular season and region (L_2[season,region]). Another possibility is to first find frequent products for a particular region (L_1[region]), then frequent products for a particular season (L_1[season]), then frequent pairs of products (L_2[∅]), and finally pairs of products for a particular season and region (L_2[season,region]).

Fig. 4. Lattice of successors.


Note that there are no such alternatives when standard association rules are used. Optimization techniques based on a-priori are essentially rule-based since
the actual sizes of the initial data and intermediate
result do not influence the sequence of steps. In fact,
the sequence is always the same.
For qualified association rules, the sequence of
steps is not always the same because of different
possible alternatives as illustrated above. Intuitively,
the source of alternatives is the fact that non-item dimensions are not symmetric or comparable. Before we describe an algorithm for choosing an efficient sequence (in Section 3.4), we discuss the
individual steps and their interaction within a
sequence. The crux of our method (and of a-priori
for that matter) is to use the result of a simpler query
(which represents a portion of the entire query) to
speed up the computation of the entire query. For
example, if we are looking for pairs of items qualified
by season and region (L 2[season,region]), we can eliminate all tuples from F that do not contribute to the
computation of frequent items qualified by season
(L 1[season]). Then, we can use the reduced fact table
to compute the qualified pairs. Generally, every step
in our framework has two parts:

1. compute L_k[A];
2. use the computed L_k[A] to reduce F.
This framework is similar to query flocks [26] and
can be adapted to handle iterative, candidate-growing,
algorithms, such as a-priori [1]. Instead of reducing
the size of the fact table in step 2, we can use L k[A] to
generate the new candidate set for the next iteration of
the algorithm.
Next, we detail the two-part step and the interaction within a sequence.
3.3.1. Computing L_k[A]
First, we express L_k[A] in relational algebra. Without loss of generality, we assume that A = {A_1, …, A_m} and that A_i is an attribute of dimension table D_i:

L_k[A] = σ_{count(TID) ≥ w}( γ_{I_k, A}( F_k ⋈ DJ ) )

where F_k is a k-way self-join of the fact table F on TID, I_k is the set of keys (K_0) of the item dimension from each of the k copies of the fact table, DJ is the cross product of D_1, …, D_m, γ is the aggregation operator, and w is the minimum support.
In SQL, L_k[A] can be computed as follows:

SELECT F1.K0, ..., Fk.K0, A1, ..., Am
FROM F AS F1, ..., F AS Fk, D1, ..., Dm
WHERE F1.TID = F2.TID AND ... AND F1.TID = Fk.TID
  AND F1.K1 = D1.K1 AND ... AND F1.Km = Dm.Km
GROUP BY F1.K0, ..., Fk.K0, A1, ..., Am
HAVING COUNT(DISTINCT F1.TID) >= :minsup
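For concreteness, the template above could be instantiated on the Fig. 1 schema as in the following sketch, which computes L_2[Region] (frequent pairs of products qualified by Region); SALES_FACT is the assumed fact-table name and :minsup the minimum support count.

-- Sketch: L2[Region], i.e. frequent product pairs qualified by Region (k = 2, A = {Region}).
SELECT f1.ProductID, f2.ProductID, l.Region
FROM SALES_FACT f1, SALES_FACT f2, LOCATION l
WHERE f1.TransactID = f2.TransactID
  AND f1.ProductID < f2.ProductID          -- avoid mirror-image pairs
  AND f1.LocationID = l.LocationID
GROUP BY f1.ProductID, f2.ProductID, l.Region
HAVING COUNT(DISTINCT f1.TransactID) >= :minsup;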
3.3.2. Reducing the fact table F with L_k[A]
First, consider the case when k = 1 and A = ∅. Then L_1[∅], as a relation, has only one attribute, K_0. The fact table is reduced by semijoining it with L_1[∅]:

F ⋉ π_{K_0}( σ_{count(TID) ≥ w}( γ_{K_0}(F) ) )

Let A = {A_1, …, A_m}. Then L_1[A] has, in addition to K_0, the attributes A_1, …, A_m. Therefore, the reduction of F has to involve DJ in addition to L_1[A]:

F ⋉ π( DJ ⋉ σ_{count(TID) ≥ w}( γ_{K_0, A}( F ⋈ DJ ) ) )

In the general case, for k > 1, where I_1 is the first attribute in I_k, we can express the reduction as follows:

X = F_k ⋈ DJ ⋉ σ_{count(TID) ≥ w}( γ_{I_k, A}( F_k ⋈ DJ ) )

F ⋉ π_{TID, I_1, K_1, …, K_m}( X )

Intuitively, we need the TID in the last expression in order to connect the items in the qualified itemsets.
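As a sketch of the simplest reduction (k = 1 and A = ∅) on the Fig. 1 schema, the fact table can be restricted to the rows whose item is frequent; the names SALES_FACT, SALES_FACT_RED and :minsup are assumptions.

-- Sketch: semijoin-style reduction of the fact table with L1[empty qualifier].
CREATE TABLE SALES_FACT_RED AS
SELECT f.*
FROM SALES_FACT f
WHERE f.ProductID IN (
  SELECT ProductID
  FROM SALES_FACT
  GROUP BY ProductID
  HAVING COUNT(DISTINCT TransactID) >= :minsup
);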
3.3.3. Computing a sequence of L k[A]
So far, we considered computing L k[A] individually.
In this section we consider what happens when we
compute several of them. In particular, we examine
how to use a computed L k[A] to speed up the computation of another L l[B]. We consider all three possible
cases:
- L_k[A] ≽ L_l[B]. In this case, using the computed set of frequent qualified itemsets does not help. The reason is that L_l[B] can be arbitrarily large regardless of the size of L_k[A]. In an extreme case L_k[A] can be empty and L_l[B] can be as large as possible.
- L_k[A] ≼ L_l[B]. This is the case that corresponds to the condition in a-priori. The computed set can be used in computing L_l[B] as shown in a-priori.
- L_k[A] and L_l[B] are incomparable (neither succeeds the other). In this case, using an argument similar to the first case, we can show that the computed set does not help.
Thus, the order of computation affects the actual
result, so we will use a different notation to denote
that the computation of the frequent qualified itemsets
is happening in the context of other computations. We define R^k_A to be the computation of L_k[A] on the current reduced fact table F:

R^k_A = σ_{count(TID) ≥ w}( γ_{I_k, A}( F_k^red ⋈ DJ ) )

where F_k^red is a k-way self-join of what remains of the fact table after the previous reduction. For the first reduction, F_k^red = F_k.
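As an illustrative sketch of one such sequence on the Fig. 1 schema (SALES_FACT, SALES_FACT_RED and :minsup are assumed names and placeholders), a step R^1_season can first reduce the fact table, and R^2_season,region can then be computed over the reduced table:

-- Step 1 (sketch of R1_season): keep fact rows whose (product, season) combination is frequent.
CREATE TABLE SALES_FACT_RED AS
SELECT f.*
FROM SALES_FACT f
JOIN CALENDAR c ON f.CalendarID = c.CalendarID
JOIN (
  SELECT f2.ProductID, c2.Season
  FROM SALES_FACT f2
  JOIN CALENDAR c2 ON f2.CalendarID = c2.CalendarID
  GROUP BY f2.ProductID, c2.Season
  HAVING COUNT(DISTINCT f2.TransactID) >= :minsup
) freq ON freq.ProductID = f.ProductID AND freq.Season = c.Season;

-- Step 2 (sketch of R2_season_region): frequent qualified pairs over the reduced fact table.
SELECT f1.ProductID, f2.ProductID, c.Season, l.Region
FROM SALES_FACT_RED f1
JOIN SALES_FACT_RED f2 ON f1.TransactID = f2.TransactID AND f1.ProductID < f2.ProductID
JOIN CALENDAR c ON f1.CalendarID = c.CalendarID
JOIN LOCATION l ON f1.LocationID = l.LocationID
GROUP BY f1.ProductID, f2.ProductID, c.Season, l.Region
HAVING COUNT(DISTINCT f1.TransactID) >= :minsup;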
3.4. Optimization algorithm
We can show that the plan space for the optimization problem is double-exponential in the number of qualified dimensions. Consider all reducers at level 1. Since we have m dimensions and a reducer involves any subset of these dimensions, the number of possible reducers is 2^m. A plan involves any subset of all possible reducers, so the number of possible plans for one level is 2^(2^m). If we consider n levels, then the total number of possible plans is 2^(n·2^m).
The number of possible reducers is n·2^m, where n is the maximum itemset size and m is the number of qualified dimensions. Since a plan is any subset of these reducers, the total number of possible plans is 2^(n·2^m).
