COP 5725 Fall 2012 Database Management Systems
University of Florida, CISE Department
Prof. Daisy Zhe Wang
Adapted Slides from Prof. Jeff Ullman
Data Warehousing and Data Mining
- Warehousing
- OLAP
- Data Mining
Introduction
- Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
- Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly sta
- On-Line Analytic Processing (OLAP) vs. On-Line Transaction Processing (OLTP)
OLTP
- Most database operations involve On-Line Transaction Processing (OLTP).
- – Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
- – Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.
OLAP
- Of increasing importance are On-Line Analytic Processing (OLAP) queries.
- – Few, but complex queries --- may run for hours.
- – Queries do not depend on having an absolutely up-to-date database.
OLAP Examples
1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.
Common Architecture
- Databases at store branches handle OLTP queries.
- Local store databases copied to a central warehouse overnight.
- Analysts use the warehouse for OLAP and data mining.
Three Complementary Trends
- Data Warehousing: Consolidate data from many sources in one large repository.
- – Loading, periodic synchronization of replicas.
- – Semantic integration.
- OLAP:
- – Complex SQL queries and views.
- – Queries based on spreadsheet-style operations and “multidimensional” view of data.
- – Interactive and “online” queries.
- Data Mining: Exploratory search for interesting trends and anomalies.
Data Warehousing
- Integrated data spanning long time periods, often augmented with summary information.
- Several gigabytes to terabytes common.
- Interactive response times expected for complex queries; ad-hoc updates uncommon.
[Figure: data warehouse architecture; labels: SOURCES, EXTRACT, TRANSFORM, DATA WAREHOUSE, Metadata Repository, SUPPORTS.]
Warehousing Issues
- Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas.
- Heterogeneous Sources: Must access data from a variety of source formats and repositories.
- Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data.
- Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse.
Multidimensional Data Model
- Collection of numeric measures, which depend on a set of dimensions.
- – E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid).

Sales fact table:
pid   timeid   locid   sales
11    1        1       25
11    2        1       8
11    3        1       15
12    1        1       30
12    2        1       20
12    3        1       50
13    1        1       8
13    2        1       10
13    3        1       10
11    1        2       35

Slice locid=1 is shown (rows: pid, columns: timeid):
pid   timeid=1   timeid=2   timeid=3
11    25         8          15
12    30         20         50
13    8          10         10
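A minimal Python sketch of the slice operation, assuming the fact table is held as a list of (pid, timeid, locid, sales) tuples copied from the rows shown above. The name `slice_by_loc` is illustrative and not part of the slides.

```python
# Sales facts as (pid, timeid, locid, sales), copied from the rows shown above.
SALES = [
    (11, 1, 1, 25), (11, 2, 1, 8), (11, 3, 1, 15),
    (12, 1, 1, 30), (12, 2, 1, 20), (12, 3, 1, 50),
    (13, 1, 1, 8), (13, 2, 1, 10), (13, 3, 1, 10),
    (11, 1, 2, 35),
]


def slice_by_loc(facts, locid):
    """Fix the Location dimension and return a {(pid, timeid): sales} grid."""
    return {(pid, timeid): sales
            for (pid, timeid, loc, sales) in facts if loc == locid}


grid = slice_by_loc(SALES, 1)
for pid in (11, 12, 13):
    # One row of the slice per product, ordered by timeid.
    print(pid, [grid[(pid, t)] for t in (1, 2, 3)])
```

Fixing one dimension leaves a two-dimensional grid over the remaining dimensions, which is why a slice can be drawn as a flat table.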
MOLAP vs. ROLAP systems
- MOLAP systems: Multidimensional data are stored physically in a (disk-resident, persistent) multidimensional array.
- ROLAP systems: Multidimensional data are stored as relations.
- – The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
- – E.g., Sales(pid, locid, timeid, sales) or Sales(bar, beer, drinker, time, price)
MOLAP and Data Cubes
- Keys of dimension tables are the dimensions of a hypercube.
- – Example: for the Sales(bar, beer, drinker, time, price) data, the four dimensions are bar, beer, drinker, and time.
- Dependent attributes (e.g., price) appear at the points of the cube.
Visualization - Data Cubes
[Figure: a data cube with axes labeled bar and beer and price at the points; time is noted as a possible 4th dimension.]
Data Cube Marginals
- The data cube also includes aggregation (typically SUM) along the margins of the cube.
- The marginals include aggregations over one dimension, two dimensions,…
Visualization - Data Cube w/ Aggregation
[Figure: a data cube with aggregation along its margins; labels: bar, beer, price.]
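To make the marginals concrete, here is a hedged Python sketch that computes SUM over every subset of the dimensions. It reuses a few rows of the Sales(pid, timeid, locid, sales) example rather than the bar/beer data, and the function name `cube` is an assumption made for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("pid", "timeid", "locid")
# (pid, timeid, locid, sales): a few rows of the Sales example.
FACTS = [(11, 1, 1, 25), (11, 2, 1, 8), (12, 1, 1, 30), (11, 1, 2, 35)]


def cube(facts, dims=DIMS):
    """Return {grouping: {key: SUM(sales)}} for every subset of dims."""
    result = {}
    for k in range(len(dims) + 1):
        for grouping in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in facts:
                # Keep only the dimensions in this grouping; aggregate the rest away.
                key = tuple(row[i] for i in grouping)
                totals[key] += row[-1]
            result[tuple(dims[i] for i in grouping)] = dict(totals)
    return result


c = cube(FACTS)
print(len(c))        # one aggregate table per subset of the dimensions
print(c[()])         # the grand total (aggregated over all dimensions)
print(c[("pid",)])   # marginal: sales per product
```

With three dimensions there are eight groupings, from the full cube down to the single grand total.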
ROLAP and Star Schemas
- A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Example: Star Schema
- Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
- The fact table is a relation:
Sales(bar, beer, drinker, day, time, price)
Example, Continued
- The dimension tables include information about the bar, beer, and drinker “dimensions”:
Bars(bar, addr, license)
Beers(beer, manf)
Drinkers(drinker, addr, phone)
Visualization – Star Schema
[Figure: the fact table Sales (dimension attributes and dependent attributes) at the center, linked to the dimension tables Bars, Drinkers, Beers, and Time, etc.]
Dimensions and Dependent Attributes
- Two classes of fact-table attributes:
1. Dimension attributes: the key of a dimension table.
2. Dependent attributes: a value determined by the dimension attributes of the tuple.
Dimension Hierarchies
- For each dimension, the set of values can be organized in a hierarchy:
[Figure: hierarchies over the attributes of each dimension: category and pname (Product); year, quarter, month, week, and date (Time); country, state, and city (Location).]
OLAP Queries
- Influenced by SQL and by spreadsheets.
- A common operation is to aggregate a measure over one or more dimensions.
- – Find total sales.
- – Find total sales for each city, or for each state.
- – Find top five products ranked by total sales.
- Roll-up: Aggregating at different levels of a dimension hierarchy.
- – E.g., Given total sales by city, we can roll-up to get sales by state.
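A roll-up along the Location hierarchy can be sketched in a few lines of Python. The city names, the city-to-state mapping, and the totals below are made up for illustration, as is the name `roll_up`.

```python
from collections import defaultdict

# Hypothetical data: cities, states, and totals are invented for this sketch.
CITY_TO_STATE = {"Madison": "WI", "Milwaukee": "WI", "Fresno": "CA"}
SALES_BY_CITY = {"Madison": 40, "Milwaukee": 23, "Fresno": 81}


def roll_up(sales_by_city, city_to_state):
    """Aggregate city-level totals one level up the Location hierarchy."""
    by_state = defaultdict(int)
    for city, total in sales_by_city.items():
        by_state[city_to_state[city]] += total
    return dict(by_state)


sales_by_state = roll_up(SALES_BY_CITY, CITY_TO_STATE)
print(sales_by_state)
```

The roll-up needs only the finer-grained totals and the hierarchy mapping, not the original fact table.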
OLAP Queries
- Drill-down: The inverse of roll-up.
- – E.g., Given total sales by state, can drill-down to get total sales by city.
- – E.g., Can also drill-down on different dimension to get total sales by product for each year/quarter, etc.
- Pivoting: Aggregation on selected dimensions.
- – E.g., Pivoting on Location and Time yields this cross-tabulation:

        WI    CA    Total
1995    63    81    144
1996    38    107   145
1997    75    35    110
Total   176   223   399

- Slicing and Dicing: Equality and range selections on one or more dimensions.
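The margins of a cross-tabulation are plain sums of its cells. The following Python sketch recomputes them from the six (year, state) cell values above; the name `margins` is illustrative.

```python
# Cell values (year, state) -> sales, taken from the cross-tabulation above.
CELLS = {
    (1995, "WI"): 63, (1995, "CA"): 81,
    (1996, "WI"): 38, (1996, "CA"): 107,
    (1997, "WI"): 75, (1997, "CA"): 35,
}


def margins(cells):
    """Return (row_totals, column_totals, grand_total) for a 2-D cross-tab."""
    rows, cols = {}, {}
    for (year, state), sales in cells.items():
        rows[year] = rows.get(year, 0) + sales
        cols[state] = cols.get(state, 0) + sales
    return rows, cols, sum(cells.values())


row_totals, col_totals, grand_total = margins(CELLS)
print(row_totals, col_totals, grand_total)
```

The row totals and the column totals must both add up to the same grand total, which is a quick consistency check on any cross-tab.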
Comparison with SQL Queries
- The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:

SELECT T.year, L.state, SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid AND S.locid = L.locid
GROUP BY T.year, L.state

SELECT T.year, SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid = T.timeid
GROUP BY T.year

SELECT L.state, SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid = L.locid
GROUP BY L.state

The CUBE Operator
- Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL queries that can be generated through pivoting on a subset of dimensions.
- CUBE pid, locid, timeid BY SUM Sales
- – Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid};
- – each roll-up corresponds to an SQL query of the form:
SELECT SUM(S.sales)
FROM Sales S
GROUP BY grouping-list
Design Issues
- Fact table in BCNF; dimension tables un-normalized.
- – Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance.
SALES (fact table): pid, timeid, locid, sales
PRODUCTS: pid, pname, category, price
LOCATIONS: locid, city, state, country
TIMES: timeid, date, week, month, quarter, year, holiday_flag
- This kind of schema is very common in OLAP applications, and is called a star schema ; computing the join of all these relations is called a star join .
ROLAP Techniques
1. New indexing techniques: Bitmap indexes, Join indexes
2. Array representations, compression
3. Pre-computation of aggregations (i.e., materialized views), etc.
4. We are going to cover in more detail:
- Bitmap indexes
- Materialized views
Bitmap Index
- Bit-vector: 1 bit for each possible value.
- Many queries can be answered using bit-vector ops!
- For each key value of a dimension table create a bit-vector telling which tuples of the fact table have that value.

sex bits (M F)   custid   name   sex   rating   rating bits (1 2 3 4 5)
10               112      Joe    M     3        00100
10               115      Ram    M     5        00001
01               119      Sue    F     5        00001
10               112      Woo    M     4        00010
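A small Python sketch of answering a query with bit-vector operations, using the four customer rows from the table above. Each distinct value gets one integer whose bit i is set when row i has that value; the names `build_bitmaps`, `sex_index`, and `rating_index` are illustrative assumptions.

```python
# Rows copied from the table above: (custid, name, sex, rating).
CUSTOMERS = [(112, "Joe", "M", 3), (115, "Ram", "M", 5),
             (119, "Sue", "F", 5), (112, "Woo", "M", 4)]


def build_bitmaps(rows, column):
    """Map each distinct value in `column` to an int whose bit i is set
    when row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps


sex_index = build_bitmaps(CUSTOMERS, 2)
rating_index = build_bitmaps(CUSTOMERS, 3)

# "Males with rating 5" is a bitwise AND of two bit-vectors.
hits = sex_index["M"] & rating_index[5]
matches = [CUSTOMERS[i][1] for i in range(len(CUSTOMERS)) if (hits >> i) & 1]
print(matches)
```

The conjunction is evaluated without touching the rows themselves; only the matching positions are fetched at the end.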
Example OLAP Query
- Often, OLAP queries begin with a “ star join ”: the natural join of the fact table with all or most of the dimension tables.
- Example:
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer;
Materialized Views
- Store the answers to several useful queries (views) in the warehouse itself.
- A direct execution of an OLAP query from the fact and the dimension tables could take too long (even with bitmap indexes)
- If we create a materialized view that contains enough information, we may be able to answer our query much faster.
Materialized Views (cont.)
- A view whose tuples are stored in the database is said to be materialized .
- – Provides fast access, like a (very high-level) cache.
- – Need to maintain the view as the underlying tables change.
- – Ideally, we want incremental view maintenance algorithms.
Example OLAP Query
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer;
Example --- Continued
- Here is a materialized view that could help:
CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS
SELECT bar, addr, beer, manf, SUM(price) sales
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;
- Since bar -> addr and beer -> manf, there is no real grouping. We need addr and manf in the SELECT.
Example --- Concluded
- Here’s our query using the materialized view BABMS:
SELECT bar, beer, sales
FROM BABMS
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch';
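The equivalence of the direct query and the query over BABMS can be checked with a small sketch using Python's built-in sqlite3 module. Two assumptions are made here: all rows are invented for illustration, and because SQLite has no materialized views, BABMS is stored with CREATE TABLE ... AS instead of CREATE VIEW.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sales(bar, beer, drinker, day, time, price);
CREATE TABLE Bars(bar, addr, license);
CREATE TABLE Beers(beer, manf);
-- Hypothetical rows, made up for illustration.
INSERT INTO Bars VALUES ('Joe''s', 'Palo Alto', 'L1'), ('Sue''s', 'Menlo Park', 'L2');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Coors', 'Coors');
INSERT INTO Sales VALUES
  ('Joe''s', 'Bud', 'Ann', 1, 10, 3.0),
  ('Joe''s', 'Bud', 'Bob', 1, 11, 3.5),
  ('Joe''s', 'Coors', 'Ann', 2, 10, 4.0),
  ('Sue''s', 'Bud', 'Cal', 2, 12, 2.5);
-- SQLite has no materialized views, so the view is stored as a table.
CREATE TABLE BABMS AS
  SELECT bar, addr, beer, manf, SUM(price) AS sales
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  GROUP BY bar, addr, beer, manf;
""")

# The original OLAP query: star join plus aggregation at query time.
direct = con.execute("""
  SELECT bar, beer, SUM(price)
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
  GROUP BY bar, beer""").fetchall()

# The same answer read from the precomputed table: no join, no aggregation.
from_view = con.execute("""
  SELECT bar, beer, sales FROM BABMS
  WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'""").fetchall()

print(direct, from_view)
```

In this sketch the stored table is a snapshot: later inserts into Sales would not be reflected until it is rebuilt or incrementally maintained.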
Views in DW for OLAP queries
- OLAP queries are typically aggregate queries.
- – Precomputation is essential for interactive response times.
- – The CUBE is in fact a collection of aggregate queries, and precomputation is especially important
- – lots of work on what is best to precompute given a limited amount of space to store precomputed results.
- Warehouses can be thought of as a collection of periodically updated tables and periodically maintained views.
View Modification (Evaluate On Demand)
View:
CREATE VIEW RegionalSales(category, sales, state)
AS SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid = S.pid AND S.locid = L.locid

Query:
SELECT R.category, R.state, SUM(R.sales)
FROM RegionalSales AS R
GROUP BY R.category, R.state

Modified Query:
SELECT R.category, R.state, SUM(R.sales)
FROM (SELECT P.category, S.sales, L.state
      FROM Products P, Sales S, Locations L
      WHERE P.pid = S.pid AND S.locid = L.locid) AS R
GROUP BY R.category, R.state
View Materialization (Precomputation)
- Suppose we precompute RegionalSales and store it with a clustered B+ tree index on [category,state,sales].
- – Then, previous query can be answered by an index-only scan.
SELECT R.category, R.state, SUM(R.sales)
FROM RegionalSales R
GROUP BY R.category, R.state
- Index on precomputed view is great!
SELECT R.state, SUM(R.sales)
FROM RegionalSales R
WHERE R.category = 'Laptop'
GROUP BY R.state
- Index is less useful (must scan entire leaf level).
SELECT R.category, SUM(R.sales)
FROM RegionalSales R
WHERE R.state = 'Wisconsin'
GROUP BY R.category
Data Mining Definition
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%
Data Mining Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
- Valid: The patterns hold in general.
- Novel: We did not know the pattern beforehand.
- Useful: We can devise actions from the patterns.
- Understandable: We can interpret and comprehend the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
- – Volume and dimensionality of the data
- – High data growth rate
- – 3V in Big Data: Volume, Velocity, Variety
Availability of:
- – Data
- – Storage
- – Computational power
- – Off-the-shelf software
- – Expertise
Preprocessing and Mining
[Figure: Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge]
Example Application: Sports
IBM Advanced Scout analyzes NBA game statistics
- – Shots blocked
- – Assists
- – Fouls
- Google: “IBM Advanced Scout”
Advanced Scout
- Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.”
- Pattern is interesting:
The average shooting percentage for the Charlotte Hornets during that game was 54%.
Data Mining Techniques
- Supervised learning
- – Classification and regression
- Unsupervised learning
- – Clustering
- Dependency modeling
- – Associations, summarization, causality
- Trend analysis and change detection
Market Basket Analysis (Example Dependency Modeling)
- Consider a shopping cart filled with several items
- Market basket analysis tries to answer the following questions:
- – Who makes purchases?
- – What do customers buy together?
- – In what order do customers purchase items?
Market Basket Analysis
Given:
- A database of customer transactions
- Each transaction is a set of items
- Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID   CID   Date     Item    Qty
111   201   5/1/99   Pen     2
111   201   5/1/99   Ink     1
111   201   5/1/99   Milk    3
111   201   5/1/99   Juice   6
112   105   6/3/99   Pen     1
112   105   6/3/99   Ink     1
112   105   6/3/99   Milk    1
113   106   6/5/99   Pen     1
113   106   6/5/99   Milk    1
114   201   7/1/99   Pen     2
114   201   7/1/99   Ink     2
114   201   7/1/99   Juice   4
Market Basket Analysis (Contd.)
- Co-occurrences
- – 80% of all customers purchase items X, Y and Z together.
- Association rules
- – 60% of all customers who purchase X and Y also buy Z.
- Sequential patterns
- – 60% of customers who first buy X also purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
- Confidence of a rule:
- – X => Y has confidence c if P(Y|X) = c
- Support of a rule:
- – X => Y has support s if P(XY) = s
We can also define
- Support of an itemset (a co-occurrence) XY:
- – XY has support s if P(XY) = s
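Both measures can be estimated directly from the transaction table shown earlier (TIDs 111 to 114). The following is a minimal Python sketch; the function names `support` and `confidence` are illustrative, and probabilities are taken as fractions of transactions.

```python
# The four transactions from the table shown earlier, keyed by TID.
TRANSACTIONS = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


print(support({"Pen", "Milk"}), confidence({"Pen"}, {"Milk"}))
print(support({"Ink", "Pen"}), confidence({"Ink"}, {"Pen"}))
```

Support is symmetric in the two sides of a rule, while confidence is not: it depends on which itemset is the antecedent.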
Example
Examples:
- {Pen} => {Milk}
- – Support: 75%, Confidence: 75%
- {Ink} => {Pen}
- – Support: 75%, Confidence: 100%
Example
- Find all itemsets with support >= 75%?
Example
- Can you find all association rules with support >= 50%?
Market Basket Analysis: Applications
- Sample Applications
- – Direct marketing
- – Floor/shelf planning
- – Web site layout
- – Cross-selling
Frequent Itemset Algorithms
- Applications
- More abstract problem
- Breadth-first search
Applications of Frequent Itemsets
- Market Basket Analysis
- Association Rules
- Classification (especially: text, rare classes)
- Seeds for construction of Bayesian Networks
- Collaborative filtering
Problem
Abstract:
- A set of items {1,2,…,k}
- A database of transactions (itemsets) D={T1, T2, …, Tn}, Tj subset {1,2,…,k}
GOAL: Find all itemsets that appear in at least x transactions
(“appear in” == “are subsets of”)
I subset T: T supports I
Concrete:
- I = {milk, bread, cheese, …}
- D = { {milk,bread,cheese}, {bread,cheese,juice}, …}
GOAL: Find all itemsets that appear in at least 1000 transactions
{milk,bread,cheese} supports {milk,bread}
For an itemset I, the number of transactions it appears in is called the support of I. x is called the minimum support.
Problem (Contd.)
Definitions:
- An itemset is frequent if it is a subset of at least x transactions. (FI.)
- An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)
GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).
Obvious relationship: MFI subset FI
Example:
D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximal frequent
Support( {1,2} ) = 4
All maximal frequent itemsets: {1,2,3}
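The definitions can be checked by brute force on the example database D with minimum support x = 3. This Python sketch enumerates every non-empty subset of the items, which is only feasible for tiny item sets; the function names are illustrative, and the empty itemset (trivially frequent) is left out.

```python
from itertools import combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
MIN_SUPPORT = 3


def support_count(itemset, transactions):
    """Number of transactions that contain `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, x):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*transactions))
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support_count(set(c), transactions) >= x]


def maximal(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [i for i in frequent if not any(i < j for j in frequent)]


FI = frequent_itemsets(D, MIN_SUPPORT)
MFI = maximal(FI)
print(sorted(map(sorted, FI)), sorted(map(sorted, MFI)))
```

Every subset of a maximal frequent itemset is itself frequent, so MFI is a compact description of FI.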
The Itemset Lattice
[Figure: the lattice of all subsets of {1,2,3,4}, from {} through the 1-, 2-, and 3-itemsets to {1,2,3,4}.]
Frequent Itemsets
[Figure: the same lattice with each itemset marked as frequent or infrequent.]
Breadth First Search: 1-Itemsets
[Figure: the lattice while the 1-itemsets are examined; legend: Infrequent, Frequent, Currently examined, Don’t know.]
The Apriori Principle: I infrequent => (I union {x}) infrequent
Breadth First Search: 2-Itemsets
[Figure: the lattice while the 2-itemsets are examined; same legend.]
Breadth First Search: 3-Itemsets
[Figure: the lattice while the 3-itemsets are examined; same legend.]
The Apriori Principle: I infrequent => (I union {x}) infrequent
Breadth First Search: Remarks
- We prune infrequent itemsets and avoid counting them
- To find an itemset with k items, we need to count all 2^k subsets
- Breadth first search uses Apriori algorithm:
Next, we show how to implement A-Priori in SQL
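Before the SQL formulation, the level-wise search can be sketched in Python. This is a simplified sketch of the Apriori idea rather than an optimized implementation: candidates of size k are built by joining frequent (k-1)-itemsets, and any candidate with an infrequent (k-1)-subset is pruned before counting. The function name `apriori` and the use of the earlier example database are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise (breadth-first) search for frequent itemsets.

    Returns {frozenset: support count} for every non-empty itemset contained
    in at least `min_support` transactions.
    """
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = set().union(*transactions)
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 2
    while level:
        # Join frequent (k-1)-itemsets into k-item candidates, then prune any
        # candidate with an infrequent (k-1)-subset (the Apriori principle).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
result = apriori(D, 3)
print(sorted(sorted(s) for s in result))
```

On this database item 4 is dropped after the first level, so no itemset containing 4 is ever counted.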
Finding Frequent Pairs
- The simplest case is when we only want to find “frequent pairs” of items.
- Assume data is in a relation Baskets(basket, item).
- The support threshold s is the minimum number of baskets in which a pair appears before we are interested.
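To make the setup concrete, here is a small in-memory Python sketch of counting pairs against a threshold s. The Baskets(basket, item) tuples are made up for illustration, and `frequent_pairs` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Hypothetical Baskets(basket, item) tuples, made up for illustration.
BASKETS = [(1, "milk"), (1, "bread"), (1, "juice"),
           (2, "milk"), (2, "bread"),
           (3, "bread"), (3, "juice"),
           (4, "milk"), (4, "bread")]


def frequent_pairs(rows, s):
    """Return {(item1, item2): count} for pairs found in at least s baskets."""
    by_basket = {}
    for basket, item in rows:
        by_basket.setdefault(basket, set()).add(item)
    counts = Counter()
    for items in by_basket.values():
        # sorted() puts each pair in one canonical order, so it is counted once.
        counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in counts.items() if n >= s}


print(frequent_pairs(BASKETS, 2))
```

Ordering the two items of each pair plays the same role as requiring that the first item precede the second: it keeps {a, b} and {b, a} from being counted separately.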
Frequent Pairs in SQL
SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket    -- look for two Baskets tuples with the same basket and different items
  AND b1.item < b2.item        -- first item must precede second, so we don’t count the same pair twice
GROUP BY b1.item, b2.item      -- create a group for each pair of items that appears in at least one basket
HAVING COUNT(*) >= s;          -- throw away pairs of items that do not appear at least s times
A-Priori Trick --- (1)
- Straightforward implementation involves a join of a huge Baskets relation with itself.
- The a-priori algorithm speeds the query by recognizing that a pair of items {i, j} cannot have support s unless both {i} and {j} do.
A-Priori Trick --- (2)
- Use a materialized view to hold only information about frequent items.
INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
    -- items that appear in at least s baskets
    SELECT item FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);
A-Priori Algorithm
1. Materialize the view Baskets1.
- Computing Baskets1 is cheap, since it doesn’t involve a join.
- Baskets1 probably has many fewer tuples than Baskets.
- – Running time shrinks with the square of the number of tuples involved in the join.
Two Observations
- Monotonic function (if x>=y, f(x)>=f(y))
- Antimonotonic fn. (if x>=y, f(x)<=f(y))
- Apriori can be applied to any constraint P that is antimonotone (e.g., support>const)
- – Start from the empty set.
- – Prune supersets of sets that do not satisfy P.
- Apriori can also be applied to a monotone constraint Q (e.g., sum>const)
- – Start from set of all items instead of empty set.
- – Prune subsets of sets that do not satisfy Q.
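The two properties can be verified exhaustively on a tiny example. In this Python sketch, P is a support constraint over the earlier database D and Q is a sum constraint; treating the item numbers as their own (non-negative) values and the thresholds 3 and 4 are assumptions made purely for illustration.

```python
from itertools import combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
ITEMS = [1, 2, 3, 4]


def P(itemset):
    """Antimonotone constraint: support of at least 3 transactions."""
    return sum(1 for t in D if itemset <= t) >= 3


def Q(itemset):
    """Monotone constraint: the item values sum to more than 4."""
    return sum(itemset) > 4


subsets = [frozenset(c) for k in range(len(ITEMS) + 1)
           for c in combinations(ITEMS, k)]

# Antimonotone: if a set satisfies P, so does each of its subsets.
p_ok = all(P(a) for b in subsets if P(b) for a in subsets if a <= b)
# Monotone: if a set satisfies Q, so does each of its supersets.
q_ok = all(Q(b) for a in subsets if Q(a) for b in subsets if a <= b)
both = sorted(sorted(s) for s in subsets if P(s) and Q(s))
print(p_ok, q_ok, both)
```

Because failing P propagates upward and failing Q propagates downward, each constraint lets a search discard whole regions of the lattice from one failed test.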
Negative Pruning an Antimonotone P
[Figure, shown in three steps: the lattice of subsets of {1,2,3,4}; legend: Frequent, Infrequent, Currently examined, Don’t know.]
Negative Pruning a Monotone Q
[Figure: the same lattice; legend: Satisfies Q, Doesn’t satisfy Q, Currently examined, Don’t know.]
The New Problem
New Goal:
- Given constraints P and Q, with P antimonotone (support) and Q monotone (statistical constraint).
- Find all itemsets that satisfy both P and Q.
Recent solutions:
- Newer algorithms can handle both P and Q
Conceptual Illustration of Problem
[Figure: the lattice from {} to D, divided into regions labeled Satisfies Q, Satisfies P & Q, and Satisfies P; annotations: “All supersets satisfy Q” and “All subsets satisfy P”.]
Summary
- Decision support is an emerging, rapidly growing subarea of databases.
- Involves the creation of large, consolidated data repositories called data warehouses.
- Warehouses exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
- New techniques for database design, indexing, view maintenance, and interactive querying need to be supported.
Summary (Cont.)
- Data Mining
- – Supervised
- – Unsupervised
- – Dependency modeling
- – Outlier detection
- – Trend analysis and prediction