COP 5725 Fall 2012 Database Management Systems


University of Florida, CISE Department
Prof. Daisy Zhe Wang
Adapted Slides from Prof. Jeff Ullman

Data Warehousing and Data Mining

Warehousing
OLAP
Data Mining

Introduction

Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly static.

On-Line Analytic Processing (OLAP) vs. On-Line Transaction Processing (OLTP)

OLTP

Most database operations involve On-Line Transaction Processing (OLTP).

– Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
– Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.

OLAP

Of increasing importance are On-Line Analytic Processing (OLAP) queries.

– Few, but complex queries --- may run for hours.
– Queries do not depend on having an absolutely up-to-date database.

OLAP Examples

1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.

Common Architecture

Databases at store branches handle OLTP queries.
Local store databases copied to a central warehouse overnight.
Analysts use the warehouse for OLAP and data mining.

Three Complementary Trends

Data Warehousing: Consolidate data from many sources in one large repository.

– Loading, periodic synchronization of replicas.
– Semantic integration.

OLAP:

– Complex SQL queries and views.

– Queries based on spreadsheet-style operations and “multidimensional” view of data.
– Interactive and “online” queries.

Data Mining: Exploratory search for interesting trends and anomalies.

Data Warehousing

Integrated data spanning long time periods, often augmented with summary information.
Several gigabytes to terabytes common.
Interactive response times expected for complex queries; ad-hoc updates uncommon.

[Figure labels: SOURCES, EXTRACT, TRANSFORM, DATA WAREHOUSE, Metadata Repository, SUPPORTS.]

Warehousing Issues

Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas.

Heterogeneous Sources: Must access data from a variety of source formats and repositories.

Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data.

Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse.

Multidimensional Data Model

Collection of numeric measures, which depend on a set of dimensions.

– E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid).

Slice locid=1 is shown:

         timeid
pid      1    2    3
11      25    8   15
12      30   20   50
13       8   10   10

The same data as a relation:

pid  timeid  locid  sales
11   1       1      25
11   2       1      8
11   3       1      15
12   1       1      30
12   2       1      20
12   3       1      50
13   1       1      8
13   2       1      10
13   3       1      10
11   1       2      35
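The slice operation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `slice_by_location` are assumptions, and only the rows listed in the slide are included.

```python
# Sales facts from the slide, keyed by (pid, timeid, locid) -> sales.
sales = {
    (11, 1, 1): 25, (11, 2, 1): 8,  (11, 3, 1): 15,
    (12, 1, 1): 30, (12, 2, 1): 20, (12, 3, 1): 50,
    (13, 1, 1): 8,  (13, 2, 1): 10, (13, 3, 1): 10,
    (11, 1, 2): 35,
}

def slice_by_location(facts, locid):
    """Fix the Location dimension and return a pid -> {timeid: sales} table."""
    table = {}
    for (pid, timeid, loc), amount in facts.items():
        if loc == locid:
            table.setdefault(pid, {})[timeid] = amount
    return table

slice_1 = slice_by_location(sales, 1)
for pid in sorted(slice_1):
    # One row of the two-dimensional slice per product.
    print(pid, [slice_1[pid][t] for t in sorted(slice_1[pid])])
```

Fixing one dimension to a single value leaves a two-dimensional array over the remaining dimensions, which is what the slide's slice shows.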

MOLAP vs. ROLAP systems

MOLAP systems: Multidimensional data are stored physically in a (disk-resident, persistent) multidimensional array.

ROLAP systems: Multidimensional data are stored as relations.

– The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
– E.g., Sales(pid, locid, timeid, sales) or Sales(bar, beer, drinker, time, price)

MOLAP and Data Cubes

Keys of dimension tables are the dimensions of a hypercube.

– Example: for the Sales(bar, beer, drinker, time, price) data, the four dimensions are bar, beer, drinker, and time.

Dependent attributes (e.g., price) appear at the points of the cube.

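Visualization - Data Cubes

[Figure: a cube with axes labeled bar, beer, and price; "Time?" marks a possible 4th dimension.]

Data Cube Marginals

The data cube also includes aggregation (typically SUM) along the margins of the cube.
The marginals include aggregations over one dimension, two dimensions, …

Visualization - Data Cube w/ Aggregation

[Figure: the cube with axes labeled beer, price, and bar, extended with aggregates along its margins.]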
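As a rough sketch of what the marginals contain, the Python below extends a tiny two-dimensional cube with SUM aggregates over each dimension and over both. The bar names, beer names, prices, and the `ALL` marker are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (bar, beer) -> price points of a two-dimensional cube.
cube = {
    ("Joe's", "Bud"): 3.0,
    ("Joe's", "Miller"): 2.5,
    ("Sue's", "Bud"): 3.5,
}

ALL = "*"  # marker meaning "aggregated over this dimension"

def with_marginals(points):
    """Return the points plus SUM marginals over every subset of dimensions."""
    result = defaultdict(float)
    for (bar, beer), price in points.items():
        result[(bar, beer)] += price  # the original point
        result[(bar, ALL)] += price   # aggregate over beer
        result[(ALL, beer)] += price  # aggregate over bar
        result[(ALL, ALL)] += price   # aggregate over both dimensions
    return dict(result)

full = with_marginals(cube)
print(full[("Joe's", ALL)], full[(ALL, "Bud")], full[(ALL, ALL)])
```

With d dimensions, each point contributes to 2^d cells, one per choice of which dimensions to aggregate away.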

ROLAP and Star Schemas

A star schema is a common organization for data at a warehouse. It consists of:

1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”

2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Example: Star Schema

Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
The fact table is a relation:

Sales(bar, beer, drinker, day, time, price)

Example, Continued

The dimension tables include information about the bar, beer, and drinker “dimensions”:

Bars(bar, addr, license)
Beers(beer, manf)
Drinkers(drinker, addr, phone)

Visualization – Star Schema

[Figure: the fact table Sales, with its dimension attributes and dependent attributes, surrounded by the dimension tables (Bars), (Drinkers), (Beers), and (Time, etc.).]

Dimensions and Dependent Attributes

Two classes of fact-table attributes:

1. Dimension attributes: the key of a dimension table.

2. Dependent attributes: a value determined by the dimension attributes of the tuple.

Dimension Hierarchies

For each dimension, the set of values can be organized in a hierarchy:

[Figure: hierarchies over pname and category; date, week, month, quarter, and year; city, state, and country.]

OLAP Queries

Influenced by SQL and by spreadsheets.
A common operation is to aggregate a measure over one or more dimensions.

– Find total sales.
– Find total sales for each city, or for each state.
– Find top five products ranked by total sales.

Roll-up: Aggregating at different levels of a dimension hierarchy.

– E.g., Given total sales by city, we can roll-up to get sales by state.

OLAP Queries

Drill-down: The inverse of roll-up.

– E.g., Given total sales by state, can drill-down to get total sales by city.
– E.g., Can also drill-down on different dimension to get total sales by product for each year/quarter, etc.
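Roll-up and drill-down along the city/state hierarchy can be sketched as follows. The city names, the sales figures, and the helper names `roll_up` and `drill_down` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical Location hierarchy: each city belongs to one state.
city_to_state = {"Madison": "WI", "Milwaukee": "WI", "Fresno": "CA"}

# Hypothetical total sales by city.
sales_by_city = {"Madison": 40, "Milwaukee": 23, "Fresno": 81}

def roll_up(totals, parent_of):
    """Aggregate child-level totals to the parent level of the hierarchy."""
    out = defaultdict(int)
    for child, amount in totals.items():
        out[parent_of[child]] += amount
    return dict(out)

def drill_down(child_totals, parent_of, parent):
    """Show the child-level totals that make up one parent's total."""
    return {c: v for c, v in child_totals.items() if parent_of[c] == parent}

sales_by_state = roll_up(sales_by_city, city_to_state)
print(sales_by_state)
print(drill_down(sales_by_city, city_to_state, "WI"))
```

Roll-up only needs the finer totals, while drill-down needs the finer-grained data to be available, since state totals alone cannot be split back into cities.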

Pivoting: Aggregation on selected dimensions.

– E.g., Pivoting on Location and Time yields this cross-tabulation:

         WI    CA   Total
1995     63    81    144
1996     38   107    145
1997     75    35    110
Total   176   223    399

Slicing and Dicing: Equality and range selections on one or more dimensions.

Comparison with SQL Queries

The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:

SELECT SUM (S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.state

SELECT SUM (S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year

SELECT SUM (S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.state

The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL queries that can be generated through pivoting on a subset of dimensions.
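To make the 2^k count concrete, the sketch below enumerates every subset of the three dimensions of Sales(pid, locid, timeid, sales) and builds the text of the corresponding aggregate query. The helper names `grouping_lists` and `roll_up_query` are assumptions; the code only generates SQL strings and does not run them.

```python
from itertools import combinations

def grouping_lists(dimensions):
    """Enumerate every subset of the dimensions, from () up to all of them."""
    for size in range(len(dimensions) + 1):
        yield from combinations(dimensions, size)

def roll_up_query(grouping):
    """Build the SQL text of one roll-up; an empty grouping is a grand total."""
    sql = "SELECT SUM(S.sales) FROM Sales S"
    if grouping:
        sql += " GROUP BY " + ", ".join("S." + d for d in grouping)
    return sql

queries = [roll_up_query(g) for g in grouping_lists(["pid", "locid", "timeid"])]
for q in queries:
    print(q)
```

For k = 3 this prints eight queries, one per subset, including the ungrouped grand total.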

CUBE pid, locid, timeid BY SUM Sales

– Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid};
– each roll-up corresponds to an SQL query of the form:

SELECT SUM (S.sales)
FROM Sales S
GROUP BY grouping-list

Design Issues

Fact table in BCNF; dimension tables un-normalized.

– Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance.

SALES (fact table): pid, timeid, locid, sales
PRODUCTS: pid, pname, category, price
LOCATIONS: locid, city, state, country
TIMES: timeid, date, week, month, quarter, year, holiday_flag

This kind of schema is very common in OLAP applications, and is called a star schema ; computing the join of all these relations is called a star join .

ROLAP Techniques

1. New indexing techniques: Bitmap indexes, Join indexes

2. Array representations, compression,

3. Pre-computation of aggregations (i.e., materialized views), etc.

We are going to cover in more detail:

– Bitmap indexes
– Materialized views

Bitmap Index

Bit-vector: 1 bit for each possible value.
Many queries can be answered using bit-vector ops!

custid  name  sex  rating  sex bits (M F)  rating bits (1 2 3 4 5)
112     Joe   M    3       10              00100
115     Ram   M    5       10              00001
119     Sue   F    5       01              00001
112     Woo   M    4       10              00010

For each key value of a dimension table create a bit-vector telling which tuples of the fact table have that value.
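A minimal sketch of answering a query with bit-vector operations, using the four customer rows from the slide. Python integers stand in for bit-vectors, and the function name `build_bitmap_index` is an assumption. Sue's rating is taken as 5, as read off her rating bit-vector 00001.

```python
# Customer rows from the slide: (custid, name, sex, rating).
rows = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),  # rating inferred from bit-vector 00001
    (112, "Woo", "M", 4),
]

def build_bitmap_index(records, column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for i, record in enumerate(records):
        value = record[column]
        index[value] = index.get(value, 0) | (1 << i)
    return index

sex_index = build_bitmap_index(rows, 2)
rating_index = build_bitmap_index(rows, 3)

# "Male customers with rating 5": AND the two bit-vectors.
hits = sex_index["M"] & rating_index[5]
matching = [rows[i][1] for i in range(len(rows)) if (hits >> i) & 1]
print(matching)
```

The selection is answered with a single bitwise AND over the two vectors, without reading the rows themselves until the final lookup.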

Example OLAP Query

Often, OLAP queries begin with a “ star join ”: the natural join of the fact table with all or most of the dimension tables.
Example:

SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer;

Materialized Views

Store the answers to several useful queries (views) in the warehouse itself.
A direct execution of an OLAP query from the fact and the dimension tables could take too long (even with bitmap indexes)
If we create a materialized view that contains enough information, we may be able to answer our query much faster.

Materialized Views (cont.)

A view whose tuples are stored in the database is said to be materialized .

– Provides fast access, like a (very high-level) cache.
– Need to maintain the view as the underlying tables change.
– Ideally, we want incremental view maintenance algorithms.

Example OLAP Query

SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer;

Example --- Continued

Here is a materialized view that could help:

CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS
SELECT bar, addr, beer, manf, SUM(price) sales
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;

Since bar -> addr and beer -> manf, there is no real grouping. We need addr and manf in the SELECT.

Example --- Concluded

Here's our query using the materialized view BABMS:

SELECT bar, beer, sales
FROM BABMS
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch';
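The equivalence of the two queries can be checked on a toy database with Python's built-in sqlite3 module. This is a sketch under stated assumptions: all rows are hypothetical, and because SQLite has no materialized views, BABMS is stored as an ordinary table created from the view's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);

-- Hypothetical sample rows, for illustration only.
INSERT INTO Bars VALUES ('Joe''s', 'Palo Alto', 'full'), ('Sue''s', 'Menlo Park', 'beer');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Miller', 'Miller Brewing');
INSERT INTO Sales VALUES
  ('Joe''s', 'Bud', 'Ann', '2012-10-01', '18:00', 3.0),
  ('Joe''s', 'Bud', 'Bob', '2012-10-01', '19:00', 3.0),
  ('Joe''s', 'Miller', 'Ann', '2012-10-02', '18:30', 2.5),
  ('Sue''s', 'Bud', 'Cal', '2012-10-02', '20:00', 3.5);

-- Stand-in for the materialized view: store the view's result as a table.
CREATE TABLE BABMS AS
  SELECT bar, addr, beer, manf, SUM(price) AS sales
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  GROUP BY bar, addr, beer, manf;
""")

# The original star-join query over the fact and dimension tables.
direct = conn.execute("""
    SELECT bar, beer, SUM(price)
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
    GROUP BY bar, beer
""").fetchall()

# The same question answered from the precomputed table, with no join.
from_view = conn.execute("""
    SELECT bar, beer, sales FROM BABMS
    WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
""").fetchall()

print(direct, from_view)
```

A stored table like this goes stale when Sales changes, which is why view maintenance matters for real materialized views.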

Views in DW for OLAP queries

OLAP queries are typically aggregate queries.

– Precomputation is essential for interactive response times.
– The CUBE is in fact a collection of aggregate queries, and precomputation is especially important.
– Lots of work on what is best to precompute given a limited amount of space to store precomputed results.

Warehouses can be thought of as a collection of periodically updated tables and periodically maintained views.

View Modification (Evaluate On Demand)

View:

CREATE VIEW RegionalSales (category, sales, state)
AS SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid

Query:

SELECT R.category, R.state, SUM (R.sales)
FROM RegionalSales AS R
GROUP BY R.category, R.state

Modified Query:

SELECT R.category, R.state, SUM (R.sales)
FROM ( SELECT P.category, S.sales, L.state
       FROM Products P, Sales S, Locations L
       WHERE P.pid=S.pid AND S.locid=L.locid ) AS R
GROUP BY R.category, R.state

View Materialization (Precomputation)

Suppose we precompute RegionalSales and store it with a clustered B+ tree index on [category,state,sales].

– Then, previous query can be answered by an index-only scan.

SELECT R.category, R.state, SUM (R.sales)
FROM RegionalSales R
GROUP BY R.category, R.state

Index on precomputed view is great!

SELECT R.state, SUM (R.sales)
FROM RegionalSales R
WHERE R.category = 'Laptop'
GROUP BY R.state

Index is less useful (must scan entire leaf level).

SELECT R.category, SUM (R.sales)
FROM RegionalSales R
WHERE R.state = 'Wisconsin'
GROUP BY R.category

Data Mining Definition

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

Data Mining Definition (Cont.)

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Valid: The patterns hold in general.

Novel: We did not know the pattern beforehand.

Useful: We can devise actions from the patterns.

Understandable: We can interpret and comprehend the patterns.

Why Use Data Mining Today?

Human analysis skills are inadequate:

– Volume and dimensionality of the data
– High data growth rate
– 3V in Big Data: Volume, Velocity, Variety

Availability of:

– Data
– Storage
– Computational power
– Off-the-shelf software
– Expertise

Preprocessing and Mining

[Figure: Original Data, then Data Integration and Selection, then Target Data, then Preprocessing, then Preprocessed Data, then Model Construction, then Patterns, then Interpretation, then Knowledge.]

Example Application: Sports

IBM Advanced Scout analyzes NBA game statistics

– Shots blocked
– Assists
– Fouls

Google: “IBM Advanced Scout”

Advanced Scout

Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.”

Pattern is interesting:

The average shooting percentage for the Charlotte Hornets during that game was 54%.

Data Mining Techniques

Supervised learning

– Classification and regression

Unsupervised learning

– Clustering

Dependency modeling

– Associations, summarization, causality

Trend analysis and change detection

Market Basket Analysis (Example Dependency Modeling)

Consider shopping cart filled with several items
Market basket analysis tries to answer the following questions:

– Who makes purchases?
– What do customers buy together?
– In what order do customers purchase items?

Market Basket Analysis

Given:

A database of customer transactions
Each transaction is a set of items
Example:

Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4

Market Basket Analysis (Contd.)

Co-occurrences

– 80% of all customers purchase items X, Y and Z together.

Association rules

– 60% of all customers who purchase X and Y also buy Z.

Sequential patterns

– 60% of customers who first buy X also purchase Y within three weeks.

Confidence and Support

We prune the set of all possible association rules using two interestingness measures:

Confidence of a rule:

– X => Y has confidence c if P(Y|X) = c

Support of a rule:

– X => Y has support s if P(XY) = s

We can also define Support of an itemset (a co-occurrence) XY:

– XY has support s if P(XY) = s

Example

Examples:

{Pen} => {Milk}
Support: 75%
Confidence: 75%

{Ink} => {Pen}
Support: 75%
Confidence: 100%
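The two measures can be recomputed from the example transactions with a short Python sketch. The item sets per TID come from the purchase table in this section; the function names `support` and `confidence` are assumptions, and probabilities are estimated as fractions of transactions.

```python
# Transactions from the example table, reduced to the set of items per TID.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs together) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Milk"}), confidence({"Pen"}, {"Milk"}))  # 0.75 0.75
print(support({"Ink", "Pen"}), confidence({"Ink"}, {"Pen"}))    # 0.75 1.0
```

Every transaction that contains Ink also contains Pen, which is why the second rule reaches 100% confidence while its support stays at 75%.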


Example

Find all itemsets with support >= 75%?

Example

Can you find all association rules with support >= 50%?

Market Basket Analysis: Applications

Sample Applications

– Direct marketing
– Floor/shelf planning
– Web site layout
– Cross-selling

Frequent Itemset Algorithms

Applications
More abstract problem
Breadth-first search

Applications of Frequent Itemsets

Market Basket Analysis
Association Rules
Classification (especially: text, rare classes)
Seeds for construction of Bayesian Networks
Collaborative filtering

Problem

Abstract:

A set of items {1,2,…,k}
A database of transactions (itemsets) D={T1, T2, …, Tn}, Tj subset {1,2,…,k}
GOAL: Find all itemsets that appear in at least x transactions
(“appear in” == “are subsets of”)
I subset T: T supports I

Concrete:

I = {milk, bread, cheese, …}
D = { {milk,bread,cheese}, {bread,cheese,juice}, …}
GOAL: Find all itemsets that appear in at least 1000 transactions
{milk,bread,cheese} supports {milk,bread}

For an itemset I, the number of transactions it appears in is called the support of I. x is called the minimum support.

Problem (Contd.)

Definitions:

An itemset is frequent if it is a subset of at least x transactions. (FI.)
An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)

GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).

Obvious relationship: MFI subset FI

Example:

D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximal frequent
Support( {1,2} ) = 4
All maximal frequent itemsets: {1,2,3}

The Itemset Lattice

[Figure: the lattice of all subsets of {1,2,3,4}, from {} to {1,2,3,4}.]

Frequent Itemsets

[Figure: the same lattice with itemsets marked as frequent or infrequent.]

Breadth First Search: 1-Itemsets

[Figure: the lattice with itemsets marked Infrequent, Frequent, Currently examined, or Don't know.]

The Apriori Principle: I infrequent => ( I union {x} ) infrequent

Breadth First Search: 2-Itemsets

[Figure: the lattice with itemsets marked Infrequent, Frequent, Currently examined, or Don't know.]

Breadth First Search: 3-Itemsets

[Figure: the lattice with itemsets marked Infrequent, Frequent, Currently examined, or Don't know.]

The Apriori Principle: I infrequent => ( I union {x} ) infrequent

Breadth First Search: Remarks

We prune infrequent itemsets and avoid counting them.
To find an itemset with k items, we need to count all 2^k subsets.
Breadth-first search is the strategy used by the Apriori algorithm.

Next, we show how to implement A-Priori in SQL.
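Before the SQL formulation, the breadth-first search can be sketched in Python and run on the small example D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} } with minimum support 3. This is a simple level-wise version written for readability, not an optimized implementation; the function names `apriori` and `maximal` are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count every size-k candidate with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori principle).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def maximal(frequent):
    """Keep only frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
fi = apriori(D, min_support=3)
mfi = maximal(fi)
print(sorted(sorted(s) for s in fi))
print([sorted(s) for s in mfi])
```

Item 4 is dropped at the first level, so no itemset containing 4 is ever counted; the only maximal frequent itemset is {1,2,3}.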

Finding Frequent Pairs

The simplest case is when we only want to find “frequent pairs” of items.
Assume data is in a relation Baskets(basket, item).
The support threshold s is the minimum number of baskets in which a pair appears before we are interested.

Frequent Pairs in SQL

SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;

– FROM and first WHERE condition: look for two Basket tuples with the same basket and different items.
– b1.item < b2.item: first item must precede second, so we don't count the same pair twice.
– GROUP BY: create a group for each pair of items that appears in at least one basket.
– HAVING: throw away pairs of items that do not appear at least s times.

A-Priori Trick --- (1)

Straightforward implementation involves a join of a huge Baskets relation with itself.
The a-priori algorithm speeds the query by recognizing that a pair of items {i, j} cannot have support s unless both {i} and {j} do.

A-Priori Trick --- (2)

Use a materialized view to hold only information about frequent items.

INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
    SELECT item FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);

– The subquery selects items that appear in at least s baskets.
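The frequent-pairs query and the a-priori filtering step can be tried on a toy relation with Python's sqlite3 module. The basket contents, the threshold s = 2, and the use of an ordinary table for Baskets1 are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets(basket INTEGER, item TEXT)")
# Hypothetical baskets, for illustration only.
data = {1: ["beer", "chips", "milk"], 2: ["beer", "chips"], 3: ["beer", "soap"]}
conn.executemany(
    "INSERT INTO Baskets VALUES (?, ?)",
    [(b, i) for b, items in data.items() for i in items],
)
s = 2  # support threshold

# The frequent-pairs self-join, parameterized by the table it runs on.
PAIRS = """
    SELECT b1.item, b2.item
    FROM {t} b1, {t} b2
    WHERE b1.basket = b2.basket AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?
"""

# Direct self-join on the full relation.
direct = conn.execute(PAIRS.format(t="Baskets"), (s,)).fetchall()

# A-priori: keep only tuples whose item is itself frequent, then join.
conn.execute("CREATE TABLE Baskets1(basket INTEGER, item TEXT)")
conn.execute("""
    INSERT INTO Baskets1(basket, item)
    SELECT * FROM Baskets
    WHERE item IN (SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?)
""", (s,))
pruned = conn.execute(PAIRS.format(t="Baskets1"), (s,)).fetchall()

print(direct, pruned)
```

Both queries return the same pairs, but the second self-join runs over the smaller Baskets1 relation.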

A-Priori Algorithm

1. Materialize the view Baskets1.
2. Run the frequent-pairs query on Baskets1 instead of Baskets.

Computing Baskets1 is cheap, since it doesn't involve a join.
Baskets1 probably has many fewer tuples than Baskets.

– Running time shrinks with the square of the number of tuples involved in the join.

Two Observations

Monotonic function: if x>=y, f(x)>=f(y)
Antimonotonic fn.: if x>=y, f(x)<=f(y)

Apriori can be applied to any constraint P that is antimonotone (e.g., support>const)

– Start from the empty set.
– Prune supersets of sets that do not satisfy P.

Apriori can also be applied to a monotone constraint Q (e.g., sum>const)

– Start from set of all items instead of empty set.
– Prune subsets of sets that do not satisfy Q.
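The two properties can be checked by brute force on a four-item lattice. In this sketch the transactions, the per-item values, and both thresholds are hypothetical; the values are non-negative, which is what makes a sum constraint monotone.

```python
from itertools import combinations

# Hypothetical data: transactions over items 1..4 and a non-negative
# value per item (a sum constraint is monotone only for such values).
D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
value = {1: 5, 2: 1, 3: 4, 4: 2}

def P(itemset):
    """Antimonotone constraint: contained in at least 3 transactions."""
    return sum(1 for t in D if itemset <= t) >= 3

def Q(itemset):
    """Monotone constraint: total item value greater than 5."""
    return sum(value[i] for i in itemset) > 5

# All 16 subsets of {1, 2, 3, 4}.
lattice = [frozenset(c) for k in range(5) for c in combinations(range(1, 5), k)]

# Antimonotone: whenever a set satisfies P, so does every subset of it.
antimonotone = all(P(a) for b in lattice if P(b) for a in lattice if a <= b)
# Monotone: whenever a set satisfies Q, so does every superset of it.
monotone = all(Q(b) for a in lattice if Q(a) for b in lattice if a <= b)

both = sorted(sorted(s) for s in lattice if P(s) and Q(s))
print(antimonotone, monotone, both)
```

Because a failure of P propagates to all supersets and a failure of Q propagates to all subsets, each failed check rules out a whole region of the lattice.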

Negative Pruning an Antimonotone P

[Figure: three successive snapshots of the lattice of subsets of {1,2,3,4}, with itemsets marked Frequent, Infrequent, Currently examined, or Don't know.]

Negative Pruning a Monotone Q

[Figure: the lattice of subsets of {1,2,3,4}, with itemsets marked Satisfies Q, Doesn't satisfy Q, Currently examined, or Don't know.]

The New Problem

New Goal:

Given constraints P and Q, with P antimonotone (support) and Q monotone (statistical constraint).
Find all itemsets that satisfy both P and Q.

Recent solutions:

Newer algorithms can handle both P and Q

Conceptual Illustration of Problem

[Figure: the itemset lattice starting at {}, with regions labeled Satisfies Q, Satisfies P & Q, and Satisfies P; all supersets satisfy Q, and all subsets satisfy P.]

Summary

Decision support is an emerging, rapidly growing subarea of databases.
Involves the creation of large, consolidated data repositories called data warehouses.
Warehouses exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
New techniques for database design, indexing, view maintenance, and interactive querying need to be supported.

Summary (Cont.)

Data Mining

– Supervised
– Unsupervised
– Dependency modeling
– Outlier detection
– Trend analysis and prediction
