Course: IKI83403T : Data Mining and Business Intelligence Lect2 DW and OLAP

Objectives
Motivation: Why data warehouse?
What is a data warehouse?
Whyy separate
p
DW?
Conceptual modeling of DW
Data Mart
Data Warehousing Architectures
Data Warehouse Development
Data Warehouse Vendors
R l
Real-time
DW

`
`
`
`
`
`


Data Warehousing and OLAP
Lecture 2/DMBI/IKI83403T/MTI/UI

`
`

Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia

`

2

Motivation: Why data warehouse?
`

`
`
`

`

Construction of data warehouses (DW) involves data
cleaning and data integration Æ important
preprocessing step for data mining (DM).
DW provide OLAP for the interactive analysis of
multidimensional data,, which facilitates effective DM.
Data mining functions can be integrated with OLAP
operations to enhance interactive mining of knowledge.
DW will provide an effective platform for DM.
Whil DWs
While
DW are nott requirements
i
t to
t do
d DM,
DM DW store
t
massive amounts of data that can be uses for DM. [DO]

3

What is a data warehouse? [JH]
Defined in many different ways, but not rigorously.

`

`
`

A decision support database that is maintained separately
from the organization’s ODB.
Support information processing by providing a solid platform
of consolidated, historical data for analysis.

“A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision-making process.” —
W. H. Inmon
Case Study 2: Continental Airlines flies high with its

real-time data warehouse

`

`

4

What is a data warehouse? [ET]
Data warehouse
A physical repository where relational data are specially
organized to provide enterprise-wide, cleansed data in a
standardized format.
Characteristics

`

`

`

`
`

Subject oriented, Integrated, Time Variant, Non-volatile
Web-based, Relational/multidimensional, Client/server, Real-time
Include metadata

Data warehousing

`

`
`

Process of constructing and using data warehouses.
Requires data integration, data cleaning, and data consolidation.

Subject Oriented
Organized around major subjects, such as
customer, product, sales.

` Provide
P id a simple
i l and
d concise
i view
i around
d
particular subject issues by excluding data that
are not useful in the decision support process.
` Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing.
processing
`

5

6

Integrated

Integrate multiple, heterogeneous data sources

`

`

Time Variant
The time horizon for the data warehouse is significantly
longer than that of operational systems.

`

Relational databases, flat-files, on-line transaction records

Data cleaningg and data integration
g
techniques
q
are
applied


`

`

Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
`

`

7

`

Operational database: current value data.

`


Data warehouse data: provide information from a historical
perspective (e
(e.g.,
g past 5-10 years)

Every key structure in the data warehouse

`

E.g., Hotel price: currency, tax, breakfast covered, etc.

Wh data
When
d is
i moved
d to the
h warehouse,
h
it
i is

i converted.
d

8

`

Contains an element of time, explicitly or implicitly

`

But the key of operational data may or may not contain “time
element”.

Non volatile
Non-volatile
Ap
physically
y
y separate

p
store of data transformed from the

`

Data Warehouse vs
vs. Heterogeneous DBMS
`

operational environment.
O
Operational
i l update
d
off data
d does
d
not occur in
i the
h d
data

`

Traditional heterogeneous DB integration:
`

Build wrappers/mediators on top of multiple, heterogeneous databases.
Ex: IBM Data Joiner, Informix DataBlade

`

Q
Query
d
driven
i
approach:
h

warehouse environment.
`

`

When a query is posed to a client site, a metadata-dictionary is used
to translate the query into queries appropriate for the individual
heterogeneous sites involved. There queries are then mapped and sent
to local query processors. The results returned from the different
sites are integrated
d into a global
l b l answer set.

`

Complex information filtering and integration processes, compete for
resources.
resources

`

Inefficient and potentially expensive for frequent queries, especially for

Does not require transaction processing, recovery, and
concurrencyy control mechanisms

`

Requires only two operations in data accessing:
`

i i i l loading
initial
l di off data
d and
d access off data.
d

queries requireing
q
q
g aggregations.
gg g
9

10

Data Warehouse vs
vs. Heterogeneous DBMS (2)
`

Using DW Æ update-driven approach
`

`
`

`
`

`

Major task of ODB Æ OLTP:

`

Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
DW serve ffor data
d analysis
l i and
d decision
d i i making
ki Æ OLAP

`

Distinct Features (OLTP vs. OLAP)

Information from multiple, heterogeneous sources is integrated in advance
and stored in a warehouse for direct querying and analysis.

Unlike OLTP,
OLTP DW do not contain the most current information
information.
DW brings high performance to the integrated heterogeneous
DB system since data are copied,
copied preprocessed,
preprocessed integrated,
integrated
annotated, summarized, and restructured into one data store.
Query processing in DW does not interfere with the processing
at local sources
DW can store and integrate
g
historical information and support
pp
complex multidimensional queries.

11

DW vs.
vs ODB
`

`
`
`
`
`

12

User and
U
d system orientation:
i
i customer vs. market
k
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs.
vs star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read
read-only
only but complex queries

OLTP vs OLAP

Why Separate DW?

OLTP

OLAP

users

Clerk IT professional
Clerk,

Knowledge worker

function

day to day operations

decision support

DB design

application-oriented

subject-oriented

data

current, up-to-date
detailed, flat relational
isolated
repetitive

historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans

unit of work

read/write
index/hash on prim. key
short, simple transaction

complex query

# records accessed

tens

millions

#users

thousands

hundreds

DB size

100MB-GB

100GB-TB

metric

transaction throughput

query throughput, response

usage
access

`
`

`

`

14

Why Separate DW? (2)
Different functions and different data:
`

`

`

missing data: Decision support requires historical data which
operational DBs do not typically maintain. So, data in ODB is
p
for decision making.
g
usuallyy far from complete
data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources. ODB
contain
t i detailed
d t il d raw data
d t (transactions)
(t
ti ) which
hi h need
d to
t be
b
consolidated before analysis.
data q
quality:
y different sources typically
yp
y use inconsistent data
representations, codes and formats which have to be
reconciled.

Conceptual Modeling of DW
`

Data Cube:
`

`

see TSBD Lecture Notes on Visualization of Data Cubes

M d li data
Modeling
d t warehouses:
h
dimensions
di
i
& measurements
t
`

`

`

15

DBMS — tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse — tuned for OLAP: complex OLAP queries,
computation of large groups of data at summarized levels,
multidimensional view,, consolidation.

Processing OLAP queries in operational databases would
degrade the performance of operational tasks.
In ODB, concurrency control and recovery mechanisms
(locking, logging) are required to ensure the consistency
and
d robustness
b
off transactions.
i
OLAP Æ read only access. No need for concurrency
control and recovery.
recovery

`

13

`

High
g performance
p
for both systems:
y

`

Star schema: A single object (fact table) in the middle connected
to a number of objects (dimension tables,
tables one for each
dimension).
Snowflake schema: A refinement of star schema where the
dimensional hierarchy is represented explicitly by normalizing
the dimension tables.
Fact constellations: Multiple fact tables share dimension tables.
` Also known as galaxy schema

16

Example of Star Schema

Snowflake Schema
Product

Date
Sales Fact Table
Date
Product

Year

Product

Month
Date

Month
Year

Day
Month

Date
Product

Store

Store
StoreID
City
State
Country
Region

Cust
Customer

dollar_sales

Measurements

`

City

Country

Cust

StoreID
Cit
City

Customer

dollar_sales
Yen_sales

Measurements

18

View of Warehouses and Hierarchies

Data Cube
D t
Date

19

CustId
CustName
CustCity
CustCountry

unit_sales

State
Country

Country
Region

Bandung, Bogor keduanya
ada di Jawa Barat

17

City
State

State

Potensi Redundansi
`

Store

Store

CustId
C tN
CustName
CustCity
CustCountry

unit_sales

Yen_sales

ProductNo
ProdName
ProdDesc
Category
QOH

Sales Fact Table

`

Importing data

`

Table Browsing

`

Dimension creation

`

Dimension browsing

`

Cube buildingg

`

Cube browsing

TV
PC
VCR
sum

1Qtr

2Qtr

3Qtr

4Qtr

sum

Total annual sales
of TV in U.S.A.
USA
U.S.A
Ca ada
Canada
Mexico
sum

20

Country

Day
Month
Year

Year

ProductNo
ProdName
ProdDesc
C
Category
QOH

Data Cube

Typical OLAP Operations
`
`

`
`
`

`
`
21

`

Visualization
OLAP capabilities
p
Interactive manipulation

Data Mart
`

`
`

`

DW collects information about subjects that span the
entire organization, such as customers, products, sales, assets,
and personnel. Its scope is enterprise-wide.
For DW, fact constellation schema is commonly used
since it can model multiple, interrelated subjects.
Data Mart is a subset of a DW, focuses on a particular
subject. Its scope is department-wide. Typically, a data mart
consisting off a single
l subject
b
area (e.g.
(
marketing,
k
operations).
For Data Mart, star or snowflake schema are commonly
used since both are geared towards modeling single
subjects,
bj t although
lth h th
the star
t schema
h
iis more popular.
l
23

`

Roll up (drill-up): summarize data
` by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
` from higher level summary to lower level summary or detailed data,
data or
introducing new dimensions
Slice and dice:
` project and select
Pivot (rotate):
` reorient the cube, visualization, 3D to series of 2D planes.
Other operations
` drill
d ill across: iinvolving
l i ((across)) more than
th one fact
f t table.
t bl
` drill through: through the bottom level to its back-end relational tables.
More info:
`

www.knowledgecenters.org, www.olapreport.com, www.olapcouncil.org

22

Data Mart
`
`

A data mart can be either dependent or independent.
A dependent data mart is a subset that is created directly
from the DW.
`
`
`
`

`

Consistent data model
Providing quality data
DW must be constructed first
Ensures that the user viewing the same version of the data that
are accessed
d by
b allll other
h d
data warehouse
h
users

An independent data mart is a small warehouse designed
f ad
for
department, and
d iits source is
i not an EDW.
EDW

24

Data Warehousing Process Overview

Data Warehousing Process Overview
`

The major components of a data warehousing process
`

`
`
`
`
`

25

Data Warehousing Architectures

27

Data sources
` Legacy systems, external data providers (e.g. BPS), OLTP,
ERP Systems
Data extraction
Data loading
Comprehensive database
Metadata
Middleware tools

26

Data Warehousing Architectures

28

Data Warehousing Architectures

Data Warehousing Architectures

29

30

Data Integration and the ETL Process
`

Various integration technologies:
`

Enterprise Application Integration (EAI)
`
`
`

`

`

ETL
`
`
`

`

Enterprise Information Integration (EII)
`

`

31

A technology that provides a vehicle for pushing data from source
systems
t
iinto
t a data
d t warehouse
h
Integrating application functionality and is focused on sharing
functionality across systems
Traditionally, API. Nowadays, SOA (web services).

Data Integration and the ETL Process

An evolving tool space that promises real-time data integration from
a variety of sources, such as relational databases, Web services, and
multidimensional databases
A mechanism for pulling data from source systems to satisfy a request
for information.
32

60-70% of the time in a data-centric project.
Extraction: Reading data from one or more databases
Transformation
` Converting the extracted data from its previous form into the form in
which it needs to be so that it can be placed into a DW
Load
` Putting the
data into
d
the DW

Data Warehouse Development
`

`

Direct benefits
`
`
`
`

`

Data Warehouse Development
Some best practices for implementing a DW (Weir, 2002):
`

Allowing end users to perform extensive analysis in numerous
ways
A consolidated view off corporate data (i.e
( a single version off
the truth)
Better and more timely information
Enhanced system performance. DW frees production
processing because some operational system reporting
requirements are moved to DSS
Simplification of data access

`
`
`
`
`
`
`
`
` 34

33

Data Warehouse Vendors
`
`
`
`
`
`
`
`
`
`
`

Computer Associates
DataMirror
Data Advantage
g Groupp
Dell Computer
Embarcadero Technologies
Business Objects
HP
Hummingbird
H
Hyperion
IBM
Informatica
35

`
`
`
`
`
`
`

Microsoft
Oracle
SAS
Siemens
Sybase
Teradata
Please visit:
`
`

Data Warehousing Institute
(tdwi com)
(tdwi.com)
DM Review (dmreview.com)

Project must fit with corporate strategy and business objectives
There must be complete buy-in to the project by executives,
managers and users
managers,
It is important to manage user expectations about the completed
project
The data warehouse must be built incrementally
Build in adaptability
M
Managed
d bby bboth
h IT and
d bbusiness
i
professionals
f i l
Develop a business/supplier relationship
O l lload
Only
dd
data
t that
th t have
h
been
b
cleansed
l
d and
d are off a quality
lit
understood by the organization
Do not overlook training requirements
Be politically aware

Data Warehouse Vendors
`

Six gguidelines to considered when developing
p ga
vendor list:
1
1.
2.
3.
4.
5.
6.

36

Financial strength
ERP linkages
Q lifi d consultants
Qualified
l
Market share
Industry experience
Established partnerships
p
p

Real time DW
Real-time

Real time DW
Real-time

Traditionally, updated on a weekly basis.
Unsuitable for some businesses.
Real-time ((active)) data warehousing
g
The process of loading and providing data via a data
warehouse as theyy become available
Levels of data warehouses:

`
`
`

`

1.
2.
3.
4.
5.

Reports what happened
Some analysis occurs
Provides p
prediction capabilities,
p
,
Operationalization
Becomes capable
p
of makingg events happen
pp

37

38

Real time DW
Real-time

From DW to DM [JH]
`

Three kinds of data warehouse applications
`

`

`

39

Information processing
` supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
Analytical processing
` multidimensional analysis of data warehouse data
` supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
` knowledge discovery from hidden patterns
` supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining
i i results
lt using
i visualization
i li ti ttools.
l

40

References
`
`
`

[JH] Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, 2001.
[ET] Efraim Turban et al., Decision Support and Business
Intelligence Systems, Pearson, 2007.
[DO] David Olson and Yong Shi, Introduction to Business
Data Mining, McGraw-Hill, 2007.

41