Data Modeling for Data Warehouses
29.3 Data Modeling for Data Warehouses
Multidimensional models take advantage of inherent relationships in data to popu- late data in multidimensional matrices called data cubes. (These may be called hypercubes if they have more than three dimensions.) For data that lends itself to dimensional formatting, query performance in multidimensional matrices can be much better than in the relational data model. Three examples of dimensions in a corporate data warehouse are the corporation’s fiscal periods, products, and regions.
A standard spreadsheet is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products could
be shown as rows, with sales revenues for each region comprising the columns. (Figure 29.2 shows this two-dimensional organization.) Adding a time dimension,
29.3 Data Modeling for Data Warehouses 1071
P123 P124
oduct Pr
P125 P126
Figure 29.2
A two-dimensional matrix model.
such as an organization’s fiscal quarters, would produce a three-dimensional matrix, which could be represented using a data cube.
Figure 29.3 shows a three-dimensional data cube that organizes product sales data by fiscal quarters and sales regions. Each cell could contain data for a specific product,
Figure 29.3
Qtr 4
A three-dimensional
er
data cube model.
uart
Qtr 3
al_q Fisc
P125 P126 P127
Product
1072 Chapter 29 Overview of Data Warehousing and OLAP
specific fiscal quarter, and specific region. By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be eas- ily visualized or graphically presented. The data can be queried directly in any com- bination of dimensions, bypassing complex database queries. Tools exist for viewing data according to the user’s choice of dimensions.
Changing from one-dimensional hierarchy (orientation) to another is easily accom- plished in a data cube with a technique called pivoting (also called rotation). In this technique the data cube can be thought of as rotating to show a different orienta- tion of the axes. For example, you might pivot the data cube to show regional sales revenues as rows, the fiscal quarter revenue totals as columns, and the company’s products in the third dimension (Figure 29.4). Hence, this technique is equivalent to having a regional sales table for each product separately, where each table shows quarterly sales for that product region by region.
Multidimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display. A roll-up display moves up the hierarchy, grouping into larger units along a dimension (for example, summing weekly data by quarter or by year). Figure 29.5 shows a roll-up display that moves from individual products to a coarser-grain of product categories. Shown in Figure
29.6, a drill-down display provides the opposite capability, furnishing a finer- grained view, perhaps disaggregating country sales by region and then regional sales by subregion and also breaking up products by styles.
Figure 29.4
Pivoted version of the data
P 12
cube from Figure 29.3.
Reg 4 Region
29.3 Data Modeling for Data Warehouses 1073
Region Region 1
Region 2
Region 3
Products ies
1XX Products
2XX Products
Product categor 3XX Products
4XX
Figure 29.5
The roll-up operation.
Region 1
Region 2
Sub_reg 1
Sub_reg 2
Sub_reg 3
Sub_reg 4
Sub_reg 1
A P123
B Styles
CD
P124 A Styles
BC
A P125
B Styles Figure 29.6 C
D The drill-down operation.
The multidimensional storage model involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension.
A fact table can be thought of as having tuples, one per a recorded fact. This fact contains some measured or observed variable(s) and identifies it (them) with point- ers to dimension tables. The fact table contains the data, and the dimensions iden- tify each tuple in that data. Figure 29.7 contains an example of a fact table that can
be viewed from the perspective of multiple dimension tables. Two common multidimensional schemas are the star schema and the snowflake
schema. The star schema consists of a fact table with a single table for each dimen-
1074 Chapter 29 Overview of Data Warehousing and OLAP
Dimension table
Dimension table
Product
Fiscal quarter
Prod_no
Fact table
Qtr
Prod_name
Business results
Year
Prod_descr
Beg_date
Prod_style
Product
End_date
Prod_line
Quarter Region
Dimension table Figure 29.7
Sales_revenue
A star schema with fact and dimensional
Region tables.
Subregion
the dimensional tables from a star schema are organized into a hierarchy by normal- izing them (Figure 29.8). Some installations are normalizing data warehouses up to the third normal form so that they can access the data warehouse to the finest level of detail. A fact constellation is a set of fact tables that share some dimension tables. Figure 29.9 shows a fact constellation with two fact tables, business results and busi- ness forecast. These share the dimension table called product. Fact constellations limit the possible queries for the warehouse.
Data warehouse storage also utilizes indexing techniques to support high- performance access (see Chapter 18 for a discussion of indexing). A technique called bitmap indexing constructs a bit vector for each value in a domain (column)
Figure 29.8
A snowflake schema. Dimension tables
Dimension tables Pname
FQ dates Prod_name
Fiscal quarter
Beg_date Prod_descr
Fact table
End_date
Business results
Beg_date
Prod_no Prod_name
Prod_line_no
Region Revenue
Pline
Sales revenue
Prod_line_no
Region
29.4 Building a Data Warehouse 1075
Fact table I
Dimension table
Fact table II
Business results
Product
Business forecast
Product
Prod_no
Product
Quarter
Prod_name
Future_qtr
Region
Prod_descr
Region
Revenue
Prod_style
Projected_revenue
Prod_line
Figure 29.9
A fact constellation.
being indexed. It works very well for domains of low cardinality. There is a 1 bit placed in the jth position in the vector if the jth row contains the value being indexed. For example, imagine an inventory of 100,000 cars with a bitmap index on car size. If there are four car sizes—economy, compact, mid-size, and full-size— there will be four bit vectors, each containing 100,000 bits (12.5K) for a total index size of 50K. Bitmap indexing can provide considerable input/output and storage space advantages in low-cardinality domains. With bit vectors a bitmap index can provide dramatic improvements in comparison, aggregation, and join performance.
In a star schema, dimensional data can be indexed to tuples in the fact table by join indexing . Join indexes are traditional indexes to maintain relationships between primary key and foreign key values. They relate the values of a dimension of a star schema to rows in the fact table. For example, consider a sales fact table that has city and fiscal quarter as dimensions. If there is a join index on city, for each city the join index maintains the tuple IDs of tuples containing that city. Join indexes may involve multiple dimensions.
Data warehouse storage can facilitate access to summary data by taking further advantage of the nonvolatility of data warehouses and a degree of predictability of the analyses that will be performed using them. Two approaches have been used: (1) smaller tables including summary data such as quarterly sales or revenue by product line, and (2) encoding of level (for example, weekly, quarterly, annual) into existing tables. By comparison, the overhead of creating and maintaining such aggregations would likely be excessive in a volatile, transaction-oriented database.
Parts
» Fundamentals_of_Database_Systems,_6th_Edition
» Characteristics of the Database Approach
» Advantages of Using the DBMS Approach
» A Brief History of Database Applications
» Schemas, Instances, and Database State
» The Three-Schema Architecture
» The Database System Environment
» Centralized and Client/Server Architectures for DBMSs
» Classification of Database Management Systems
» Domains, Attributes, Tuples, and Relations
» Key Constraints and Constraints on NULL Values
» Relational Databases and Relational Database Schemas
» Integrity, Referential Integrity, and Foreign Keys
» Update Operations, Transactions, and Dealing with Constraint Violations
» SQL Data Definition and Data Types
» Specifying Constraints in SQL
» The SELECT-FROM-WHERE Structure of Basic SQL Queries
» Ambiguous Attribute Names, Aliasing, Renaming, and Tuple Variables
» Substring Pattern Matching and Arithmetic Operators
» INSERT, DELETE, and UPDATE Statements in SQL
» Comparisons Involving NULL and Three-Valued Logic
» Nested Queries, Tuples, and Set/Multiset Comparisons
» The EXISTS and UNIQUE Functions in SQL
» Joined Tables in SQL and Outer Joins
» Grouping: The GROUP BY and HAVING Clauses
» Discussion and Summary of SQL Queries
» Specifying General Constraints as Assertions in SQL
» Introduction to Triggers in SQL
» Specification of Views in SQL
» View Implementation, View Update, and Inline Views
» Schema Change Statements in SQL
» Sequences of Operations and the RENAME Operation
» The UNION, INTERSECTION, and MINUS Operations
» The CARTESIAN PRODUCT (CROSS PRODUCT) Operation
» Variations of JOIN: The EQUIJOIN and NATURAL JOIN
» Additional Relational Operations
» Examples of Queries in Relational Algebra
» The Tuple Relational Calculus
» The Domain Relational Calculus
» Using High-Level Conceptual Data Models
» Entity Types, Entity Sets, Keys, and Value Sets
» Relationship Types, Relationship Sets, Roles, and Structural Constraints
» ER Diagrams, Naming Conventions, and Design Issues
» Example of Other Notation: UML Class Diagrams
» Relationship Types of Degree Higher than Two
» Subclasses, Superclasses, and Inheritance
» Constraints on Specialization and Generalization
» Specialization and Generalization Hierarchies
» Modeling of UNION Types Using Categories
» A Sample UNIVERSITY EER Schema, Design Choices, and Formal Definitions
» Data Abstraction, Knowledge Representation, and Ontology Concepts
» ER-to-Relational Mapping Algorithm
» Discussion and Summary of Mapping for ER Model Constructs
» Mapping EER Model Constructs
» The Role of Information Systems
» The Database Design and Implementation Process
» Use of UML Diagrams as an Aid to Database Design Specification 6
» Rational Rose: A UML-Based Design Tool
» Automated Database Design Tools
» Introduction to Object-Oriented Concepts and Features
» Object Identity, and Objects versus Literals
» Complex Type Structures for Objects and Literals
» Encapsulation of Operations and Persistence of Objects
» Type Hierarchies and Inheritance
» Other Object-Oriented Concepts
» Object-Relational Features: Object Database Extensions to SQL
» Overview of the Object Model of ODMG
» Built-in Interfaces and Classes in the Object Model
» Atomic (User-Defined) Objects
» Extents, Keys, and Factory Objects
» The Object Definition Language ODL
» Differences between Conceptual Design of ODB and RDB
» Mapping an EER Schema to an ODB Schema
» Query Results and Path Expressions
» Overview of the C++ Language Binding in the ODMG Standard
» Structured, Semistructured, and Unstructured Data
» XML Hierarchical (Tree) Data Model
» Well-Formed and Valid XML Documents and XML DTD
» XPath: Specifying Path Expressions in XML
» XQuery: Specifying Queries in XML
» Extracting XML Documents from
» Database Programming: Techniques
» Retrieving Single Tuples with Embedded SQL
» Retrieving Multiple Tuples with Embedded SQL Using Cursors
» Specifying Queries at Runtime Using Dynamic SQL
» SQLJ: Embedding SQL Commands in Java
» Retrieving Multiple Tuples in SQLJ Using Iterators
» Database Programming with SQL/CLI Using C
» JDBC: SQL Function Calls for Java Programming
» Database Stored Procedures and SQL/PSM
» PHP Variables, Data Types, and Programming Constructs
» Overview of PHP Database Programming
» Imparting Clear Semantics to Attributes in Relations
» Redundant Information in Tuples and Update Anomalies
» Normal Forms Based on Primary Keys
» General Definitions of Second and Third Normal Forms
» Multivalued Dependency and Fourth Normal Form
» Join Dependencies and Fifth Normal Form
» Inference Rules for Functional Dependencies
» Minimal Sets of Functional Dependencies
» Properties of Relational Decompositions
» Dependency-Preserving Decomposition
» Dependency-Preserving and Nonadditive (Lossless) Join Decomposition into 3NF Schemas
» Problems with NULL Values and Dangling Tuples
» Discussion of Normalization Algorithms and Alternative Relational Designs
» Further Discussion of Multivalued Dependencies and 4NF
» Other Dependencies and Normal Forms
» Memory Hierarchies and Storage Devices
» Hardware Description of Disk Devices
» Magnetic Tape Storage Devices
» Placing File Records on Disk
» Files of Unordered Records (Heap Files)
» Files of Ordered Records (Sorted Files)
» External Hashing for Disk Files
» Hashing Techniques That Allow Dynamic File Expansion
» Other Primary File Organizations
» Parallelizing Disk Access Using RAID Technology
» Types of Single-Level Ordered Indexes
» Some General Issues Concerning Indexing
» Algorithms for External Sorting
» Implementing the SELECT Operation
» Implementing the JOIN Operation
» Algorithms for PROJECT and Set
» Notation for Query Trees and Query Graphs
» Heuristic Optimization of Query Trees
» Catalog Information Used in Cost Functions
» Examples of Cost Functions for SELECT
» Examples of Cost Functions for JOIN
» Example to Illustrate Cost-Based Query Optimization
» Factors That Influence Physical Database Design
» Physical Database Design Decisions
» An Overview of Database Tuning in Relational Systems
» Transactions, Database Items, Read and Write Operations, and DBMS Buffers
» Why Concurrency Control Is Needed
» Transaction and System Concepts
» Desirable Properties of Transactions
» Serial, Nonserial, and Conflict-Serializable Schedules
» Testing for Conflict Serializability of a Schedule
» How Serializability Is Used for Concurrency Control
» View Equivalence and View Serializability
» Types of Locks and System Lock Tables
» Guaranteeing Serializability by Two-Phase Locking
» Dealing with Deadlock and Starvation
» Concurrency Control Based on Timestamp Ordering
» Multiversion Concurrency Control Techniques
» Validation (Optimistic) Concurrency
» Granularity of Data Items and Multiple Granularity Locking
» Using Locks for Concurrency Control in Indexes
» Other Concurrency Control Issues
» Recovery Outline and Categorization of Recovery Algorithms
» Caching (Buffering) of Disk Blocks
» Write-Ahead Logging, Steal/No-Steal, and Force/No-Force
» Transaction Rollback and Cascading Rollback
» NO-UNDO/REDO Recovery Based on Deferred Update
» Recovery Techniques Based on Immediate Update
» The ARIES Recovery Algorithm
» Recovery in Multidatabase Systems
» Introduction to Database Security Issues 1
» Discretionary Access Control Based on Granting and Revoking Privileges
» Mandatory Access Control and Role-Based Access Control for Multilevel Security
» Introduction to Statistical Database Security
» Introduction to Flow Control
» Encryption and Public Key Infrastructures
» Challenges of Database Security
» Distributed Database Concepts 1
» Types of Distributed Database Systems
» Distributed Database Architectures
» Data Replication and Allocation
» Example of Fragmentation, Allocation, and Replication
» Query Processing and Optimization in Distributed Databases
» Overview of Transaction Management in Distributed Databases
» Overview of Concurrency Control and Recovery in Distributed Databases
» Current Trends in Distributed Databases
» Distributed Databases in Oracle 13
» Generalized Model for Active Databases and Oracle Triggers
» Design and Implementation Issues for Active Databases
» Examples of Statement-Level Active Rules
» Time Representation, Calendars, and Time Dimensions
» Incorporating Time in Relational Databases Using Tuple Versioning
» Incorporating Time in Object-Oriented Databases Using Attribute Versioning
» Temporal Querying Constructs and the TSQL2 Language
» Spatial Database Concepts 24
» Multimedia Database Concepts
» Clausal Form and Horn Clauses
» Datalog Programs and Their Safety
» Evaluation of Nonrecursive Datalog Queries
» Introduction to Information Retrieval
» Types of Queries in IR Systems
» Evaluation Measures of Search Relevance
» Web Analysis and Its Relationship to Information Retrieval
» Analyzing the Link Structure of Web Pages
» Approaches to Web Content Analysis
» Trends in Information Retrieval
» Data Mining as a Part of the Knowledge
» Goals of Data Mining and Knowledge Discovery
» Types of Knowledge Discovered during Data Mining
» Market-Basket Model, Support, and Confidence
» Frequent-Pattern (FP) Tree and FP-Growth Algorithm
» Other Types of Association Rules
» Approaches to Other Data Mining Problems
» Commercial Data Mining Tools
» Data Modeling for Data Warehouses
» Difficulties of Implementing Data Warehouses
» Grouping, Aggregation, and Database Modification in QBE
Show more