(A, r ), the number of distinct values that appear in the relation r for attribute
• V (A, r ), the number of distinct values that appear in the relation r for attribute
A A (r ). If Ais a key for relation r , V(A, r ) is n r .
The last statistic, V(A, r ), can also be maintained for sets of attributes, if desired, instead of just for individual attributes. Thus, given a set of attributes, A, V(A, r )
A (r ). If we assume that the tuples of relation r are stored together physically in a file, the following equation holds:
Statistics about indices, such as the heights of B + -tree indices and number of leaf pages in the indices, are also maintained in the catalog.
If we wish to maintain accurate statistics, then, every time a relation is modi- fied, we must also update the statistics. This update incurs a substantial amount of overhead. Therefore, most systems do not update the statistics on every mod- ification. Instead, they update the statistics during periods of light system load. As a result, the statistics used for choosing a query-processing strategy may not
be completely accurate. However, if not too many updates occur in the intervals between the updates of the statistics, the statistics will be sufficiently accurate to provide a good estimation of the relative costs of the different plans.
The statistical information noted here is simplified. Real-world optimizers often maintain further statistical information to improve the accuracy of their cost estimates of evaluation plans. For instance, most databases store the distribution of values for each attribute as a histogram : in a histogram the values for the attribute are divided into a number of ranges, and with each range the histogram associates the number of tuples whose attribute value lies in that range. Figure 13.6 shows an example of a histogram for an integer-valued attribute that takes values in the range 1 to 25.
Histograms used in database systems usually record the number of distinct values in each range, in addition to the number of tuples with attribute values in that range.
As an example of a histogram, the range of values for an attribute age of a re- lation person could be divided into 0–9, 10–19, . . . , 90–99 (assuming a maximum age of 99). With each range we store a count of the number of person tuples whose age values lie in that range, and the number of distinct age values that lie in that
592 Chapter 13 Query Optimization
Figure 13.6 Example of histogram.
range. Without such histogram information, an optimizer would have to assume that the distribution of values is uniform; that is, each range has the same count.
A histogram takes up only a little space, so histograms on several different at- tributes can be stored in the system catalog. There are several types of histograms used in database systems. For example, an equi-width histogram divides the range of values into equal-sized ranges, whereas an equi-depth histogram ad- justs the boundaries of the ranges such that each range has the same number of values.
13.3.2 Selection Size Estimation
The size estimate of the result of a selection operation depends on the selection predicate. We first consider a single equality predicate, then a single comparison predicate, and finally combinations of predicates.
Parts
» Indian Institute of Technology, Bombay
» Data Mining and Information Retrieval
» Structure of Relational Databases
» Database Schema When we talk about a database, we must differentiate between the database
» Basic Structure of SQL Queries
» Modification of the Database
» • Embedded SQL : Like dynamic SQL , embedded SQL provides a means by
» Advanced Aggregation Features**
» The Cartesian-Product Operation
» The Tuple Relational Calculus
» The Entity-Relationship Model
» • For an n-ary relationship set with an arrow on one of its edges, the primary
» Entity-Relationship Design Issues
» Representation of Generalization
» Alternative Notations for Modeling Data
» Other Aspects of Database Design
» Features of Good Relational Designs
» Atomic Domains and First Normal Form
» Decomposition Using Functional Dependencies
» BCNF Decomposition Algorithm
» Decomposition Using Multivalued Dependencies
» Application Programs and User Interfaces
» Overview of Physical Storage Media
» Magnetic Disk and Flash Storage
» Organization of Records in Files
» Comparison of Ordered Indexing and Hashing
» Implementation of Pipelining
» Evaluation Algorithms for Pipelining
» Transformation of Relational Expressions
» (A, r ), the number of distinct values that appear in the relation r for attribute
» Advanced Topics in Query Optimization**
» Transaction Atomicity and Durability
» Transaction Isolation and Atomicity
» Implementation of Isolation Levels
» Transactions as SQL Statements
» Weak Levels of Consistency in Practice
» Concurrency in Index Structures**
» Failure with Loss of Nonvolatile Storage
» Early Lock Release and Logical Undo Operations
» Centralized and Client – Server Architectures
» Parallelism on Multicore Processors
» Recovery and Concurrency Control
» Distributed Query Processing
» Heterogeneous Distributed Databases
» Partitioning and Retrieving Data
» Transactions and Replication
» Decision-Tree Construction Algorithm
» Relevance Ranking Using Terms
» Synonyms, Homonyms, and Ontologies
» Crawling and Indexing the Web
» Information Retrieval: Beyond Ranking of Pages
» Structured Types and Inheritance in SQL
» Array and Multiset Types in SQL
» Application Program Interfaces to XML
» Native Storage within a Relational Database
» Other Issues in Application Development
» Representation of Geographic Data
» Transaction-Processing Monitors
» Real-Time Transaction Systems
» PostgreSQL Implementation of MVCC
» Database Design and Querying Tools
» Database Administration Tools
» Business Intelligence Features
Show more