COP 5725 Fall 2012 Database Management Systems

University of Florida, CISE Department _{Prof. Daisy Zhe Wang}

Relational Query Optimization Plan Equivalence Plan Enumeration

Cost Estimation

Overview of Query Optimization

(Review)

Query Plan Tree of R.A. ops, with choice of alg for each op.

Each operator typically implemented using a `pull’

– interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them.

Two main issues:

– For a given query, what plans are considered ? • Algorithm to search plan space for cheapest (estimated) plan.

How is the cost of a plan estimated ?

– Want to find best plan. Practically: Avoid

Ideally:

worst plans!

Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to old schema; added for

rname variations.

Reserves:

– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

Sailors:

– Each tuple is 50 bytes long, 80 tuples per page,

Translating SQL Queries into RA

Decomposing a query into blocks

– How a query is decomposed into blocks

Translate a query block as a RA Expression

– how the optimization of a single block can
be understood in terms of plans composed
of RA operators.

Query Blocks: Units of Optimization

In a typical query optimizer, an SQL _SELECT

S.sname query is parsed into a collection of _FROM Sailors S

, and these are

query blocks _WHERE _IN

S.age optimized one block at a time. _{SELECT MAX} ( (S2.age)

We focus mostly on optimizing a _FROM

Sailors S2

single query block and defer _{GROUP BY}

S2.rating )

discussion of nested query

Outer block Nested block optimization. Another Example SELECT _FROM S.sid, MIN (R.day) _WHERE Sailors S, Reserves R, Boats B

S.sid = R.sid AND R.bid = B.bid AND B.color

= ‘red’ _{SELECT MAX} AND S.rating = ( (S2.age) _FROM _{GROUP BY} Sailors S2) _HAVING S.sid

COUNT(*) >1

Outer block Nested block

Translate a query block to an RA Expression _SELECT _{S.sid, MIN (R.day)}

A single _FROM _{Sailors S, Reserves R, Boats B} query _WHERE _{S.sid = R.sid AND R.bid = B.bid} block: _{AND B.color}

_{GROUP BY}

_{AND S.rating = Reference to nested block}

_S.sid

HAVING _{COUNT(*) >1}

First step in optimizing a query block is to express it represent the semantics of an SQL query.

 _{S.sid, MIN (R.day)} _{GROUP BY S.sid (} _{HAVING COUNT(*) >1 (} ( S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’

 _{AND S.rating = Reference to nested block} ₍ Translate a query block to an RA Expression (Cont.)

Typical relational query optimizer recognize a query is essentially a SPJ RA expr., with the remaining opr. carried out on the result of the SPJ expr.
Move group by, having, aggregation as the last steps after the regular (SPJ) RA expression _×

_{S.sid, R.day}

₍

_{S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’}

_{Sailors × Reserves × Boats ))}

_{AND S.rating = Reference to nested block}

_{(<To make sure that group by and having can be carried out, the attributes mentioned are added to the projection of the SPJ expression}

Translate a query block to an RA Expression (Cont.)  _{S.sid, MIN (R.day)} _{GROUP BY S.sid (} _{HAVING COUNT(*) >1 (} (

 _{S.sid, R.day} ₍ 

_{S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’}

_{AND S.rating = Reference to nested block}

₍

The optimizer can find the best plan for the SPJ expression: plan enumeration + cost estimation
The plan is evaluated and resulting tuples are sorted/hashed for followed by the

group by application of the clause and the . having aggregation S.sid, R.day ₍ _{S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’} _{AND S.rating = Reference to nested block} ₍ Sailors × Reserves × Boats ))

Plan Enumeration using RA Equivalences

An optimizer enumerates plans by applying several equivalences between RA expressions
Allow us to choose different join orders and to `push’ selections and projections ahead of joins.

Selections

(Cascade) 



   

 (Commute)

R (S T) (T R) S

⁺ Show that:

(R S) (S R) 

(Associative) 

 

R (S T) (R S) T  

Joins: 

     _{a a an} ^R ^R ₁

₁

Relational Algebra Equivalences

Projections:  

(Commute) 

    _{c c c c} ^{R R} ₁ ₂ ₂ ₁ 

   

 _... . . .

     _{c cn c cn} ^{R R} ₁ _{1  }

:    

(Cascade) More Equivalences

A projection commutes with a selection that only uses attributes retained by the projection.
Selection between attributes of the two arguments of a cross-product converts cross-product to a join.
• A selection on just attributes of R commutes with

     

R S. (i.e., (R S) (R) S ) 

Similarly, if a projection follows a join R S, we
can `push’ it by retaining only attributes of R (and S)
that are needed for the join or are kept by the projection.

Enumeration of Alternative Plans

Two core problems in an Optimizer

– Cost Estimation – Plan Enumeration • algebraic equivalence

There are two main cases:

– –

Single-relation plans

Multiple-relation plans

Single-Relation Queries

For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops:

– Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen. The different operations are essentially carried out – together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are into the aggregate

pipelined computation). Cost Estimation

For each plan considered, must estimate cost:

– Must estimate cost of each operation in plan tree.

Depends on input cardinalities.
We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)

Must also estimate for each size of result

– operation in tree! • Use information about the input relations.

For selections and joins, assume independence of predicates.

Typically, no duplicate elimination on projections! (Exception: Done on answers if user

Cost Estimates for Single-Relation Plans

Cost for finding the first match using Index I:

– Height(I)+1 for a B+ tree , about 1.2 for hash index.

Clustered index I matching one or more selects:

– (NPages(I)+NPages (R)) * product of RF’s of matching selects .

Non-clustered index I matching one or more selects:

–
(NPages(I)+NTuples (R)) * product of RF’s of matching
selects .

Sequential scan of file:

– NPages(R) . _{+ Note:}

SELECT

S.sid

Example _FROM _WHERE Sailors S

S.rating=8

index on :

If we have an rating

– Size of result: (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved.

_cost

– (1/10) * (50+500) pages are retrieved. (This is the . )
– Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10) * (50+40000) pages are retrieved.

If we have an index on sid :

– index, the cost is 50+500 , with unclustered index, 50+40000 .

Would have to retrieve all tuples/pages. With a clustered

file scan :

Doing a

Example (II) – SQL

For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

SELECT S.rating, COUNT(*) FROM Sailors S WHERE S.rating > 5 AND S.age = 20 GROUP BY S.rating HAVING COUNT DISTINCT (S.same) > 2 Example (II) – RA Expression

For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

 _{GROUP BY S.rating (} _{HAVING COUNT DISTINCT (S.sname) > 2 (} (  _{S.rating, S.sname} ₍ _{S.rating = > 5 AND S.age = 20}  _{Sailors )))))} ₍ Example (II) – Plans without Indexes

For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

1. File scan + project + select (500 I/O) Π(RF _i

3. External sort for GROUP BY (2 passes, 60 I/O, not counting the output)

4. On-the-fly HAVING and aggregation (0 I/O)

Plans with Indexes

Single-Index Access Path

– Choose one access path with fewest I/Os

Multiple-Index Access Path

– Intersection

Sorted Index Access Path

– Grouping attributes is a prefix of a tree index
– Works well for clustered index

Index-Only Access Path

– All attributes mentioned in query are included in the search key for some index

Example (II) – Available Indexes

B+ tree index on rating
Hash index on age
B+ tree index on <rating, sname, age>
Alternative (2) for all indexes Exercise: calculate the cost for each alternative plan based on the cost, result size of simple operations (e.g., scan, index lookup) that we have discussed

Example (II) – Plans with Indexes (Cont.)

 _{S.rating, COUNT (*)} _{B+ tree index on rating} _{GROUP BY S.rating (} _{HAVING COUNT DISTINCT (S.sname) > 2 (} ( Hash index on age _{B+ tree index on <rating, S.rating, S.sname}  ₍ _{sname, age>} _{S.rating = > 5 AND S.age = 20}  _{Sailors )))))} ₍

Single-Index Access Path

– Use hash index on age

Multiple-Index Access Path

– Use hash index on age + tree index on rating
– Intersect rid’s and sort by page number
– retrieving Sailor tuples

Example (II) – Plans with Indexes

Sorted Index Access Path

– Retrieve Sailor tuples with rating>5, ordered
by rating using B+ tree index on rating (clustered vs. unclustered)
– The rest are all applied on-the-fly

Index-Only Access Path

– Retrieve data entries from <rating, sname, age> index with rating >5
– Sorted entries to applied the rest of the operators

^{B+ tree index on rating}

^{Hash index on age}

^{B+ tree index on <rating,}

^{sname, age>}

Queries Over Multiple Relations

Fundamental decision in System R: only left-deep are considered.

join trees

As the number of joins increases, the number of alternative

– plans grows rapidly; we need to restrict the search space.
– Left-deep trees allow us to generate all

fully pipelined plans .

Intermediate results not written to temporary files.
Not all left-deep trees are fully pipelined (e.g., SM join). _C _D _C

Enumeration of Left-Deep Plans

Left-deep plans differ only in (1) the order of relations, (2) the access method for each relation, and (3) the join method for each join.
Enumerated using N passes (if N relations joined): Pass 1: Find best 1-relation plan for each relation.

– –

Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation.

(All 2-relation plans.) Pass N: Find best way to join result of a (N-1)-relation plan

– (as outer) to the N’th relation.

(All N-relation plans.)

For each subset of relations, retain only:

– Cheapest plan overall, plus
– Cheapest plan for *each* of the tuples.

More details on Pass 1

Each single-relation plan is a partial left- deep plan
For each relation A in FROM list

– Identify selection terms mention only attributes of A – Identify attributes of A not mentioned in
SELECT or WHERE conditions that involves
attributes of other relations
– Choose best access method for A to carry
out these selection and project as in single-
relation query optimization

More details on interesting order

We retain the cheapest plan for each different ordering of results tuples
An ordering of tuples can be useful at a subsequent step (e.g., sort-merge, GROUP BY, ORDER BY)

More details on Pass 2 … N

For each inner (B) and outer (A) pair

– Identify selections that involve only attributes of B – Identify selections that defines the join between A and B – These selections need to be considered when choosing an access path for B – Identify attributes of B not mentioned in
SELECT or WHERE conditions that involves
attributes of other relations

More details on Pass 2 … N (Cont.)

– Only selection applied before the join are those that match the chosen access path
– remaining selections and projections are done on-the-fly as part of the join
– Tuples from the outer plan are pipelined into the join

Materialization required for sort-merge (if result not already sorted), hash-join

– For each join method, we must determine the best access method for B, similar to single-relation query optimization

Enumeration of Plans (Contd.) etc. handled as

ORDER BY, GROUP BY, aggregates

a final step, using either an `interestingly ordered’

An N-1 way plan is not combined with an additional relation unless there is a join condition between them, unless all predicates in have been used up.

i.e., avoid Cartesian products if possible.

– still

In spite of pruning plan space, this approach is exponential in the # of tables.

Cost Estimation for Multirelation Plans SELECT _FROM attribute list _{WHERE AND AND} relation list term1 ... termk

Consider a query block:
Maximum # tuples in result is the product of the _FROM cardinalities of relations in the clause.
associated with each

Reduction factor (RF) term reflects the impact of the in reducing result size.

Result cardinality = Max # tuples * product of all RF’s.

Multirelation plans are built up by joining one new relation at a time.

– Cost of join method, plus estimation of join cardinality gives

_{B+ tree on rating} Sailors: _sname

Example _Reserves: _{Hash on sid} B+ tree on bid

Pass1:

_{sid=sid <
– : B+ tree matches ,}

Sailors rating>5

and is probably cheapest. However, _{bid=100 rating > 5} if this selection is expected to retrieve a lot of tuples, and index is _{Reserves Sailors} unclustered, file scan may be cheaper.

Still, B+ tree plan kept (because tuples are in rating order).

– cheapest.

: B+ tree on matches ; Reserves bid bid=500

_{B+ tree on rating} Sailors: _sname

Example _Reserves: _{Hash on sid} B+ tree on bid

Pass2:

_{sid=sid <
– We consider each plan retained from Pass 1 as the outer, and consider how _{bid=100 rating > 5} to join it with the (only) other relation.
– Consideration 1: all available Join Methods: nested loop join, index NLJ, _{Reserves Sailors} sort-merge, special order useful?
– Consideration 2: Access Method to Sailors with Reserves as outer: Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. What if Sailors relation is the outer?
– –}

An example of three-way join query

Nested Queries without Correlations

The nested subquery can be evaluated just once _SELECT

S.sname

If only yield a single value, the _FROM _{WHERE S.rating =}

Sailors S value is incorporated into the _{SELECT MAX(S2.rating)} ( top-level query _{FROM Sailors S2}

If subquery returns a relation, practically a join

_{SELECT S.sname<In principle, different join _{FROM Sailors S} method can be used (e.g., index _{(SELECT R.sid} _{WHERE S.sid IN} nested loops join with Sailor as _{FROM Reserves R} the inner)
_{WHERE R.bid = 103 <In practice, nested loops join}}

Nested Queries with Correlation

S.sname _FROM Sailors S _{WHERE EXISTS}

( ^SELECT

* _FROM Reserves R _WHERE R.bid=103 _AND R.sid=S.sid) _{Nested block to optimize:}

Evaluation strategy: nested loop join with subquery as the inner

– Problem 1: subquery is evaluated once _{per outer tuple, same subquery can be} _{evaluated many times}
– Problem 2: precludes alternative join _{methods (e.g., sort-marge, hash join)}

SELECT

_FROM

Reserves R _WHERE R.bid=103 _AND

Nested subquery fully evaluated as a first step may lead to inefficiency
Implicit ordering of these blocks means that some good strategies are not considered.

S.sid= outer value _{Equivalent non-nested query:}

SELECT

S.sname _FROM Sailors S, Reserves R _WHERE

The non-nested version of the ^SELECT

S.sid=R.sid

Highlights of System R Optimizer

Impact:

– Most widely used currently; works well for < 10 joins.

Cost estimation: Approximate art at best.

– Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.
– • Plan Space: Too large, must be pruned.
– Only the space of

left-deep plans is considered.

• Left-deep plans allow output of each operator to be pipelined
^into
_{the next operator without storing it in a temporary relation.}

– Cartesian products avoided.

Summary

• Query optimization is an important task in a relational
DBMS.
Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:

– Consider a set of alternative plans.

Must prune search space; typically, left-deep plans only.

– Must estimate cost of each plan that is considered.

Key issues : Statistics, indexes, operator implementations.

Summary (Contd.)

Single-relation queries:

– All access paths considered, cheapest is chosen.
– has all needed fields and/or provides tuples in a desired order.

: Selections that index, whether index key

Issues match

Multiple-relation queries:

– All single-relation plans are first enumerated.

Selections/projections considered as early as possible.

Next, for each 1-relation plan, all ways of joining another

– relation (as inner) are considered. Next, for each 2- relation plan that is `retained’, all ways of
–

COP 5725 Fall 2012 Database Management Systems

Dokumen yang terkait

Relational Database Management System

Relational Database Management Systems for Epidemiologists: SQL Part I

Concurrency Control and Recovery in Database Systems pdf pdf

Concepts of Database Management 8th Edition pdf pdf

Oracle Nosql Database Management Enterprise 5721 pdf pdf

Database Reliability Engineering Designing and Operating Resilient Database Systems pdf pdf

Strategic Management of Information Systems in Healthcare

COP 5725 Fall 2012 Database Management Systems

COP 5725 Fall 2012 Database Management Systems

COP 5725 Fall 2012 Database Management Systems

Dukungan

Links

COP 5725 Fall 2012 Database Management Systems

Dokumen yang terkait

Relational Database Management System

Relational Database Management Systems for Epidemiologists: SQL Part I

Concurrency Control and Recovery in Database Systems pdf pdf

Concepts of Database Management 8th Edition pdf pdf

Oracle Nosql Database Management Enterprise 5721 pdf pdf

Database Reliability Engineering Designing and Operating Resilient Database Systems pdf pdf

Strategic Management of Information Systems in Healthcare

COP 5725 Fall 2012 Database Management Systems

COP 5725 Fall 2012 Database Management Systems

COP 5725 Fall 2012 Database Management Systems

Dokumen yang Anda mencari sudah siap untuk unduhkan