COP 5725 Fall 2012 Database Management Systems
COP 5725 Fall 2012 Database Management Systems
University of Florida, CISE Department Prof. Daisy Zhe Wang
Relational Query Optimization Plan Equivalence Plan Enumeration
Cost Estimation
Overview of Query Optimization
(Review)- :
Query Plan Tree of R.A. ops, with choice of alg for each op.
Each operator typically implemented using a `pull’
- – interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them.
- Two main issues:
- – For a given query, what plans are considered ? • Algorithm to search plan space for cheapest (estimated) plan.
How is the cost of a plan estimated ?
- – Want to find best plan. Practically: Avoid
- Ideally:
worst plans!
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to old schema; added for
rname variations.
- Reserves:
- – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors:
- – Each tuple is 50 bytes long, 80 tuples per page,
Translating SQL Queries into RA
- Decomposing a query into blocks
- – How a query is decomposed into blocks
- Translate a query block as a RA Expression
- – how the optimization of a single block can
be understood in terms of plans composed
of RA operators.
Query Blocks: Units of Optimization
- In a typical query optimizer, an SQL SELECT
S.sname query is parsed into a collection of FROM Sailors S
, and these are
query blocks WHERE IN
S.age optimized one block at a time. SELECT MAX ( (S2.age)
- We focus mostly on optimizing a FROM
Sailors S2
single query block and defer GROUP BY
S2.rating )
discussion of nested query
Outer block Nested block optimization. Another Example SELECT FROM S.sid, MIN (R.day) WHERE Sailors S, Reserves R, Boats B
S.sid = R.sid AND R.bid = B.bid AND B.color
= ‘red’ SELECT MAX AND S.rating = ( (S2.age) FROM GROUP BY Sailors S2) HAVING S.sid
COUNT(*) >1
Outer block Nested block
Translate a query block to an RA Expression SELECT S.sid, MIN (R.day)
- A single FROM Sailors S, Reserves R, Boats B query WHERE S.sid = R.sid AND R.bid = B.bid block: AND B.color GROUP BY AND S.rating = Reference to nested block S.sid = ‘red’
- First step in optimizing a query block is to express it represent the semantics of an SQL query.
- Typical relational query optimizer recognize a query is essentially a SPJ RA expr., with the remaining opr. carried out on the result of the SPJ expr.
- Move group by, having, aggregation as the last steps after the regular (SPJ) RA expression × S.sid, R.day ( S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’ Sailors × Reserves × Boats )) AND S.rating = Reference to nested block (<
- To make sure that group by and having can be carried out, the attributes mentioned are added to the projection of the SPJ expression
- The optimizer can find the best plan for the SPJ expression: plan enumeration + cost estimation
- The plan is evaluated and resulting tuples are sorted/hashed for followed by the
- An optimizer enumerates plans by applying several equivalences between RA expressions
- Allow us to choose different join orders and to `push’ selections and projections ahead of joins.
- Selections
- + Show that:
- A projection commutes with a selection that only uses attributes retained by the projection.
- Selection between attributes of the two arguments of a cross-product converts cross-product to a join.
• A selection on just attributes of R commutes with
- Similarly, if a projection follows a join R S, we
can `push’ it by retaining only attributes of R (and S)
that are needed for the join or are kept by the projection. - Two core problems in an Optimizer
- – Cost Estimation – Plan Enumeration • algebraic equivalence
- There are two main cases:
- – –
- For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops:
- – Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen. The different operations are essentially carried out – together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are into the aggregate
- For each plan considered, must estimate cost:
- – Must estimate cost of each operation in plan tree.
- Depends on input cardinalities.
- We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)
- – operation in tree! • Use information about the input relations.
- For selections and joins, assume independence of predicates.
- Cost for finding the first match using Index I:
- – Height(I)+1 for a B+ tree , about 1.2 for hash index.
- Clustered index I matching one or more selects:
- – (NPages(I)+NPages (R)) * product of RF’s of matching selects .
- Non-clustered index I matching one or more selects:
- –
(NPages(I)+NTuples (R)) * product of RF’s of matching
selects . - Sequential scan of file:
- – NPages(R) . + Note:
- If we have an rating
- – Size of result: (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved. cost Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(
- – (1/10) * (50+500) pages are retrieved. (This is the . )
- – Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10) * (50+40000) pages are retrieved.
- If we have an index on sid :
- – index, the cost is 50+500 , with unclustered index, 50+40000 .
- Doing a
- Single-Index Access Path
- – Choose one access path with fewest I/Os
- Multiple-Index Access Path
- – Intersection
- Sorted Index Access Path
– Grouping attributes is a prefix of a tree index
- – Works well for clustered index
- Index-Only Access Path
- – All attributes mentioned in query are included in the search key for some index
- B+ tree index on rating
- Hash index on age
- B+ tree index on <rating, sname, age>
- Alternative (2) for all indexes Exercise: calculate the cost for each alternative plan based on the cost, result size of simple operations (e.g., scan, index lookup) that we have discussed
- Single-Index Access Path
- – Use hash index on age
- Multiple-Index Access Path
- – Use hash index on age + tree index on rating
- – Intersect rid’s and sort by page number
- – retrieving Sailor tuples
- Sorted Index Access Path
– Retrieve Sailor tuples with rating>5, ordered
by rating using B+ tree index on rating (clustered vs. unclustered)- – The rest are all applied on-the-fly
- Index-Only Access Path
- – Retrieve data entries from <rating, sname, age> index with rating >5
- – Sorted entries to applied the rest of the operators B+ tree index on rating Hash index on age B+ tree index on <rating, sname, age>
- Fundamental decision in System R: only left-deep are considered.
- – plans grows rapidly; we need to restrict the search space.
- – Left-deep trees allow us to generate all
- Intermediate results not written to temporary files.
- Not all left-deep trees are fully pipelined (e.g., SM join). C D C
- Left-deep plans differ only in (1) the order of relations, (2) the access method for each relation, and (3) the join method for each join.
- Enumerated using N passes (if N relations joined): Pass 1: Find best 1-relation plan for each relation.
- – –
- – (as outer) to the N’th relation.
- For each subset of relations, retain only:
- – Cheapest plan overall, plus
- – Cheapest plan for *each* of the tuples.
- Each single-relation plan is a partial left- deep plan
- For each relation A in FROM list
- – Identify selection terms mention only attributes of A – Identify attributes of A not mentioned in
SELECT or WHERE conditions that involves
attributes of other relations - – Choose best access method for A to carry
out these selection and project as in single-
relation query optimization - We retain the cheapest plan for each different ordering of results tuples
- An ordering of tuples can be useful at a subsequent step (e.g., sort-merge, GROUP BY, ORDER BY)
- For each inner (B) and outer (A) pair
- – Identify selections that involve only attributes of B – Identify selections that defines the join between A and B – These selections need to be considered when choosing an access path for B – Identify attributes of B not mentioned in
SELECT or WHERE conditions that involves
attributes of other relations - – Only selection applied before the join are those that match the chosen access path
- – remaining selections and projections are done on-the-fly as part of the join
- – Tuples from the outer plan are pipelined into the join
- Materialization required for sort-merge (if result not already sorted), hash-join
- – For each join method, we must determine the best access method for B, similar to single-relation query optimization
- ORDER BY, GROUP BY, aggregates
- An N-1 way plan is not combined with an additional relation unless there is a join condition between them, unless all predicates in have been used up.
- – still
- In spite of pruning plan space, this approach is exponential in the # of tables.
- Consider a query block:
- Maximum # tuples in result is the product of the FROM cardinalities of relations in the clause.
- associated with each
- Multirelation plans are built up by joining one new relation at a time.
- – Cost of join method, plus estimation of join cardinality gives
- Pass1: sid=sid <
- – : B+ tree matches ,
- Still, B+ tree plan kept (because tuples are in rating order).
- – cheapest.
- Pass2: sid=sid <
- – We consider each plan retained from Pass 1 as the outer, and consider how bid=100 rating > 5 to join it with the (only) other relation.
- – Consideration 1: all available Join Methods: nested loop join, index NLJ, Reserves Sailors sort-merge, special order useful?
- – Consideration 2: Access Method to Sailors with Reserves as outer: Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. What if Sailors relation is the outer?
- – –
- The nested subquery can be evaluated just once SELECT
- If only yield a single value, the FROM WHERE S.rating =
- If subquery returns a relation, practically a join SELECT S.sname<
- In principle, different join FROM Sailors S method can be used (e.g., index (SELECT R.sid WHERE S.sid IN nested loops join with Sailor as FROM Reserves R the inner) WHERE R.bid = 103 <
- In practice, nested loops join
- * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Nested block to optimize:
- Evaluation strategy: nested loop join with subquery as the inner
- – Problem 1: subquery is evaluated once per outer tuple, same subquery can be evaluated many times
- – Problem 2: precludes alternative join methods (e.g., sort-marge, hash join)
- FROM
- Nested subquery fully evaluated as a first step may lead to inefficiency
- Implicit ordering of these blocks means that some good strategies are not considered.
- The non-nested version of the SELECT
- Impact:
- – Most widely used currently; works well for < 10 joins.
- Cost estimation: Approximate art at best.
- – Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.
- – • Plan Space: Too large, must be pruned.
- – Only the space of
• Left-deep plans allow output of each operator to be pipelined
intothe next operator without storing it in a temporary relation.
- – Cartesian products avoided.
• Query optimization is an important task in a relational
DBMS.- Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
- – Consider a set of alternative plans.
- Must prune search space; typically, left-deep plans only.
- – Must estimate cost of each plan that is considered.
- Key issues : Statistics, indexes, operator implementations.
- Single-relation queries:
- – All access paths considered, cheapest is chosen.
- – has all needed fields and/or provides tuples in a desired order.
- Multiple-relation queries:
- – All single-relation plans are first enumerated.
- Selections/projections considered as early as possible.
- – relation (as inner) are considered. Next, for each 2- relation plan that is `retained’, all ways of >– joining another relation (as inner) are considered,
- –
HAVING COUNT(*) >1
S.sid, MIN (R.day) GROUP BY S.sid ( HAVING COUNT(*) >1 ( ( S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’
AND S.rating = Reference to nested block ( Translate a query block to an RA Expression (Cont.)
Translate a query block to an RA Expression (Cont.) S.sid, MIN (R.day) GROUP BY S.sid ( HAVING COUNT(*) >1 ( (
S.sid, R.day (
S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’
AND S.rating = Reference to nested block ( Sailors × Reserves × Boats )))))group by application of the clause and the . having aggregation S.sid, R.day ( S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’ AND S.rating = Reference to nested block ( Sailors × Reserves × Boats ))
Plan Enumeration using RA Equivalences
(Cascade)
(Commute)
R (S T) (T R) S
(R S) (S R)
(Associative)
R (S T) (R S) T
Joins:
a a an R R 1
1
. . .
Relational Algebra Equivalences
Projections:
(Commute)
c c c c R R 1 2 2 1
... . . .
c cn c cn R R 1 1
:
(Cascade) More Equivalences
R S. (i.e., (R S) (R) S )
Enumeration of Alternative Plans
Single-relation plans
Multiple-relation plans
Single-Relation Queries
pipelined computation). Cost Estimation
Must also estimate for each size of result
Typically, no duplicate elimination on projections! (Exception: Done on answers if user
Cost Estimates for Single-Relation Plans
SELECT
S.sid
Example FROM WHERE Sailors S
S.rating=8
index on :
Would have to retrieve all tuples/pages. With a clustered
file scan :
Example (II) – SQL
For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.
SELECT S.rating, COUNT(*) FROM Sailors S WHERE S.rating > 5 AND S.age = 20 GROUP BY S.rating HAVING COUNT DISTINCT (S.same) > 2 Example (II) – RA Expression
For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.
GROUP BY S.rating ( HAVING COUNT DISTINCT (S.sname) > 2 ( ( S.rating, S.sname ( S.rating = > 5 AND S.age = 20 Sailors ))))) ( Example (II) – Plans without Indexes
For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.
1. File scan + project + select (500 I/O) Π(RF i
3. External sort for GROUP BY (2 passes, 60 I/O, not counting the output)
4. On-the-fly HAVING and aggregation (0 I/O)
Plans with Indexes
Example (II) – Available Indexes
Example (II) – Plans with Indexes (Cont.)
S.rating, COUNT (*) B+ tree index on rating GROUP BY S.rating ( HAVING COUNT DISTINCT (S.sname) > 2 ( ( Hash index on age B+ tree index on <rating, S.rating, S.sname ( sname, age> S.rating = > 5 AND S.age = 20 Sailors ))))) (
Example (II) – Plans with Indexes
(Cont.)
Queries Over Multiple Relations
join trees
As the number of joins increases, the number of alternative
fully pipelined plans .
D
Enumeration of Left-Deep Plans
Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation.
(All 2-relation plans.) Pass N: Find best way to join result of a (N-1)-relation plan
(All N-relation plans.)
More details on Pass 1
More details on interesting order
More details on Pass 2 … N
More details on Pass 2 … N (Cont.)
Enumeration of Plans (Contd.) etc. handled as
a final step, using either an `interestingly ordered’
plan or an additional sorting operator.i.e., avoid Cartesian products if possible.
Cost Estimation for Multirelation Plans SELECT FROM attribute list WHERE AND AND relation list term1 ... termk
Reduction factor (RF) term reflects the impact of the in reducing result size.
Result cardinality = Max # tuples * product of all RF’s.
B+ tree on rating Sailors: sname
Example Reserves: Hash on sid B+ tree on bid
Sailors rating>5
and is probably cheapest. However, bid=100 rating > 5 if this selection is expected to retrieve a lot of tuples, and index is Reserves Sailors unclustered, file scan may be cheaper.
: B+ tree on matches ; Reserves bid bid=500
B+ tree on rating Sailors: sname
Example Reserves: Hash on sid B+ tree on bid
An example of three-way join query
Nested Queries without Correlations
S.sname
Sailors S value is incorporated into the SELECT MAX(S2.rating) ( top-level query FROM Sailors S2
Nested Queries with Correlation
S.sname FROM Sailors S WHERE EXISTS
( SELECT
SELECT
Reserves R WHERE R.bid=103 AND
S.sid= outer value Equivalent non-nested query:
SELECT
S.sname FROM Sailors S, Reserves R WHERE
S.sid=R.sid
Highlights of System R Optimizer
(Review)left-deep plans is considered.
Summary
Summary (Contd.)
: Selections that index, whether index key
Issues match
Next, for each 1-relation plan, all ways of joining another