COP 5725 Fall 2012 Database Management Systems

  COP 5725 Fall 2012 Database Management Systems

  University of Florida, CISE Department Prof. Daisy Zhe Wang

  Relational Query Optimization Plan Equivalence Plan Enumeration

  Cost Estimation

  

Overview of Query Optimization

(Review)

  • :

  Query Plan Tree of R.A. ops, with choice of alg for each op.

  Each operator typically implemented using a `pull’

  • – interface: when an operator is `pulled’ for the next output tuples, it `pulls’ on its inputs and computes them.
    • Two main issues:

  • – For a given query, what plans are considered ? • Algorithm to search plan space for cheapest (estimated) plan.

  How is the cost of a plan estimated ?

  • – Want to find best plan. Practically: Avoid
    • Ideally:

  worst plans!

  Schema for Examples

  Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)

  • Similar to old schema; added for

  rname variations.

  • Reserves:
    • – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

  • Sailors:
    • – Each tuple is 50 bytes long, 80 tuples per page,

  Translating SQL Queries into RA

  • Decomposing a query into blocks
    • – How a query is decomposed into blocks

  • Translate a query block as a RA Expression
    • – how the optimization of a single block can

      be understood in terms of plans composed

      of RA operators.

  Query Blocks: Units of Optimization

  • In a typical query optimizer, an SQL SELECT

  S.sname query is parsed into a collection of FROM Sailors S

  , and these are

  query blocks WHERE IN

  S.age optimized one block at a time. SELECT MAX ( (S2.age)

  • We focus mostly on optimizing a FROM

   Sailors S2

  single query block and defer GROUP BY

  S2.rating )

  discussion of nested query

  Outer block Nested block optimization. Another Example SELECT FROM S.sid, MIN (R.day) WHERE Sailors S, Reserves R, Boats B

  S.sid = R.sid AND R.bid = B.bid AND B.color

  = ‘red’ SELECT MAX AND S.rating = ( (S2.age) FROM GROUP BY Sailors S2) HAVING S.sid

  COUNT(*) >1

  Outer block Nested block

  Translate a query block to an RA Expression SELECT S.sid, MIN (R.day)

  • A single FROM Sailors S, Reserves R, Boats B query WHERE S.sid = R.sid AND R.bid = B.bid block: AND B.color
  • GROUP BY AND S.rating = Reference to nested block S.sid = ‘red’

      HAVING COUNT(*) >1

    • First step in optimizing a query block is to express it represent the semantics of an SQL query.

       S.sid, MIN (R.day) GROUP BY S.sid ( HAVING COUNT(*) >1 ( ( S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’

       AND S.rating = Reference to nested block ( Translate a query block to an RA Expression (Cont.)

    • Typical relational query optimizer recognize a query is essentially a SPJ RA expr., with the remaining opr. carried out on the result of the SPJ expr.
    • Move group by, having, aggregation as the last steps after the regular (SPJ) RA expression ×
    • S.sid, R.day (    S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’Sailors × Reserves × Boats )) AND S.rating = Reference to nested block (<
    • To make sure that group by and having can be carried out, the attributes mentioned are added to the projection of the SPJ expression

      Translate a query block to an RA Expression (Cont.)  S.sid, MIN (R.day) GROUP BY S.sid ( HAVING COUNT(*) &gt;1 ( (

       S.sid, R.day (

    S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’

    AND S.rating = Reference to nested block ( Sailors × Reserves × Boats )))))

    • The optimizer can find the best plan for the SPJ expression: plan enumeration + cost estimation
    • The plan is evaluated and resulting tuples are sorted/hashed for followed by the

      group by application of the clause and the . having aggregation S.sid, R.day ( S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’ AND S.rating = Reference to nested block ( Sailors × Reserves × Boats ))

      Plan Enumeration using RA Equivalences

    • An optimizer enumerates plans by applying several equivalences between RA expressions
    • Allow us to choose different join orders and to `push’ selections and projections ahead of joins.

    • Selections

      (Cascade)

      

       

       (Commute)

    R (S T) (T R) S

    • + Show that:

      (R S) (S R) 

      (Associative) 

       

      R (S T) (R S) T  

      Joins: 

           a a an R R 1

    1

     . . .

      

    Relational Algebra Equivalences

      Projections:  

      (Commute) 

          c c c c R R 1 2 2 1

         

         

       ... . . .

           c cn c cn R R 1 1  

      :    

      (Cascade) More Equivalences

    • A projection commutes with a selection that only uses attributes retained by the projection.
    • Selection between attributes of the two arguments of a cross-product converts cross-product to a join.
    • • A selection on just attributes of R commutes with

           

      R S. (i.e., (R S) (R) S ) 

    • Similarly, if a projection follows a join R S, we

      can `push’ it by retaining only attributes of R (and S)

      that are needed for the join or are kept by the projection.

      Enumeration of Alternative Plans

    • Two core problems in an Optimizer
      • – Cost Estimation – Plan Enumeration • algebraic equivalence

      >algorithms for opera
    • There are two main cases:
      • – –

      Single-relation plans

      Multiple-relation plans

      Single-Relation Queries

    • For queries over a single relation, queries consist of a combination of selects, projects, and aggregate ops:
      • – Each available access path (file scan / index) is considered, and the one with the least estimated cost is chosen. The different operations are essentially carried out – together (e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are into the aggregate

      pipelined computation). Cost Estimation

    • For each plan considered, must estimate cost:
      • – Must estimate cost of each operation in plan tree.

    • Depends on input cardinalities.
    • We’ve already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)

      Must also estimate for each size of result

    • – operation in tree! • Use information about the input relations.
      • For selections and joins, assume independence of predicates.

      Typically, no duplicate elimination on projections! (Exception: Done on answers if user

      Cost Estimates for Single-Relation Plans

    • Cost for finding the first match using Index I:
      • – Height(I)+1 for a B+ tree , about 1.2 for hash index.

    • Clustered index I matching one or more selects:
      • – (NPages(I)+NPages (R)) * product of RF’s of matching selects .

    • Non-clustered index I matching one or more selects:
      • (NPages(I)+NTuples (R)) * product of RF’s of matching

        selects .

    • Sequential scan of file:
      • – NPages(R) . + Note:

      SELECT

      S.sid

      Example FROM WHERE Sailors S

      S.rating=8

      index on :

    • If we have an rating
      • – Size of result: (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved.
      • cost Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(
      • – (1/10) * (50+500) pages are retrieved. (This is the . )
      • – Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R)) = (1/10) * (50+40000) pages are retrieved.

    • If we have an index on sid :
      • – index, the cost is 50+500 , with unclustered index, 50+40000 .

      Would have to retrieve all tuples/pages. With a clustered

      file scan :

    • Doing a

      Example (II) – SQL

      For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

      SELECT S.rating, COUNT(*) FROM Sailors S WHERE S.rating &gt; 5 AND S.age = 20 GROUP BY S.rating HAVING COUNT DISTINCT (S.same) &gt; 2 Example (II) – RA Expression

      For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

       GROUP BY S.rating ( HAVING COUNT DISTINCT (S.sname) &gt; 2 ( (  S.rating, S.sname ( S.rating = &gt; 5 AND S.age = 20 Sailors ))))) ( Example (II) – Plans without Indexes

      For each rating greater than 5, print the rating and the number of 20-year-old sailors with that rating, provided that there are at least two such sailors with different names.

      1. File scan + project + select (500 I/O) Π(RF i

      3. External sort for GROUP BY (2 passes, 60 I/O, not counting the output)

      4. On-the-fly HAVING and aggregation (0 I/O)

      Plans with Indexes

    • Single-Index Access Path
      • – Choose one access path with fewest I/Os

    • Multiple-Index Access Path
      • – Intersection

    • Sorted Index Access Path
      • – Grouping attributes is a prefix of a tree index

      • – Works well for clustered index

    • Index-Only Access Path
      • – All attributes mentioned in query are included in the search key for some index

      Example (II) – Available Indexes

    • B+ tree index on rating
    • Hash index on age
    • B+ tree index on &lt;rating, sname, age&gt;
    • Alternative (2) for all indexes Exercise: calculate the cost for each alternative plan based on the cost, result size of simple operations (e.g., scan, index lookup) that we have discussed

      Example (II) – Plans with Indexes (Cont.)

       S.rating, COUNT (*) B+ tree index on rating GROUP BY S.rating ( HAVING COUNT DISTINCT (S.sname) &gt; 2 ( ( Hash index on age B+ tree index on &lt;rating, S.rating, S.sname( sname, age&gt; S.rating = &gt; 5 AND S.age = 20 Sailors ))))) (

    • Single-Index Access Path
      • – Use hash index on age

    • Multiple-Index Access Path
      • – Use hash index on age + tree index on rating
      • – Intersect rid’s and sort by page number
      • – retrieving Sailor tuples

      

    Example (II) – Plans with Indexes

    (Cont.)

    • Sorted Index Access Path
      • – Retrieve Sailor tuples with rating&gt;5, ordered

        by rating using B+ tree index on rating (clustered vs. unclustered)
      • – The rest are all applied on-the-fly

    • Index-Only Access Path
      • – Retrieve data entries from &lt;rating, sname, age&gt; index with rating &gt;5
      • – Sorted entries to applied the rest of the operators
      • B+ tree index on rating Hash index on age B+ tree index on &lt;rating, sname, age&gt;

      

    Queries Over Multiple Relations

    • Fundamental decision in System R: only left-deep are considered.

      join trees

      As the number of joins increases, the number of alternative

    • – plans grows rapidly; we need to restrict the search space.
    • – Left-deep trees allow us to generate all

      fully pipelined plans .

    • Intermediate results not written to temporary files.
    • Not all left-deep trees are fully pipelined (e.g., SM join). C D C

      D

      

    Enumeration of Left-Deep Plans

    • Left-deep plans differ only in (1) the order of relations, (2) the access method for each relation, and (3) the join method for each join.
    • Enumerated using N passes (if N relations joined): Pass 1: Find best 1-relation plan for each relation.
      • – –

      Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation.

      (All 2-relation plans.) Pass N: Find best way to join result of a (N-1)-relation plan

    • – (as outer) to the N’th relation.

      (All N-relation plans.)

    • For each subset of relations, retain only:
      • – Cheapest plan overall, plus
      • – Cheapest plan for *each* of the tuples.

      More details on Pass 1

    • Each single-relation plan is a partial left- deep plan
    • For each relation A in FROM list
      • – Identify selection terms mention only attributes of A – Identify attributes of A not mentioned in

        SELECT or WHERE conditions that involves

        attributes of other relations
      • – Choose best access method for A to carry

        out these selection and project as in single-

        relation query optimization

      More details on interesting order

    • We retain the cheapest plan for each different ordering of results tuples
    • An ordering of tuples can be useful at a subsequent step (e.g., sort-merge, GROUP BY, ORDER BY)

      More details on Pass 2 … N

    • For each inner (B) and outer (A) pair
      • – Identify selections that involve only attributes of B – Identify selections that defines the join between A and B – These selections need to be considered when choosing an access path for B – Identify attributes of B not mentioned in

        SELECT or WHERE conditions that involves

        attributes of other relations

      More details on Pass 2 … N (Cont.)

    • – Only selection applied before the join are those that match the chosen access path
    • – remaining selections and projections are done on-the-fly as part of the join
    • – Tuples from the outer plan are pipelined into the join
      • Materialization required for sort-merge (if result not already sorted), hash-join

    • – For each join method, we must determine the best access method for B, similar to single-relation query optimization

      Enumeration of Plans (Contd.) etc. handled as

    • ORDER BY, GROUP BY, aggregates

      

    a final step, using either an `interestingly ordered’

    plan or an additional sorting operator.

    • An N-1 way plan is not combined with an additional relation unless there is a join condition between them, unless all predicates in have been used up.

      i.e., avoid Cartesian products if possible.

    • – still
      • In spite of pruning plan space, this approach is exponential in the # of tables.

      Cost Estimation for Multirelation Plans SELECT FROM attribute list WHERE AND AND relation list term1 ... termk

    • Consider a query block:
    • Maximum # tuples in result is the product of the FROM cardinalities of relations in the clause.
    • associated with each

      Reduction factor (RF) term reflects the impact of the in reducing result size.

      Result cardinality = Max # tuples * product of all RF’s.

    • Multirelation plans are built up by joining one new relation at a time.
      • – Cost of join method, plus estimation of join cardinality gives

      B+ tree on rating Sailors: sname

      Example Reserves: Hash on sid B+ tree on bid

    • Pass1:
    • sid=sid <

      • – : B+ tree matches ,

      Sailors rating&gt;5

      and is probably cheapest. However, bid=100 rating &gt; 5 if this selection is expected to retrieve a lot of tuples, and index is Reserves Sailors unclustered, file scan may be cheaper.

    • Still, B+ tree plan kept (because tuples are in rating order).
      • – cheapest.

      : B+ tree on matches ; Reserves bid bid=500

      B+ tree on rating Sailors: sname

      Example Reserves: Hash on sid B+ tree on bid

    • Pass2:
    • sid=sid <

      • – We consider each plan retained from Pass 1 as the outer, and consider how bid=100 rating &gt; 5 to join it with the (only) other relation.
      • – Consideration 1: all available Join Methods: nested loop join, index NLJ, Reserves Sailors sort-merge, special order useful?
      • – Consideration 2: Access Method to Sailors with Reserves as outer: Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value. What if Sailors relation is the outer?
      • – –

      An example of three-way join query

      Nested Queries without Correlations

    • The nested subquery can be evaluated just once SELECT

      S.sname

    • If only yield a single value, the FROM WHERE S.rating =

      Sailors S value is incorporated into the SELECT MAX(S2.rating) ( top-level query FROM Sailors S2

    • If subquery returns a relation, practically a join
    • SELECT S.sname<
    • In principle, different join FROM Sailors S method can be used (e.g., index (SELECT R.sid WHERE S.sid IN nested loops join with Sailor as FROM Reserves R the inner)
    • WHERE R.bid = 103 <
    • In practice, nested loops join

      Nested Queries with Correlation

      S.sname FROM Sailors S WHERE EXISTS

      ( SELECT

    • * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Nested block to optimize:
      • Evaluation strategy: nested loop join with subquery as the inner
        • – Problem 1: subquery is evaluated once per outer tuple, same subquery can be evaluated many times
        • – Problem 2: precludes alternative join methods (e.g., sort-marge, hash join)

      SELECT

    • FROM

      Reserves R WHERE R.bid=103 AND

    • Nested subquery fully evaluated as a first step may lead to inefficiency
    • Implicit ordering of these blocks means that some good strategies are not considered.

      S.sid= outer value Equivalent non-nested query:

      SELECT

      S.sname FROM Sailors S, Reserves R WHERE

    • The non-nested version of the SELECT

      S.sid=R.sid

      

    Highlights of System R Optimizer

    (Review)

    • Impact:
      • – Most widely used currently; works well for &lt; 10 joins.

    • Cost estimation: Approximate art at best.
      • – Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes.
      • – • Plan Space: Too large, must be pruned.
      • – Only the space of

      left-deep plans is considered.

    • • Left-deep plans allow output of each operator to be pipelined

      into

      the next operator without storing it in a temporary relation.

      • – Cartesian products avoided.

      Summary

    • • Query optimization is an important task in a relational

      DBMS.
    • Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
    • Two parts to optimizing a query:
      • – Consider a set of alternative plans.

    • Must prune search space; typically, left-deep plans only.
      • – Must estimate cost of each plan that is considered.

      >Must estimate size of result and cost for each plan node.
    • Key issues : Statistics, indexes, operator implementations.

      Summary (Contd.)

    • Single-relation queries:
      • – All access paths considered, cheapest is chosen.
      • – has all needed fields and/or provides tuples in a desired order.

      : Selections that index, whether index key

      Issues match

    • Multiple-relation queries:
      • – All single-relation plans are first enumerated.

    • Selections/projections considered as early as possible.

      Next, for each 1-relation plan, all ways of joining another

    • – relation (as inner) are considered. Next, for each 2- relation plan that is `retained’, all ways of
    • >– joining another relation (as inner) are considered,