Hybrid Hash Join
12.5.5.5 Hybrid Hash Join
The hybrid hash-join algorithm performs another optimization; it is useful when memory sizes are relatively large, but not all of the build relation fits in memory. The partitioning phase of the hash-join algorithm needs a minimum of one block of memory as a buffer for each partition that is created, and one block of memory as an input buffer. To reduce the impact of seeks, a larger number of blocks would
be used as a buffer; let b b denote the number of blocks used as a buffer for the input and for each partition. Hence, a total of (n h + 1) ∗ b b blocks of memory are needed for partitioning the two relations. If memory is larger than (n h + 1) ∗ b b , we can use the rest of memory (M− (n h + 1) ∗ b b blocks) to buffer the first partition of the build input (that is, s 0 ), so that it will not need to be written out and read back in. Further, the hash function is designed in such a way that the hash index on s 0 fits in M − (n h + 1) ∗ b b blocks, in order that, at the end of partitioning of s, s 0 is completely in memory and a hash index can be built on s 0 . When the system partitions r it again does not write tuples in r 0 to disk; instead, as it generates them, the system uses them to probe the memory-resident hash index on s 0 , and to generate output tuples of the join. After they are used for probing, the tuples can be discarded, so the partition r 0 does not occupy any memory space. Thus, a write and a read access have been saved for each block of both r 0 and s 0 . The system writes out tuples in the other partitions as usual, and joins them later. The savings of hybrid hash join can be significant if the build input is only slightly bigger than memory.
If the size of the build relation is b s ,n h is approximately equal to b √ s / M . Thus, hybrid hash join is most useful if M >> (b s / M )∗b b , or M >> b s ∗b b , where the notation >> denotes much larger than. For example, suppose the block size is
4 kilobytes, the build relation size is 5 gigabytes, and b b is 20. Then, the hybrid hash-join algorithm is useful if the size of memory is significantly more than 20 megabytes; memory sizes of gigabytes or more are common on computers today.
If we devote 1 gigabyte for the join algorithm, s 0 would be nearly 1 gigabyte, and
hybrid hash join would be nearly 20 percent cheaper than hash join.
12.6 Other Operations 563
12.5.6 Complex Joins
Nested-loop and block nested-loop joins can be used regardless of the join condi- tions. The other join techniques are more efficient than the nested-loop join and its variants, but can handle only simple join conditions, such as natural joins or equi-joins. We can implement joins with complex join conditions, such as con- junctions and disjunctions, by using the efficient join techniques, if we apply the techniques developed in Section 12.3.3 for handling complex selections.
Consider the following join with a conjunctive condition:
r✶ 1 ∧ 2 ∧···∧ n s
One or more of the join techniques described earlier may be applicable for joins on the individual conditions r ✶ 1 s ,r✶ 2 s ,r✶ 3 s , and so on. We can compute the overall join by first computing the result of one of these simpler joins r ✶ i s ; each pair of tuples in the intermediate result consists of one tuple from r and one from s. The result of the complete join consists of those tuples in the intermediate result that satisfy the remaining conditions:
1 ∧···∧ i−1 ∧ i+1 ∧···∧ n
These conditions can be tested as tuples in r ✶ i s are being generated.
A join whose condition is disjunctive can be computed in this way. Consider:
r✶ 1 ∨ 2 ∨···∨ n s
The join can be computed as the union of the records in individual joins r ✶ i s :
(r ✶ 1 s ) ∪ (r ✶ 2 s ) ∪ · · · ∪ (r ✶ n s )
Section 12.6 describes algorithms for computing the union of relations.
Parts
» Indian Institute of Technology, Bombay
» Data Mining and Information Retrieval
» Structure of Relational Databases
» Database Schema When we talk about a database, we must differentiate between the database
» Basic Structure of SQL Queries
» Modification of the Database
» • Embedded SQL : Like dynamic SQL , embedded SQL provides a means by
» Advanced Aggregation Features**
» The Cartesian-Product Operation
» The Tuple Relational Calculus
» The Entity-Relationship Model
» • For an n-ary relationship set with an arrow on one of its edges, the primary
» Entity-Relationship Design Issues
» Representation of Generalization
» Alternative Notations for Modeling Data
» Other Aspects of Database Design
» Features of Good Relational Designs
» Atomic Domains and First Normal Form
» Decomposition Using Functional Dependencies
» BCNF Decomposition Algorithm
» Decomposition Using Multivalued Dependencies
» Application Programs and User Interfaces
» Overview of Physical Storage Media
» Magnetic Disk and Flash Storage
» Organization of Records in Files
» Comparison of Ordered Indexing and Hashing
» Implementation of Pipelining
» Evaluation Algorithms for Pipelining
» Transformation of Relational Expressions
» (A, r ), the number of distinct values that appear in the relation r for attribute
» Advanced Topics in Query Optimization**
» Transaction Atomicity and Durability
» Transaction Isolation and Atomicity
» Implementation of Isolation Levels
» Transactions as SQL Statements
» Weak Levels of Consistency in Practice
» Concurrency in Index Structures**
» Failure with Loss of Nonvolatile Storage
» Early Lock Release and Logical Undo Operations
» Centralized and Client – Server Architectures
» Parallelism on Multicore Processors
» Recovery and Concurrency Control
» Distributed Query Processing
» Heterogeneous Distributed Databases
» Partitioning and Retrieving Data
» Transactions and Replication
» Decision-Tree Construction Algorithm
» Relevance Ranking Using Terms
» Synonyms, Homonyms, and Ontologies
» Crawling and Indexing the Web
» Information Retrieval: Beyond Ranking of Pages
» Structured Types and Inheritance in SQL
» Array and Multiset Types in SQL
» Application Program Interfaces to XML
» Native Storage within a Relational Database
» Other Issues in Application Development
» Representation of Geographic Data
» Transaction-Processing Monitors
» Real-Time Transaction Systems
» PostgreSQL Implementation of MVCC
» Database Design and Querying Tools
» Database Administration Tools
» Business Intelligence Features
Show more