18.5.2.1 Partitioned Join

For certain kinds of joins, such as equi-joins and natural joins, it is possible to partition the two input relations across the processors and to compute the join locally at each processor. Suppose that we are using n processors and that the relations to be joined are r and s. Partitioned join then works this way: The system partitions the relations r and s each into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1. The system sends partitions ri and si to processor Pi, where their join is computed locally.

The partitioned join technique works correctly only if the join is an equi-join (for example, r ✶r.A=s.B s) and if we partition r and s by the same partitioning function on their join attributes. The idea of partitioning is exactly the same as that behind the partitioning step of hash join. In a partitioned join, however, there are two different ways of partitioning r and s:

• Range partitioning on the join attributes.
• Hash partitioning on the join attributes.

In either case, the same partitioning function must be used for both relations. For range partitioning, the same partition vector must be used for both relations. For hash partitioning, the same hash function must be used on both relations. Figure 18.2 depicts the partitioning in a partitioned parallel join.
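The requirement that both relations use the same partitioning function can be made concrete with a small sketch. The helper names and the partition vector below are illustrative assumptions, not part of the text; the point is that a tuple of r and a tuple of s with equal join-attribute values must map to the same partition index.

```python
N = 4  # number of processors/partitions (illustrative)

def hash_partition(join_key, n=N):
    """Hash partitioning: partition index computed by hashing the join attribute."""
    return hash(join_key) % n

def range_partition(join_key, vector=(10, 20, 30)):
    """Range partitioning: the SAME partition vector must be used for both
    relations. A vector (v0, v1, v2) defines n = 4 ranges:
    (-inf, v0], (v0, v1], (v1, v2], (v2, +inf)."""
    for i, v in enumerate(vector):
        if join_key <= v:
            return i
    return len(vector)

# A tuple of r with r.A = 17 and a tuple of s with s.B = 17 must agree on
# their destination processor under either scheme:
assert hash_partition(17) == hash_partition(17)
assert range_partition(17) == range_partition(17)  # both land in range (10, 20]
```

If r were partitioned with one vector and s with another, tuples with equal join-attribute values could land on different processors, and the local joins would silently miss results.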

Once the relations are partitioned, we can use any join technique locally at each processor Pi to compute the join of ri and si. For example, hash join, merge join, or nested-loop join could be used. Thus, we can use partitioning to parallelize any join technique.
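The two steps above (partition, then join locally) can be sketched end to end. This is a single-process simulation, with the n processors modeled as n buckets and a simple in-memory hash join standing in for whichever local join technique is chosen; the relation layouts are illustrative assumptions.

```python
from collections import defaultdict

def partitioned_join(r, s, n=3):
    """Simulated partitioned equi-join r joined with s on r.A = s.B."""
    # Step 1: partition both relations with the SAME function on the join keys.
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for a, payload in r:                      # r tuples: (A, payload)
        r_parts[hash(a) % n].append((a, payload))
    for b, payload in s:                      # s tuples: (B, payload)
        s_parts[hash(b) % n].append((b, payload))

    # Step 2: each "processor" Pi joins ri with si locally; here a basic
    # in-memory hash join plays the role of the local join technique.
    result = []
    for i in range(n):
        index = defaultdict(list)
        for a, ra in r_parts[i]:
            index[a].append(ra)
        for b, sb in s_parts[i]:
            for ra in index.get(b, []):
                result.append((b, ra, sb))
    return result

r = [(1, "r1"), (2, "r2"), (2, "r2b")]
s = [(2, "s1"), (3, "s2"), (1, "s3")]
out = sorted(partitioned_join(r, s))
# The union of the local results equals the join computed directly:
assert out == sorted((b, ra, sb) for a, ra in r for b, sb in s if a == b)
```

Because matching tuples always share a partition, the union of the n local join results is exactly the global join; no cross-partition comparisons are ever needed.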

Figure 18.2 Partitioned parallel join.

If one or both of the relations r and s are already partitioned on the join attributes (by either hash partitioning or range partitioning), the work needed for partitioning is reduced greatly. If the relations are not partitioned, or are partitioned on attributes other than the join attributes, then the tuples need to be repartitioned. Each processor Pi reads in the tuples on disk Di, computes for each tuple t the partition j to which t belongs, and sends tuple t to processor Pj. Processor Pj stores the tuples on disk Dj.
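The repartitioning step can be sketched as follows. Each processor's local disk is modeled as a list, and "sending" a tuple to Pj is modeled as appending it to Pj's bucket; the tuple values are illustrative assumptions.

```python
n = 3
local_disks = {                     # tuples currently held by each processor,
    0: [(7, "t1"), (4, "t2")],      # partitioned on some non-join attribute
    1: [(7, "t3")],
    2: [(1, "t4")],
}

# Each processor Pi scans its disk Di, computes the destination partition j
# for each tuple t from the JOIN attribute, and sends t to Pj, which stores
# it on disk Dj (modeled here as an in-memory bucket).
repartitioned = {j: [] for j in range(n)}
for i, tuples in local_disks.items():
    for t in tuples:
        j = hash(t[0]) % n          # partitioning function on the join attribute
        repartitioned[j].append(t)

# After repartitioning, all tuples sharing a join-attribute value (e.g., the
# two tuples with key 7) reside at a single processor.
```

In a real system this step is a shuffle over the interconnect, and its communication cost is a major component of the total join cost, which is why avoiding it when a relation is already suitably partitioned matters.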

We can optimize the join algorithm used locally at each processor to reduce I/O by buffering some of the tuples in memory instead of writing them to disk. We describe such optimizations in Section 18.5.2.3.

Skew presents a special problem when range partitioning is used, since a partition vector that splits one relation of the join into equal-sized partitions may split the other relation into partitions of widely varying size. The partition vector should be such that |ri| + |si| (that is, the sum of the sizes of ri and si) is roughly equal for all i = 0, 1, ..., n − 1. With a good hash function, hash partitioning is likely to have a smaller skew, except when there are many tuples with the same values for the join attributes.
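The balance criterion |ri| + |si| can be illustrated with a small sketch. The key distributions and partition vectors below are contrived assumptions: r is uniform over its join-attribute values, while s is concentrated on the larger values, so a vector that splits r evenly skews the combined load badly.

```python
r_keys = [1, 1, 2, 2, 3, 3, 4, 4]   # r: uniform over join-key values 1..4
s_keys = [3] * 6 + [4] * 6          # s: heavily skewed toward keys 3 and 4

def sizes(vector, keys):
    """Partition sizes under a range-partition vector (keys <= vector[i] -> i)."""
    counts = [0] * (len(vector) + 1)
    for k in keys:
        i = 0
        while i < len(vector) and k > vector[i]:
            i += 1
        counts[i] += 1
    return counts

def combined(vector):
    """|ri| + |si| for each partition i."""
    return [a + b for a, b in zip(sizes(vector, r_keys), sizes(vector, s_keys))]

# Vector (2,) splits r evenly (4 | 4), but the combined load is 4 vs. 16.
# Vector (3,) splits r unevenly (6 | 2), yet balances |ri| + |si|: 12 vs. 8.
assert combined((2,)) == [4, 16]
assert combined((3,)) == [12, 8]
```

The better vector deliberately gives r unequal shares so that the sum of work per processor, not the share of either relation alone, is balanced.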