18.5.2.4 Parallel Nested-Loop Join

To illustrate the use of fragment-and-replicate-based parallelization, consider the case where the relation s is much smaller than relation r. Suppose that relation r is stored by partitioning; the attribute on which it is partitioned does not matter. Suppose too that there is an index on a join attribute of relation r at each of the partitions of relation r.

We use asymmetric fragment and replicate, with relation s being replicated and with the existing partitioning of relation r. Each processor P_j at which a partition of relation s is stored reads the tuples of relation s on disk D_j and replicates the tuples to every other processor P_i. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.

Now, each processor P_i performs an indexed nested-loop join of relation s with the ith partition of relation r. We can overlap the indexed nested-loop join with the distribution of tuples of relation s, to reduce the costs of writing the tuples of relation s to disk, and of reading them back. However, the replication of relation s must be synchronized with the join so that there is enough space in the in-memory buffers at each processor P_i to hold the tuples of relation s that have been received but that have not yet been used in the join.
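A minimal sketch of this scheme in Python, with all names invented for illustration: dictionaries stand in for the per-partition index on the join attribute of r, and a sequential outer loop stands in for the processors running in parallel.

```python
from collections import defaultdict

def indexed_nl_join_fragment_replicate(r_partitions, s, r_attr, s_attr):
    """Asymmetric fragment and replicate: s (small) is replicated to every
    processor; each processor joins it with its own partition of r, using
    an index on r's join attribute.  Tuples are plain dicts; the outer
    loop stands in for processors P_0 ... P_{n-1} running in parallel."""
    results = []
    for r_i in r_partitions:            # work done at processor P_i
        # The per-partition index on r's join attribute (built on the fly
        # here; the text assumes it already exists at each partition).
        index = defaultdict(list)
        for r_tuple in r_i:
            index[r_tuple[r_attr]].append(r_tuple)
        # s has been replicated to this processor; probe the index with
        # each s tuple.
        for s_tuple in s:
            for r_tuple in index.get(s_tuple[s_attr], []):
                results.append({**r_tuple, **s_tuple})
    return results

# r partitioned across two processors; s replicated to both.
r_partitions = [[{"a": 1, "x": "p"}, {"a": 2, "x": "q"}],
                [{"a": 3, "x": "r"}]]
s = [{"b": 2, "y": "u"}, {"b": 3, "y": "v"}]
print(indexed_nl_join_fragment_replicate(r_partitions, s, "a", "b"))
```

In a real system, each iteration of the outer loop would run at a different processor, with the tuples of s streamed in over the interconnect and joined as they arrive, subject to the buffer-space synchronization described above.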

18.5.3 Other Relational Operations

The evaluation of other relational operations also can be parallelized:

• Selection. Consider a selection σθ(r), first in the case where θ is of the form a_i = v, where a_i is an attribute and v is a value. If the relation r is partitioned on a_i, the selection proceeds at the single processor whose partition can contain tuples with a_i = v. If θ is of the form l ≤ a_i ≤ u (that is, θ is a range selection) and the relation has been range-partitioned on a_i, then the selection proceeds at each processor whose partition overlaps with the specified range of values; a routing sketch appears after this list. In all other cases, the selection proceeds in parallel at all the processors.

• Duplicate elimination. Duplicates can be eliminated by sorting; either of the parallel sort techniques can be used, optimized to eliminate duplicates as soon as they appear during sorting. We can also parallelize duplicate elimination by partitioning the tuples (by either range or hash partitioning) and eliminating duplicates locally at each processor.

• Projection. Projection without duplicate elimination can be performed as tuples are read in from disk in parallel. If duplicates are to be eliminated, either of the techniques just described can be used.

• Aggregation. Consider an aggregation operation. We can parallelize the operation by partitioning the relation on the grouping attributes and then computing the aggregate values locally at each processor. Either hash partitioning or range partitioning can be used. If the relation is already partitioned on the grouping attributes, the first step can be skipped.

We can reduce the cost of transferring tuples during partitioning by partly computing aggregate values before partitioning, at least for the commonly used aggregate functions. Consider an aggregation operation on a relation r, using the sum aggregate function on attribute B, with grouping on attribute A. The system can perform the operation at each processor P_i on those r tuples stored on disk D_i. This computation results in tuples with partial sums at each processor; there is one tuple at P_i for each value of attribute A present in r tuples stored on D_i. The system then partitions the result of the local aggregation on the grouping attribute A and performs the aggregation again (on tuples with the partial sums) at each processor P_i to get the final result.

As a result of this optimization, fewer tuples need to be sent to other processors during partitioning; a sketch of this two-phase scheme follows this list. The idea extends easily to the min and max aggregate functions. Extensions to the count and avg aggregate functions are left for you to do in Exercise 18.12.
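A minimal sketch of the two-phase sum aggregation, with names invented for illustration; each inner list of local_fragments stands in for the r tuples stored on one disk D_i, and hash partitioning routes the partial sums:

```python
from collections import defaultdict

def parallel_sum_aggregate(local_fragments, num_procs):
    """Compute sum(B) grouped by A with local pre-aggregation.

    local_fragments: one list of (a, b) tuples per processor.
    Phase 1: each processor sums B locally per A-value (partial sums).
    Phase 2: partial sums are hash-partitioned on A and summed again."""
    # Phase 1: local partial aggregation at each processor P_i.
    partials = []
    for frag in local_fragments:
        local = defaultdict(int)
        for a, b in frag:
            local[a] += b
        partials.append(local)

    # Phase 2: hash-partition the partial sums on A; each destination
    # processor combines the partials it receives into final sums.
    finals = [defaultdict(int) for _ in range(num_procs)]
    for local in partials:
        for a, partial_sum in local.items():
            finals[hash(a) % num_procs][a] += partial_sum
    return finals

# Two fragments of r(A, B):
frags = [[("x", 1), ("x", 2), ("y", 5)], [("x", 4), ("y", 1)]]
print(parallel_sum_aggregate(frags, 2))
```

Only one partial-sum tuple per distinct A-value at each processor crosses the network, rather than one tuple per input row.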


The parallelization of other operations is covered in several of the exercises.
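Returning to the range-selection case in the list above: under range partitioning on a_i, only the processors whose ranges overlap [l, u] need to run the selection. A minimal sketch, with partition boundaries invented for illustration:

```python
import bisect

def processors_for_range_selection(boundaries, l, u):
    """Given range-partition boundaries on a_i (processor i holds values
    in [boundaries[i-1], boundaries[i]), with open ends at the extremes),
    return the processors whose partitions overlap the selection [l, u]."""
    first = bisect.bisect_right(boundaries, l)   # partition that may hold l
    last = bisect.bisect_right(boundaries, u)    # partition that may hold u
    return list(range(first, last + 1))

# Four processors with boundaries at 100, 200, 300:
# P0: a_i < 100, P1: 100 <= a_i < 200, P2: 200 <= a_i < 300, P3: a_i >= 300.
print(processors_for_range_selection([100, 200, 300], 150, 250))  # [1, 2]
```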

18.5.4 Cost of Parallel Evaluation of Operations

We achieve parallelism by partitioning the I/O among multiple disks and partitioning the CPU work among multiple processors. If such a split is achieved without any overhead, and if there is no skew in the splitting of work, a parallel operation using n processors will take 1/n times as long as the same operation on a single processor. We already know how to estimate the cost of an operation such as a join or a selection. The time cost of parallel processing would then be 1/n of the time cost of sequential processing of the operation.

We must also account for the following costs:

• Start-up costs for initiating the operation at multiple processors.

• Skew in the distribution of work among the processors, with some processors getting a larger number of tuples than others.

• Contention for resources, such as memory, disk, and the communication network, resulting in delays.

• Cost of assembling the final result by transmitting partial results from each processor.

The time taken by a parallel operation can be estimated as:

T_part + T_asm + max(T_0, T_1, ..., T_{n-1})

where T_part is the time for partitioning the relations, T_asm is the time for assembling the results, and T_i is the time taken for the operation at processor P_i. Assuming that the tuples are distributed without any skew, the number of tuples sent to each processor can be estimated as 1/n of the total number of tuples. Ignoring contention, the cost T_i of the operation at each processor P_i can then be estimated by the techniques in Chapter 12.
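A quick numeric check of this estimate; the unit costs below are invented for illustration:

```python
def parallel_op_time(t_part, t_asm, t_procs):
    """Estimated time: partitioning + assembly + the slowest processor."""
    return t_part + t_asm + max(t_procs)

# Without skew, 1000 cost units of work split over n = 4 processors:
print(parallel_op_time(10, 5, [250, 250, 250, 250]))   # 265
# With skew, one processor handles 40% of the tuples; max() dominates:
print(parallel_op_time(10, 5, [400, 200, 200, 200]))   # 415
```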

The preceding estimate is optimistic, since skew is common. Even though breaking down a single query into a number of parallel steps reduces the size of the average step, it is the time for processing the single slowest step that determines the time taken for processing the query as a whole. A partitioned parallel evaluation, for instance, is only as fast as the slowest of the parallel executions. Thus, any skew in the distribution of the work across processors greatly affects performance.

The problem of skew in partitioning is closely related to the problem of partition overflow in sequential hash joins (Chapter 12). We can use the overflow resolution and avoidance techniques developed for hash joins to handle skew when hash partitioning is used. We can use balanced range partitioning and virtual processor partitioning to minimize skew due to range partitioning, as in Section 18.2.3; a sketch of the virtual-processor idea follows.
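A minimal sketch of virtual processor partitioning, with all numbers invented: tuples are range-partitioned into many small virtual ranges, which are then mapped round-robin onto the real processors, so that a hot key range is spread over several processors instead of landing on one:

```python
from collections import Counter

def virtual_processor_loads(keys, num_virtual, num_real, key_max):
    """Range-partition keys into many small virtual ranges, then map the
    virtual ranges round-robin onto the real processors."""
    width = key_max // num_virtual + 1      # size of each virtual range
    loads = Counter()
    for k in keys:
        virtual = k // width                # fine-grained range partition
        loads[virtual % num_real] += 1      # round-robin onto processors
    return loads

# Skewed data: most keys bunched at the low end of [0, 10000).
keys = list(range(2000)) + list(range(1000)) + list(range(9000, 9500))
print(virtual_processor_loads(keys, num_virtual=64, num_real=4, key_max=10000))
# Plain 4-way range partitioning (ranges of width 2500) would send 3000 of
# these 3500 keys to processor 0; the virtual scheme spreads them roughly
# evenly across the four processors.
```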
