21.4.2 HPL Single Node Evaluation

To determine whether the effects we have seen with DGEMM carry over to multiple nodes, we use the HPL benchmark (Petitet et al., 2010), which solves a system of linear equations via LU factorization. HPL can scale to large compute grids. However, we first consider HPL with matrices similar to those of the previous DGEMM experiments, namely, solving a system of linear equations on a single node in two regimes: one in which the data fits into the LLC of the processor and one in which it does not. In the following section, we extend HPL to examine internode scaling using parallel algorithms.
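To make the distinction between the two regimes concrete, the following sketch estimates the memory footprint of a dense n x n double-precision matrix and compares it with an assumed LLC size. The 8 MiB value is purely illustrative and not a claim about the actual c1.xlarge cache, which should be queried on the instance itself.

/* Sketch: estimate whether an n x n double-precision working set fits in
 * the last-level cache.  The LLC size below is an assumed, illustrative
 * value; query the actual instance (e.g. via the sysfs cache entries)
 * before drawing conclusions. */
#include <stdio.h>

int main(void)
{
    const double llc_bytes = 8.0 * 1024 * 1024;   /* assumed 8 MiB LLC */
    const long   sizes[]   = { 1000, 25000 };     /* n = 1 k and n = 25 k */

    for (int i = 0; i < 2; i++) {
        long   n     = sizes[i];
        double bytes = (double)n * (double)n * sizeof(double); /* dense n x n */
        printf("n = %6ld: %9.1f MiB -> %s the LLC\n",
               n, bytes / (1024 * 1024),
               bytes <= llc_bytes ? "fits in" : "exceeds");
    }
    return 0;
}

Under these assumptions, n = 1 k occupies roughly 8 MB and fits, while n = 25 k occupies about 5 GB and clearly exceeds any LLC, matching the two regimes studied below.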

In Fig. 21.5 we repeatedly execute HPL with n = 25 k on the Amazon EC2 c1.xlarge instance. We plot average and minimum execution times against the number of cores. As with the DGEMM experiments, error bars show one standard deviation.

[Fig. 21.5 plot data: minimum execution times of 1203.42 s (1 core), 612.49 s (2 cores), 321.11 s (4 cores), 228.25 s (6 cores), and 182.44 s (8 cores)]

Fig. 21.5 Average execution time of repeated HPL with n = 25 k on c1.xlarge instance by number of cores used. The matrices cannot fit within the last-level cache. The input configuration to each execution of HPL is identical. Error bars on average show one standard deviation

As with the first DGEMM experiment, these matrices do not fit within the LLC of the processor, but do fit within the main memory of a single instance. The results are quite similar to those of DGEMM, but show slightly different trends. We do not directly compare the DGEMM and HPL experiments because the two use very different algorithms; we only note that the trends show similar behavior on a single node.
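As a reference for how the reported statistics can be derived, the sketch below computes the minimum, average, and sample standard deviation over a set of repeated wall-clock timings. The timing values are hypothetical placeholders, except for the first, which echoes the measured one-core minimum in Fig. 21.5.

/* Sketch: the statistics reported in Figs. 21.5 and 21.6 -- minimum,
 * average, and one standard deviation over repeated wall-clock times.
 * The timings array is a hypothetical placeholder; in the experiments
 * each entry would be the wall-clock time of one HPL run. */
#include <math.h>
#include <stdio.h>

static void summarize(const double *t, int runs)
{
    double min = t[0], sum = 0.0, sq = 0.0;

    for (int i = 0; i < runs; i++) {
        if (t[i] < min) min = t[i];
        sum += t[i];
    }
    double mean = sum / runs;

    for (int i = 0; i < runs; i++)
        sq += (t[i] - mean) * (t[i] - mean);
    double stddev = sqrt(sq / (runs - 1));   /* sample standard deviation */

    printf("min = %.2f s, mean = %.2f s, stddev = %.2f s\n", min, mean, stddev);
}

int main(void)
{
    /* hypothetical timings (seconds) for repeated one-core runs */
    double t[] = { 1203.42, 1250.7, 1310.1, 1288.3, 1402.9 };
    summarize(t, (int)(sizeof t / sizeof t[0]));
    return 0;
}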

In the last single-node figure, we show HPL with a matrix of size n = 1 k so that all data for the computation fits within the LLC. Figure 21.6 plots the average (with one standard deviation) and minimum execution times of repeated runs of the HPL benchmark. The results show behavior similar to the corresponding DGEMM experiment. We could not identify the cause of the anomalously better performance at six cores compared to four or eight. In the following section, we extend HPL to multiple nodes to examine the effects of contention on parallel computations in a commercial cloud environment.

[Fig. 21.6 plot data: minimum execution times of 0.100 s (1 core), 0.060 s (2 cores), 0.070 s (4 cores), 0.070 s (6 cores), and 0.090 s (8 cores)]

Fig. 21.6 Average execution time of repeated HPL with n = 1 k on c1.xlarge instance by number of cores used. The matrices fit within the last-level cache. The input configuration to each execution of HPL is identical. Error bars on average show one standard deviation

From the single-node experiments we conclude that the very high variance in performance implies that the best achieved performance is not a good measure of expected performance. Depending on the cache behavior of the algorithm, the average expected performance can be several orders of magnitude worse than the best, in both efficiency and time to solution. In our experiments, the best average expected performance on Amazon is obtained using much less of the machine than is allocated: in an experiment accessing a large amount of main memory, using only a quarter of the machine provided the best expected performance!
