Bulletin of Mathematical Analysis and Applications
ISSN: 1821-1291, URL: http://www.bmathaa.org
Volume 3 Issue 2(2011), Pages 269-277.
STRASSEN'S MATRIX MULTIPLICATION ALGORITHM FOR
MATRICES OF ARBITRARY ORDER
(COMMUNICATED BY MARTIN HERMANN)
IVO HEDTKE
Abstract. The well-known algorithm of Volker Strassen for matrix multiplication can only be used for (m2^k × m2^k) matrices. For arbitrary (n × n) matrices one has to add zero rows and columns to the given matrices to use Strassen's algorithm. Strassen gave a strategy of how to set m and k for arbitrary n to ensure n ≤ m2^k. In this paper we study the number z of additional zero rows and columns and its influence on the number of flops used by the algorithm in the worst case (z = n/16), best case (z = 1) and average case (z ≈ n/48). The aim of this work is to give a detailed analysis of the number of additional zero rows and columns and of the additional work caused by Strassen's bad parameters. Strassen used the parameters m and k to show that his matrix multiplication algorithm needs fewer than 4.7 n^{log_2 7} flops. We show in this paper that these parameters cause approximately 20% more work in the worst case than the optimal strategy for the worst case. This is the main reason for the search for better parameters.
1. Introduction
In his paper "Gaussian Elimination is not Optimal" [14], Volker Strassen developed a recursive algorithm (we will call it S) for the multiplication of square matrices of order m2^k. The algorithm itself is described below. Further details can be found in [8, p. 31].
Before we start with our analysis of the parameters of Strassen's algorithm, we take a brief look at the history of fast matrix multiplication. The naive algorithm for matrix multiplication is an O(n^3) algorithm. In 1969 Strassen showed that there is an O(n^2.81) algorithm for this problem. Shmuel Winograd optimized Strassen's algorithm. While the Strassen-Winograd variant is the one that is usually implemented (for example in the famous GEMMW package), there are faster algorithms (in theory) that are impractical to implement. The fastest
known algorithm, devised in 1987 by Don Coppersmith and Shmuel Winograd [6], runs in O(n^2.38) time.

1991 Mathematics Subject Classification. Primary 65F30; Secondary 68Q17.
Key words and phrases. Fast Matrix Multiplication, Strassen Algorithm.
© 2011 Universiteti i Prishtinës, Prishtinë, Kosovë.
The author was supported by the Studienstiftung des Deutschen Volkes.
Submitted February 2, 2011. Published March 2, 2011.

There is also an interesting group-theoretic approach to fast matrix
multiplication from Henry Cohn and Christopher Umans, see [4], [5] and [13].
Most researchers believe that an optimal algorithm with O(n^2) runtime exists, although no further progress has been made in finding one since then.
Because modern architectures have complex memory hierarchies and increasing parallelism, performance has become a complex tradeoff, not just a simple matter of counting flops (in this article one flop means one floating-point operation; that is, one addition is a flop and one multiplication is a flop, too). Algorithms which make use of this technology were described by Paolo D'Alberto and Alexandru Nicolau in [1]. Another well-known method is tiling: the naive algorithm can be sped up by a factor of two by using a six-loop implementation that blocks submatrices so that the data passes through the L1 cache only once.
1.1. The algorithm. Let A and B be (m2^k × m2^k) matrices. To compute C := AB let

    A = [ A11  A12 ]   and   B = [ B11  B12 ],
        [ A21  A22 ]             [ B21  B22 ]
where the A_ij and B_ij are matrices of order m2^{k−1}. With the following auxiliary matrices

    H1 := (A11 + A22)(B11 + B22)
    H2 := (A21 + A22) B11
    H3 := A11 (B12 − B22)
    H4 := A22 (B21 − B11)
    H5 := (A11 + A12) B22
    H6 := (A21 − A11)(B11 + B12)
    H7 := (A12 − A22)(B21 + B22)
we get

    C = [ H1 + H4 − H5 + H7    H3 + H5            ]
        [ H2 + H4              H1 + H3 − H2 + H6  ].

This leads to a recursive computation. In the last step of the recursion the products of the (m × m) matrices are computed with the naive algorithm (the straightforward implementation with three for-loops, which we will call N).
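For concreteness, the recursion above can be sketched in a few lines of Python. This is our own illustrative sketch (helper names such as `strassen` and `naive_mult` are ours, not from the paper), assuming matrices stored as nested lists whose order is a power of two, so the recursion bottoms out at m = 1:

```python
def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def naive_mult(A, B):
    # algorithm N: the straightforward three-loop implementation
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def strassen(A, B):
    # algorithm S with full recursion down to order 1 (i.e. m = 1)
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # split A and B into the four blocks of order n/2
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # the seven recursive products H1, ..., H7
    H1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    H2 = strassen(mat_add(A21, A22), B11)
    H3 = strassen(A11, mat_sub(B12, B22))
    H4 = strassen(A22, mat_sub(B21, B11))
    H5 = strassen(mat_add(A11, A12), B22)
    H6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    H7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    # assemble the four blocks of C
    C11 = mat_add(mat_sub(mat_add(H1, H4), H5), H7)
    C12 = mat_add(H3, H5)
    C21 = mat_add(H2, H4)
    C22 = mat_add(mat_sub(mat_add(H1, H3), H2), H6)
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```

A practical implementation would stop the recursion at a larger leaf size and hand the remaining small products to N, as discussed below.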
1.2. Properties of the algorithm. The algorithm S needs (see [14])

    F_S(m, k) := 7^k m^2 (2m + 5) − 4^k 6m^2

flops to compute the product of two square matrices of order m2^k. The naive algorithm N needs n^3 multiplications and n^3 − n^2 additions to compute the product of two (n × n) matrices. Therefore F_N(n) = 2n^3 − n^2. In the case n = 2^k the algorithm S is better than N if F_S(1, k) < F_N(2^k), which is the case iff k ≥ 10. But if we use algorithm S only for matrices of order at least 2^10 = 1024, we get a new problem:
Lemma 1.1. The algorithm S needs (17/3)(7^k − 4^k) units of memory (we write "uom" in short; the number of floats or doubles) to compute the product of two (2^k × 2^k) matrices.

Proof. Let M(n) be the number of uom used by S to compute the product of matrices of order n. The matrices A_ij, B_ij and H_1, ..., H_7 need 15(n/2)^2 uom. During the computation of the auxiliary matrices we need 7M(n/2) uom and 2(n/2)^2 uom
as input arguments for the recursive calls of S. Therefore we get M(n) = 7M(n/2) + (17/4)n^2. Together with M(1) = 0 this yields M(2^k) = (17/3)(7^k − 4^k). □
As an example, if we compute the product of two (2^10 × 2^10) matrices (represented as double arrays) with S we need 8 · (17/3)(7^10 − 4^10) bytes, i.e. 12.76 gigabytes of memory. That is an enormous amount of RAM for such a problem instance. Brice Boyer et al. [3] solved this problem with fully in-place schedules of Strassen-Winograd's algorithm (see the following paragraph), if the input matrices can be overwritten.
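The two operation counts and the memory bound above are easy to check numerically. The following Python sketch (function names are ours) reproduces the crossover at k = 10 and the 12.76 GB figure:

```python
def F_S(m, k):
    # flops of algorithm S for two square matrices of order m * 2^k
    return 7**k * m**2 * (2*m + 5) - 4**k * 6 * m**2

def F_N(n):
    # flops of algorithm N: n^3 multiplications and n^3 - n^2 additions
    return 2 * n**3 - n**2

def uom(k):
    # Lemma 1.1: units of memory for two (2^k x 2^k) matrices
    # (7^k - 4^k is always divisible by 3, so the division is exact)
    return 17 * (7**k - 4**k) // 3

# S beats N for n = 2^k exactly from k = 10 on
crossover = min(k for k in range(1, 20) if F_S(1, k) < F_N(2**k))
print(crossover)        # 10
print(8 * uom(10))      # 12758009176 bytes, i.e. about 12.76 GB
```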
Shmuel Winograd optimized Strassen's algorithm. The Strassen-Winograd algorithm (described in [11]) needs only 15 additions and subtractions, whereas S needs 18. Winograd also showed (see [15]) that the minimum number of multiplications required to multiply 2 × 2 matrices is 7. Furthermore, Robert Probert [12] showed that 15 additive operations are necessary and sufficient to multiply two 2 × 2 matrices with 7 multiplications.
Because of the bad properties of S with full recursion and large matrices, one can study the idea of using only one step of recursion. If n is even and we use one step of recursion of S (for the remaining products we use N), the ratio of this operation count to that required by N is (see [9])

    (7n^3 + 11n^2) / (8n^3 − 4n^2)  →  7/8   (n → ∞).

Therefore the multiplication of two sufficiently large matrices using S costs approximately 12.5% less than using N.
Using the technique of stopping the recursion in the Strassen-Winograd algorithm early, there are well-known implementations, for example:
• on the Cray-2 from David Bailey [2],
• GEMMW from Douglas et al. [7] and
• a routine in the IBM ESSL library [10].
1.3. The aim of this work. Strassen's algorithm can only be used for (m2^k × m2^k) matrices. For arbitrary (n × n) matrices one has to add zero rows and columns to the given matrices (see the next section) to use Strassen's algorithm. Strassen gave a strategy of how to set m and k for arbitrary n to ensure n ≤ m2^k. In this paper we study the number z of additional zero rows and columns and its influence on the number of flops used by the algorithm in the worst case, best case and average case.
It is known [11] that these parameters are not optimal. We only study the number z and the additional work caused by the bad parameters of Strassen. We give no better strategy of how to set m and k, and we do not analyze strategies other than the one from Strassen.
2. Strassen's parameters for matrices of arbitrary order

Algorithm S uses recursion to multiply matrices of order m2^k. If k = 0 then S coincides with the naive algorithm N, so we will only consider the case k > 0. To use S for arbitrary (n × n) matrices A and B (that means for arbitrary n) we have to embed them into matrices Ã and B̃, which are both (ñ × ñ) matrices with ñ := m2^k ≥ n. We do this by adding ℓ := ñ − n zero rows and columns to A
and B. This results in

    Ã B̃ = [ A        0_{n×ℓ} ] [ B        0_{n×ℓ} ] =: C̃,
           [ 0_{ℓ×n}  0_{ℓ×ℓ} ] [ 0_{ℓ×n}  0_{ℓ×ℓ} ]

where 0_{r×s} denotes the (r × s) zero matrix. If we delete the last ℓ columns and rows of C̃ we get the result C = AB.
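In code, the embedding amounts to zero-padding both factors to order ñ and cropping the product afterwards. A small sketch with hypothetical helper names (`pad`, `unpad` are ours):

```python
def pad(M, n_pad):
    # embed an (n x n) matrix into an (n_pad x n_pad) zero matrix
    n = len(M)
    return [row + [0] * (n_pad - n) for row in M] + \
           [[0] * n_pad for _ in range(n_pad - n)]

def unpad(M, n):
    # delete the last rows and columns again
    return [row[:n] for row in M[:n]]

def naive_mult(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# multiplying the padded matrices and cropping gives C = AB
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = unpad(naive_mult(pad(A, 4), pad(B, 4)), 3)
print(C == naive_mult(A, B))   # True
```

The same cropping works with any multiplication routine in place of the naive one, which is what makes the parameter choice below a pure cost question.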
We now focus on how to find m and k for arbitrary n with n ≤ m2^k. An optimal but purely theoretical choice is

    (m*, k*) = arg min{ F_S(m, k) : (m, k) ∈ ℕ × ℕ_0, n ≤ m2^k }.

Further methods of finding m and k can be found in [11]. We choose another way. According to Strassen's proof of the main result of [14], we define

    k := ⌊log_2 n⌋ − 4   and   m := ⌊n 2^{−k}⌋ + 1,                (2.1)

where ⌊x⌋ denotes the largest integer not greater than x. We define ñ := m2^k and study the relationship between n and ñ. The results are:
• worst case: ñ ≤ (17/16) n,
• best case: ñ ≥ n + 1 and
• average case: ñ ≈ (49/48) n.
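Equation (2.1) and the worst- and best-case bounds just stated can be verified by brute force for small n. A Python sketch (names ours; `n.bit_length() - 1` computes ⌊log_2 n⌋ exactly, avoiding floating-point logarithms):

```python
def strassen_parameters(n):
    # equation (2.1): k = floor(log2 n) - 4, m = floor(n / 2^k) + 1
    k = n.bit_length() - 5
    m = n // 2**k + 1
    return m, k

# check both bounds for all 16 <= n < 2^13
for n in range(16, 2**13):
    m, k = strassen_parameters(n)
    nt = m * 2**k
    assert n + 1 <= nt           # best case bound
    assert 16 * nt <= 17 * n     # worst case bound n~ <= (17/16) n

print(strassen_parameters(1024))   # (17, 6), so n~ = 1088 = (17/16) * 1024
```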
2.1. Worst case analysis.
Theorem 2.1. Let n ∈ ℕ with n ≥ 16. For the parameters (2.1) and m2^k = ñ we have

    ñ ≤ (17/16) n.

If n is a power of two, we have ñ = (17/16) n.
Proof. For fixed n there is exactly one α ∈ ℕ with 2^α ≤ n < 2^{α+1}. We define I^α := {2^α, ..., 2^{α+1} − 1}. Because of (2.1), for each n ∈ I^α the value of k is

    k = ⌊log_2 n⌋ − 4 = log_2 2^α − 4 = α − 4.

Possible values for m are

    m = ⌊n / 2^{α−4}⌋ + 1 = ⌊16 n / 2^α⌋ + 1 =: m(n).

m(n) is increasing in n, with m(2^α) = 17 and m(2^{α+1}) = 33. Therefore we have m ∈ {17, ..., 32}. For each n ∈ I^α one of the following inequalities holds:
    (I_1^α):   2^α = 16 · 2^{α−4} ≤ n < 17 · 2^{α−4}
    (I_2^α):   17 · 2^{α−4} ≤ n < 18 · 2^{α−4}
        ...
    (I_16^α):  31 · 2^{α−4} ≤ n < 32 · 2^{α−4} = 2^{α+1}.

Note that I^α = ∪_{q=1}^{16} I_q^α. It follows that for all n ∈ ℕ there exists exactly one α with n ∈ I^α, and for all n ∈ I^α there is exactly one q with n ∈ I_q^α.
Note that for all n ∈ I_q^α we have k = α − 4 and m(n) = q + 16. If we only focus on I_q^α, the difference ñ − n has its maximum at the lower end of I_q^α (ñ is constant and n has its minimum at the lower end of I_q^α). On I_q^α the value of ñ and the minimum of n are

    ñ_q^α := (16 + q) · 2^{α−4}   and   n_q^α := (15 + q) · 2^{α−4}.

Therefore the difference d_q^α := ñ_q^α − n_q^α is constant:

    d_q^α = (16 + q) · 2^{α−4} − (15 + q) · 2^{α−4} = 2^{α−4}   for all q.

To set this in relation with n we study

    r_q^α := ñ_q^α / n_q^α = (n_q^α + d_q^α) / n_q^α = 1 + d_q^α / n_q^α = 1 + 2^{α−4} / n_q^α.

Finally, r_q^α is maximal iff n_q^α is minimal, which is the case for n_1^α = 16 · 2^{α−4} = 2^α. With r_1^α = 17/16 we get ñ ≤ (17/16) n, which completes the proof. □

[Figure 1. F_S(2^d, 10 − d) for d = 0, 1, ..., 10: different parameters (m = 2^d, k = 10 − d, n = m2^k) to apply in the Strassen algorithm for matrices of order 2^10.]
Now we want to use the result above to take a look at the number of flops we need for S in the worst case. The worst case is n = 2^p for any p ∈ ℕ with p ≥ 4. An optimal decomposition (in the sense of minimizing the number ñ − n of zero rows and columns we add to the given matrices) is m = 2^d and k = p − d, because m2^k = 2^d 2^{p−d} = 2^p = n. Note that these parameters m and k have nothing to do with equation (2.1). Let's have a look at the influence of d:
Lemma 2.2. Let n = 2^p. In the decomposition n = m2^k we use m = 2^d and k = p − d. Then f(d) := F_S(2^d, p − d) has its minimum at d = 3.

Proof. We have f(d) = 2 · 7^p (8/7)^d + 5 · 7^p (4/7)^d − 4^p 6. Thus

    f(d + 1) − f(d)
      = [2 · 7^p (8/7)^{d+1} + 5 · 7^p (4/7)^{d+1} − 4^p 6] − [2 · 7^p (8/7)^d + 5 · 7^p (4/7)^d − 4^p 6]
      = 2 · 7^p (8/7)^d (8/7 − 1) + 5 · 7^p (4/7)^d (4/7 − 1)
      = 2 · 7^p (8/7)^d · 1/7 − 15 · 7^p (4/7)^d · 1/7 = 2 (4/7)^d 7^{p−1} (2^d − 7.5).

Therefore, f(d) attains its minimum at d = min{d : 2^d − 7.5 > 0} = 3. □
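The lemma can be confirmed by brute force over all splits m = 2^d, using the flop count F_S from Section 1.2 (sketch, names ours):

```python
def F_S(m, k):
    # flops of algorithm S for two square matrices of order m * 2^k
    return 7**k * m**2 * (2*m + 5) - 4**k * 6 * m**2

# for n = 2^p the decomposition m = 2^d, k = p - d is cheapest at d = 3
for p in range(4, 16):
    best_d = min(range(p + 1), key=lambda d: F_S(2**d, p - d))
    assert best_d == 3

print("minimum always at d = 3")
```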
[Figure 2. Comparison of Strassen's parameters (r_1, m = 17, k = p − 4), obviously better parameters (r_2, m = 16, k = p − 4) and the optimal parameters (lemma, m = 8, k = p − 3) for the worst case n = 2^p, plotted for p = 4, ..., 20; r_1(p) lies between about 1.19 and 1.21, r_2(p) between about 1.006 and 1.008. Limits of the r_i are dashed.]
Figure 1 shows that it is not optimal to use S with full recursion in the example p = 10. Now we study the worst case n = 2^p and different sets of parameters m and k for F_S(m, k):

(1) If we use equation (2.1), we get the original parameters of Strassen: k = p − 4 and m = 17. Therefore we define

    F_1(p) := F_S(17, p − 4) = 7^p (39 · 17^2)/7^4 − 4^p (6 · 17^2)/4^4.

(2) Obviously m = 16 would be a better choice, because with this we get m2^k = n (we avoid the additional zero rows and columns). Now we define

    F_2(p) := F_S(16, p − 4) = 7^p (37 · 16^2)/7^4 − 4^p 6.

(3) Finally we use the lemma above. With m = 8 = 2^3 and k = p − 3 we get

    F_3(p) := F_S(8, p − 3) = 7^p (3 · 64)/49 − 4^p 6.
Now we analyze the F_i relative to each other. Therefore we define r_i := F_i / F_3 (i = 1, 2). So we have r_i : {4, 5, ...} → ℝ, which is monotonically decreasing in p. With

    r_1(p) = 289 (1664 · 7^p − 2401 · 4^p) / (12544 (32 · 7^p − 49 · 4^p))

and

    r_2(p) = (4736 · 7^p − 7203 · 4^p) / (147 (32 · 7^p − 49 · 4^p))

we get

    r_1(4) = 3179/2624 ≈ 1.2115,      lim_{p→∞} r_1(p) = 3757/3136 ≈ 1.1980,
    r_2(4) = 124/123 ≈ 1.00813,       lim_{p→∞} r_2(p) = 148/147 ≈ 1.00680.

Figure 2 shows the graphs of the functions r_i. In conclusion, in the worst case the parameters of Strassen need approx. 20% more flops than the optimal parameters of the lemma.
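The exact values r_1(4) and r_2(4) can be reproduced with rational arithmetic (sketch, names ours; F_S as in Section 1.2):

```python
from fractions import Fraction

def F_S(m, k):
    return 7**k * m**2 * (2*m + 5) - 4**k * 6 * m**2

def F1(p): return F_S(17, p - 4)   # Strassen's parameters (2.1)
def F2(p): return F_S(16, p - 4)   # m = 16: no padding at all
def F3(p): return F_S(8, p - 3)    # optimal split from Lemma 2.2

print(Fraction(F1(4), F3(4)))      # 3179/2624, about 1.2115
print(Fraction(F2(4), F3(4)))      # 124/123, about 1.00813
```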
2.2. Best case analysis.
Theorem 2.3. Let n ∈ ℕ with n ≥ 16. For the parameters (2.1) and m2^k = ñ we have

    n + 1 ≤ ñ.

If n = 2ℓ − 1 with ℓ ∈ {16, ..., 31}, we have ñ = n + 1.

Proof. As in Theorem 2.1 we have for each I_q^α a constant value of ñ_q^α, namely 2^{α−4} (16 + q). Therefore n < ñ holds. The difference ñ − n has its minimum at the upper end of I_q^α. There we have ñ − n = 2^{α−4} (16 + q) − (2^{α−4} (16 + q) − 1) = 1. This shows n + 1 ≤ ñ. □
Let us focus on the flops we need for S again. Let's have a look at the example n = 2^p − 1. The original parameters (see equation (2.1)) for S are k = p − 5 and m = 32. Accordingly we define F(p) := F_S(32, p − 5). Because 2^p − 1 ≤ 2^p we can add one zero row and column and use the lemma from the worst case. Now we get the parameters m = 8 and k = p − 3 and define F̃(p) := F_S(8, p − 3). To analyze F and F̃ relative to each other we have

    q(p) := F(p) / F̃(p) = (11776 · 7^p − 16807 · 4^p) / (10976 · 7^p − 16807 · 4^p).

Note that q : {5, 6, ...} → ℝ is monotonically decreasing in p and has its maximum at p = 5. We get

    q(5) = 336/311 ≈ 1.08039   and   lim_{p→∞} q(p) = 11776/10976 ≈ 1.07289.

Therefore we can say: in the best case the parameters of Strassen are approx. 8% worse than the optimal parameters from the lemma in the worst case.
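The value q(5) can likewise be checked exactly (sketch, names ours; F_S as in Section 1.2):

```python
from fractions import Fraction

def F_S(m, k):
    return 7**k * m**2 * (2*m + 5) - 4**k * 6 * m**2

# n = 2^5 - 1 = 31: Strassen's (2.1) gives (m, k) = (32, 0); padding one
# extra row and column allows the worst-case optimum (m, k) = (8, 2)
q5 = Fraction(F_S(32, 5 - 5), F_S(8, 5 - 3))
print(q5)   # 336/311, about 1.08039
```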
2.3. Average case analysis. With Eñ we denote the expected value of ñ. We search for a relationship like ñ ≈ γn for γ ∈ ℝ. That means E[ñ/n] = γ.

Theorem 2.4. For the parameters (2.1) of Strassen and m2^k = ñ we have

    Eñ = (49/48) n.
Proof. First we focus only on I^α. We write E^α := E|_{I^α} and E_q^α := E|_{I_q^α} for the expected value on I^α and I_q^α, respectively. We have

    E^α ñ = (1/16) Σ_{q=1}^{16} E_q^α ñ = (1/16) Σ_{q=1}^{16} ñ_q = (1/16) Σ_{q=1}^{16} (q + 16) 2^{α−4} = 2^{α−5} · 49.

Together with E^α n = (1/2)[(2^{α+1} − 1) + 2^α] = 2^α + 2^{α−1} − 1/2, we get

    g(α) := E^α[ñ/n] = (E^α ñ)/(E^α n) = (2^{α−5} · 49)/(2^α + 2^{α−1} − 1/2).

Now we want to calculate E_s := E|_{T(s)}[ñ/n], where T(s) := ∪_{j=0}^{s} I^{4+j}, by using the values g(α). Because of |I^5| = 2|I^4| and |I^4 ∪ I^5| = 3|I^4| we have E_1 = (1/3) g(4) + (2/3) g(5). With the same argument we get

    E_s = Σ_{α=4}^{4+s} β_α g(α)   where   β_α = 2^{α−4}/(2^{s+1} − 1).
Finally we have

    E[ñ/n] = lim_{s→∞} E_s = lim_{s→∞} ( Σ_{α=4}^{4+s} β_α g(α) )
           = lim_{s→∞} ( (49/(2^{s+1} − 1)) Σ_{α=4}^{4+s} 2^{2α−9}/(2^α + 2^{α−1} − 1/2) ) = 49/48,

which is what we intended to show. □

Compared to the worst case (ñ ≤ (17/16) n, 17/16 = 1 + 1/16), note that 49/48 = 1 + 1/48 = 1 + (1/3) · (1/16).
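The ratio g(α) of the mean ñ to the mean n on I^α, which appears in the proof, can be computed exactly for moderate α and tends to 49/48 (sketch, names ours):

```python
from fractions import Fraction

def n_tilde(n):
    k = n.bit_length() - 5          # floor(log2 n) - 4, equation (2.1)
    return (n // 2**k + 1) * 2**k   # m * 2^k

def g(alpha):
    # ratio of the mean n~ to the mean n on I^alpha
    ns = range(2**alpha, 2**(alpha + 1))
    return Fraction(sum(n_tilde(n) for n in ns), sum(ns))

for alpha in (4, 8, 12):
    print(float(g(alpha)))   # decreasing towards 49/48 = 1.020833...
```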
3. Conclusion
Strassen used the parameters m and k in the form (2.1) to show that his matrix multiplication algorithm needs fewer than 4.7 n^{log_2 7} flops. We showed in this paper that these parameters cause approximately 20% more work in the worst case than the optimal strategy for the worst case. This is the main reason for the search for better parameters, as in [11].
References
[1] P. D'Alberto and A. Nicolau, Adaptive Winograd's Matrix Multiplications, ACM Trans. Math. Softw. 36, 1, Article 3 (March 2009).
[2] D. H. Bailey, Extra High Speed Matrix Multiplication on the Cray-2, SIAM J. Sci. Stat. Comput., Vol. 9, No. 3, 603-607, 1988.
[3] B. Boyer, J.-G. Dumas, C. Pernet, and W. Zhou, Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, International Symposium on Symbolic and Algebraic Computation 2009, Seoul.
[4] H. Cohn and C. Umans, A Group-theoretic Approach to Fast Matrix Multiplication, Proceedings of the 44th Annual Symposium on Foundations of Computer Science, 11-14 October 2003, Cambridge, MA, IEEE Computer Society, pp. 438-449.
[5] H. Cohn, R. Kleinberg, B. Szegedy and C. Umans, Group-theoretic Algorithms for Matrix Multiplication, Proceedings of the 46th Annual Symposium on Foundations of Computer Science, 23-25 October 2005, Pittsburgh, PA, IEEE Computer Society, pp. 379-388.
[6] D. Coppersmith and S. Winograd, Matrix Multiplication via Arithmetic Progressions, STOC '87: Proceedings of the nineteenth annual ACM Symposium on Theory of Computing, 1987.
[7] C. C. Douglas, M. Heroux, G. Slishman and R. M. Smith, GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm, J. of Comp. Physics, 110:1-10, 1994.
[8] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed., The Johns Hopkins University Press, Baltimore, MD, 1996.
[9] S. Huss-Lederman, E. M. Jacobson, J. R. Johnson, A. Tsao, and T. Turnbull, Implementation of Strassen's Algorithm for Matrix Multiplication, ISBN 0-89791-854-1, IEEE, 1996.
[10] IBM Engineering and Scientific Subroutine Library Guide and Reference, 1992. Order No. SC23-0526.
[11] P. C. Fischer and R. L. Probert, Efficient Procedures for using Matrix Algorithms, Automata, Languages and Programming, 2nd Colloquium, University of Saarbrücken, Lecture Notes in Computer Science, 1974.
[12] R. L. Probert, On the additive complexity of matrix multiplication, SIAM J. Comput., 5:187-203, 1976.
[13] S. Robinson, Toward an Optimal Algorithm for Matrix Multiplication, SIAM News, Volume 38, Number 9, November 2005.
[14] V. Strassen, Gaussian Elimination is not Optimal, Numer. Math., 13:354-356, 1969.
[15] S. Winograd, On Multiplication of 2 × 2 Matrices, Linear Algebra and its Applications, 4:381-388, 1971.
Mathematical Institute, University of Jena, D-07737 Jena, Germany
E-mail address: Ivo.Hedtke@uni-jena.de
277
ISSN: 1821-1291, URL: http://www.bmathaa.org
Volume 3 Issue 2(2011), Pages 269-277.
STRASSENโS MATRIX MULTIPLICATION ALGORITHM FOR
MATRICES OF ARBITRARY ORDER
(COMMUNICATED BY MARTIN HERMANN)
IVO HEDTKE
Abstract. The well known algorithm of Volker Strassen for matrix multiplication can only be used for (๐2๐ ร ๐2๐ ) matrices. For arbitrary (๐ ร ๐)
matrices one has to add zero rows and columns to the given matrices to use
Strassenโs algorithm. Strassen gave a strategy of how to set ๐ and ๐ for
arbitrary ๐ to ensure ๐ โค ๐2๐ . In this paper we study the number ๐ of additional zero rows and columns and the in๏ฌuence on the number of ๏ฌops used
by the algorithm in the worst case (๐ = ๐/16), best case (๐ = 1) and in the
average case (๐ โ ๐/48). The aim of this work is to give a detailed analysis
of the number of additional zero rows and columns and the additional work
caused by Strassenโs bad parameters. Strassen used the parameters ๐ and
๐ to show that his matrix multiplication algorithm needs less than 4.7๐log2 7
๏ฌops. We can show in this paper, that these parameters cause an additional
work of approximately 20 % in the worst case in comparison to the optimal
strategy for the worst case. This is the main reason for the search for better
parameters.
1. Introduction
In his paper โGaussian Elimination is not Optimal โ ([14]) Volker Strassen
developed a recursive algorithm (we will call it ๐ฎ) for multiplication of square
matrices of order ๐2๐ . The algorithm itself is described below. Further details can
be found in [8, p. 31].
Before we start with our analysis of the parameters of Strassenโs algorithm
we will have a short look on the history of fast matrix multiplication. The naive
algorithm for matrix multiplication is an ๐ช(๐3 ) algorithm. In 1969 Strassen
showed that there is an ๐ช(๐2.81 ) algorithm for this problem. Shmuel Winograd
optimized Strassenโs algorithm. While the Strassen-Winograd algorithm is a
variant that is always implemented (for example in the famous GEMMW package),
there are faster ones (in theory) that are impractical to implement. The fastest
known algorithm, devised in 1987 by Don Coppersmith and Winograd, runs in
1991 Mathematics Subject Classi๏ฌcation. Primary 65F30; Secondary 68Q17.
Key words and phrases. Fast Matrix Multiplication, Strassen Algorithm.
c
โ2011
Universiteti i Prishtineฬs, Prishtineฬ, Kosoveฬ.
The author was supported by the Studienstiftung des Deutschen Volkes.
Submitted February 2, 2011. Published March 2, 2011.
269
270
I. HEDTKE
๐ช(๐2.38 ) time. There is also an interesting group-theoretic approach to fast matrix
multiplication from Henry Cohn and Christopher Umans, see [4], [5] and [13].
Most researchers believe that an optimal algorithm with ๐ช(๐2 ) runtime exists, since
then no further progress was made in ๏ฌnding one.
Because modern architectures have complex memory hierarchies and increasing
parallelism, performance has become a complex tradeo๏ฌ, not just a simple matter
of counting ๏ฌops (in this article one ๏ฌop means one ๏ฌoating-point operation, that
means one addition is a ๏ฌop and one multiplication is one ๏ฌop, too). Algorithms
which make use of this technology were described by Paolo DโAlberto and
Alexandru Nicolau in [1]. An also well known method is Tiling: The normal
algorithm can be speeded up by a factor of two by using a six loop implementation
that blocks submatrices so that the data passes through the L1 Cache only once.
1.1. The algorithm. Let ๐ด and ๐ต be (๐2๐ ร๐2๐ ) matrices. To compute ๐ถ := ๐ด๐ต
let
]
]
[
[
๐ต11 ๐ต12
๐ด11 ๐ด12
,
and ๐ต =
๐ด=
๐ต21 ๐ต22
๐ด21 ๐ด22
where ๐ด๐๐ and ๐ต๐๐ are matrices of order ๐2๐โ1 . With the following auxiliary
matrices
๐ป1 := (๐ด11 + ๐ด22 )(๐ต11 + ๐ต22 )
๐ป2 := (๐ด21 + ๐ด22 )๐ต11
๐ป3 := ๐ด11 (๐ต12 โ ๐ต22 )
๐ป4 := ๐ด22 (๐ต21 โ ๐ต11 )
๐ป5 := (๐ด11 + ๐ด12 )๐ต22
๐ป6 := (๐ด21 โ ๐ด11 )(๐ต11 + ๐ต12 )
๐ป7 := (๐ด12 โ ๐ด22 )(๐ต21 + ๐ต22 )
we get
๐ถ=
[
๐ป 1 + ๐ป4 โ ๐ป5 + ๐ป 7
๐ป2 + ๐ป4
]
๐ป3 + ๐ป 5
.
๐ป1 + ๐ป3 โ ๐ป 2 + ๐ป6
This leads to recursive computation. In the last step of the recursion the products
of the (๐ ร ๐) matrices are computed with the naive algorithm (straight forward
implementation with three for-loops, we will call it ๐ฉ ).
1.2. Properties of the algorithm. The algorithm ๐ฎ needs (see [14])
๐น๐ฎ (๐, ๐) := 7๐ ๐2 (2๐ + 5) โ 4๐ 6๐2
๏ฌops to compute the product of two square matrices of order ๐2๐ . The naive
algorithm ๐ฉ needs ๐3 multiplications and ๐3 โ๐2 additions to compute the product
of two (๐ ร ๐) matrices. Therefore ๐น๐ฉ (๐) = 2๐3 โ ๐2 . In the case ๐ = 2๐ the
algorithm ๐ฎ is better than ๐ฉ if ๐น๐ฎ (1, ๐) < ๐น๐ฉ (2๐ ), which is the case i๏ฌ ๐ โฉพ 10.
But if we use algorithm ๐ฎ only for matrices of order at least 210 = 1024, we get a
new problem:
๐
๐
Lemma 1.1. The algorithm ๐ฎ needs 17
3 (7 โ 4 ) units of memory (we write โuomโ
in short) (number of floats or doubles) to compute the product of two (2๐ ร 2๐ )
matrices.
Proof. Let ๐ (๐) be the number of uom used by ๐ฎ to compute the product of
matrices of order ๐. The matrices ๐ด๐๐ , ๐ต๐๐ and ๐ปโ need 15(๐/2)2 uom. During the
computation of the auxiliary matrices ๐ปโ we need 7๐ (๐/2) uom and 2(๐/2)2 uom
STRASSENโS ALGORITHM FOR MATRICES OF ARBITRARY SIZE
271
as input arguments for the recursive calls of ๐ฎ. Therefore we get ๐ (๐) = 7๐ (๐/2)+
๐
๐
โก
(17/4)๐2 . Together with ๐ (1) = 0 this yields to ๐ (2๐ ) = 17
3 (7 โ 4 ).
As an example, if we compute the product of two (210 ร210 ) matrices (represented
10
as double arrays) with ๐ฎ we need 8 โ 17
โ 410 ) bytes, i.e. 12.76 gigabytes of
3 (7
memory. That is an enormous amount of RAM for such a problem instance. Brice
Boyer et al. ([3]) solved this problem with fully in-place schedules of StrassenWinogradโs algorithm (see the following paragraph), if the input matrices can be
overwritten.
Shmuel Winograd optimized Strassenโs algorithm. The Strassen-Winograd algorithm (described in [11]) needs only 15 additions and subtractions, whereas
๐ฎ needs 18. Winograd had also shown (see [15]), that the minimum number of
multiplications required to multiply 2 ร 2 matrices is 7. Furthermore, Robert
Probert ([12]) showed that 15 additive operations are necessary and su๏ฌcient to
multiply two 2 ร 2 matrices with 7 multiplications.
Because of the bad properties of ๐ฎ with full recursion and large matrices, one
can study the idea to use only one step of recursion. If ๐ is even and we use one step
of recursion of ๐ฎ (for the remaining products we use ๐ฉ ) the ratio of this operation
count to that required by ๐ฉ is (see [9])
7๐3 + 11๐2
8๐3 โ 4๐2
๐โโ
โโโโโ
7
.
8
Therefore the multiplication of two su๏ฌciently large matrices using ๐ฎ costs approximately 12.5 % less than using ๐ฉ .
Using the technique of stopping the recursion in the Strassen-Winograd algorithm early, there are well known implementations, as for example
โ on the Cray-2 from David Bailey ([2]),
โ GEMMW from Douglas et al. ([7]) and
โ a routine in the IBM ESSL library routine ([10]).
1.3. The aim of this work. Strassenโs algorithm can only be used for (๐2๐ ร
๐2๐ ) matrices. For arbitrary (๐ร๐) matrices one has to add zero rows and columns
to the given matrices (see the next section) to use Strassenโs algorithm. Strassen
gave a strategy of how to set ๐ and ๐ for arbitrary ๐ to ensure ๐ โค ๐2๐ . In this
paper we study the number ๐ of additional zero rows and columns and the in๏ฌuence
on the number of ๏ฌops used by the algorithm in the worst case, best case and in
the average case.
It is known ([11]), that these parameters are not optimal. We only study the
number ๐ and the additional work caused by the bad parameters of Strassen.
We give no better strategy of how to set ๐ and ๐, and we do not analyze other
strategies than the one from Strassen.
2. Strassenโs parameter for matrices of arbitrary order
Algorithm ๐ฎ uses recursions to multiply matrices of order ๐2๐ . If ๐ = 0 then ๐ฎ
coincides with the naive algorithm ๐ฉ . So we will only consider the case where
๐ > 0. To use ๐ฎ for arbitrary (๐ ร ๐) matrices ๐ด and ๐ต (that means for arbitrary
ห which are both (ห
๐) we have to embed them into matrices ๐ดห and ๐ต
๐ร๐
ห ) matrices
๐
with ๐
ห := ๐2 โฉพ ๐. We do this by adding โ := ๐
ห โ ๐ zero rows and colums to ๐ด
272
I. HEDTKE
and ๐ต. This results in
ห=
๐ดห๐ต
[
๐ด
0โร๐
0๐รโ
0โรโ
][
]
0๐รโ
ห
=: ๐ถ,
0โรโ
๐ต
0โร๐
where 0๐ร๐ denotes the (๐ ร ๐) zero matrix. If we delete the last โ columns and
rows of ๐ถห we get the result ๐ถ = ๐ด๐ต.
We now focus on how to ๏ฌnd ๐ and ๐ for arbitrary ๐ with ๐ โฉฝ ๐2๐ . An optimal
but purely theoretical choice is
(๐โ , ๐ โ ) = arg min{๐น๐ฎ (๐, ๐) : (๐, ๐) โ โ ร โ0 , ๐ โฉฝ ๐2๐ }.
Further methods of ๏ฌnding ๐ and ๐ can be found in [11]. We choose another way.
According to Strassenโs proof of the main result of [14], we de๏ฌne
๐ := โlog2 ๐โ โ 4
๐ := โ๐2โ๐ โ + 1,
and
(2.1)
where โ๐ฅโ denotes the largest integer not greater than ๐ฅ. We de๏ฌne ๐
ห := ๐2๐ and
study the relationship between ๐ and ๐
ห . The results are:
โ worst case: ๐
ห โฉฝ (17/16)๐,
โ best case: ๐
ห โฉพ ๐ + 1 and
โ average case: ๐
ห โ (49/48)๐.
2.1. Worst case analysis.
Theorem 2.1. Let ๐ โ โ with ๐ โฉพ 16. For the parameters (2.1) and ๐2๐ = ๐
ห we
have
17
๐.
๐
หโค
16
If ๐ is a power of two, we have ๐
ห=
17
16 ๐.
Proof. For ๏ฌxed ๐ there is exactly one ๐ผ โ โ with 2๐ผ โค ๐ < 2๐ผ+1 . We de๏ฌne
๐ผ ๐ผ := {2๐ผ , . . . , 2๐ผ+1 โ 1}. Because of (2.1) for each ๐ โ ๐ผ ๐ผ the value of ๐ is
๐ = โlog2 ๐โ โ 4 = log2 2๐ผ โ 4 = ๐ผ โ 4.
Possible values for ๐ are
โ
โ
โ
โ
1
16
๐ = ๐ ๐ผโ4 + 1 = ๐ ๐ผ + 1 =: ๐(๐).
2
2
๐(๐) is increasing in ๐ and ๐(2๐ผ ) = 17 and ๐(2๐ผ+1 ) = 33. Therefore we have
๐ โ {17, . . . , 32}. For each ๐ โ ๐ผ ๐ผ one of the following inequalities holds:
(๐ผ1๐ผ )
(๐ผ2๐ผ )
2๐ผ = 16 โ 2๐ผโ4
17 โ 2๐ผโ4
โค
โค
๐
๐
..
.
<
<
17 โ 2๐ผโ4
18 โ 2๐ผโ4
๐ผ
(๐ผ16
)
31 โ 2๐ผโ4 โค ๐ < 32 โ 2๐ผโ4 = 2๐ผ+1 .
โ16
Note that ๐ผ ๐ผ = ๐=1 ๐ผ๐๐ผ . It follows, that for all ๐ โ โ there exists exactly one ๐ผ
with ๐ โ ๐ผ ๐ผ and for all ๐ โ ๐ผ ๐ผ there is exactly one ๐ with ๐ โ ๐ผ๐๐ผ .
Note that for all ๐ โ ๐ผ๐๐ผ we have ๐ = ๐ผ โ 4 and ๐(๐) = ๐ + 16. If we only focus
on ๐ผ๐๐ผ the di๏ฌerence ๐
ห โ ๐ has its maximum at the lower end of ๐ผ๐๐ผ (ห
๐ is constant
STRASSENโS ALGORITHM FOR MATRICES OF ARBITRARY SIZE
273
๐น๐ฎ (2๐ , 10 โ ๐)
2 ร 109
1 ร 109
๐
0
1
2
3
4
5
6
7
8
9
10
Figure 1. Di๏ฌerent parameters (๐ = 2๐ , ๐ = 10 โ ๐, ๐ = ๐2๐ )
to apply in the Strassen algorithm for matrices of order 210 .
ห and the
and ๐ has its minimum at the lower end of ๐ผ๐๐ผ ). On ๐ผ๐๐ผ the value of ๐
minimum of ๐ are
๐ผโ4
๐
ห๐ผ
๐ := (16 + ๐) โ 2
and
๐ผโ4
๐๐ผ
.
๐ := (15 + ๐) โ 2
๐ผ
Therefore the di๏ฌerence ๐๐ผ
ห๐ผ
๐ := ๐
๐ โ ๐๐ is constant:
๐ผโ4
๐๐ผ
โ (15 + ๐) โ 2๐ผโ4 = 2๐ผโ4
๐ = (16 + ๐) โ 2
for all ๐.
To set this in relation with ๐ we study
๐๐๐ผ :=
๐ผ
๐
ห๐ผ
๐๐ผ
๐๐ผ
2๐ผโ4
๐
๐ + ๐๐
๐
=
=
1
+
=
1
+
.
๐๐ผ
๐๐ผ
๐๐ผ
๐๐ผ
๐
๐
๐
๐
๐ผ
๐ผโ4
Finally ๐๐๐ผ is maximal, i๏ฌ ๐๐ผ
= 2๐ผ .
๐ is minimal, which is the case for ๐1 = 16 โ 2
17
With ๐1๐ผ = 17/16 we get ๐
ห โค 16
๐, which completes the proof.
โก
Now we want to use the result above to take a look at the number of ๏ฌops we
need for ๐ฎ in the worst case. The worst case is ๐ = 2๐ for any 4 โค ๐ โ โ. An
optimal decomposition (in the sense of minimizing the number (ห
๐ โ ๐) of zero rows
and columns we add to the given matrices) is ๐ = 2๐ and ๐ = ๐ โ ๐, because
๐2๐ = 2๐ 2๐โ๐ = 2๐ = ๐. Note that these parameters ๐ and ๐ have nothing to do
with equation (2.1). Lets have a look on the in๏ฌuence of ๐:
Lemma 2.2. Let ๐ = 2๐ . In the decomposition ๐ = ๐2๐ we use ๐ = 2๐ and
๐ = ๐ โ ๐. Then ๐ (๐) := ๐น๐ฎ (2๐ , ๐ โ ๐) has its minimum at ๐ = 3.
Proof. We have ๐ (๐) = 2 โ 7๐ (8/7)๐ + 5 โ 7๐ (4/7)๐ โ 4๐ 6. Thus
๐ (๐ + 1) โ ๐ (๐)
= [2 โ 7๐ (8/7)๐+1 + 5 โ 7๐ (4/7)๐+1 โ 4๐ 6] โ [2 โ 7๐ (8/7)๐ + 5 โ 7๐ (4/7)๐ โ 4๐ 6]
= 2 โ 7๐ (8/7)๐ (8/7 โ 1) + 5 โ 7๐ (4/7)๐ (4/7 โ 1)
= 2 โ 7๐ (8/7)๐ โ 1/7 โ 15 โ 7๐ (4/7)๐ โ 1/7 = 2(4/7)๐ 7๐โ1 (2๐ โ 7.5).
Therefore, ๐ (๐) is a minimum if ๐ = min{๐ : 2๐ โ 7.5 > 0} = 3.
โก
274
I. HEDTKE
๐1 (๐)
๐2 (๐)
1.21
1.008
1.20
1.007
1.19
1.006
๐
๐
4 6 8 10 12 14 16 18 20
4 6 8 10 12 14 16 18 20
Figure 2. Comparison of Strassenโs parameters (๐1 , ๐ = 17,
๐ = ๐ โ 4), obviously better parameters (๐2 , ๐ = 16, ๐ = ๐ โ 4)
and the optimal parameters (lemma, ๐ = 8, ๐ = ๐ โ 3) for the
worst case ๐ = 2๐ . Limits of ๐๐ are dashed.
Figure 1 shows that it is not optimal to use 𝒮 with full recursion in the example n = 10. Now we study the worst case n = 2^k and different sets of parameters m and k for F_𝒮(m, k):

(1) If we use equation (2.1), we get the original parameters of Strassen: k = n − 4 and m = 17. Therefore we define
F_1(n) := F_𝒮(17, n − 4) = 7^n · (39 · 17^2)/7^4 − 4^n · (6 · 17^2)/4^4.

(2) Obviously m = 16 would be a better choice, because with this we get m2^k = n (we avoid the additional zero rows and columns). Now we define
F_2(n) := F_𝒮(16, n − 4) = 7^n · (37 · 16^2)/7^4 − 4^n · 6.

(3) Finally we use the lemma above. With m = 8 = 2^3 and k = n − 3 we get
F_3(n) := F_𝒮(8, n − 3) = 7^n · (3 · 64)/49 − 4^n · 6.
Now we analyze the F_i relative to each other. Therefore we define q_i := F_i/F_3 (i = 1, 2). So we have q_i : {4, 5, …} → ℝ, which is monotonically decreasing in n. With

q_1(n) = 289(4^n · 2401 − 7^n · 1664) / [12544(4^n · 49 − 7^n · 32)]

and

q_2(n) = (4^n · 7203 − 7^n · 4736) / [147(4^n · 49 − 7^n · 32)]

we get

q_1(4) = 3179/2624 ≈ 1.2115,   lim_{n→∞} q_1(n) = 3757/3136 ≈ 1.1980,
q_2(4) = 124/123 ≈ 1.00813,   lim_{n→∞} q_2(n) = 148/147 ≈ 1.00680.

Figure 2 shows the graphs of the functions q_i. In conclusion, in the worst case the parameters of Strassen need approx. 20 % more flops than the optimal parameters of the lemma.
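These values can be reproduced with exact rational arithmetic. A minimal sketch, assuming the closed-form flop count F_S(m, k) = (2m^3 + 5m^2) · 7^k − 6m^2 · 4^k, which matches the expressions for F_1, F_2 and F_3 above:

```python
from fractions import Fraction

# Sketch reproducing q_1(4) and q_2(4) exactly.
def F_S(m, k):
    # closed-form flop count of Strassen's algorithm, as used in this section
    return (2 * m**3 + 5 * m**2) * 7**k - 6 * m**2 * 4**k

def q(m, p, n):
    # F_S with parameters (m, n - p), relative to the optimal F_S(8, n - 3)
    return Fraction(F_S(m, n - p), F_S(8, n - 3))

print(q(17, 4, 4))  # 3179/2624, approx. 1.2115
print(q(16, 4, 4))  # 124/123, approx. 1.00813
```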
2.2. Best case analysis.
Theorem 2.3. Let n ∈ ℕ with n ≥ 16. For the parameters (2.1) and m2^k = ñ we have

n + 1 ≤ ñ.

If n = 2 · ℓ − 1, ℓ ∈ {16, …, 31}, we have ñ = n + 1.

Proof. As in Theorem 2.1 we have for each I_i^α a constant value of ñ, namely ñ_i^α = 2^{α−4}(16 + i). Therefore n < ñ holds. The difference ñ − n has its minimum at the upper end of I_i^α. There we have ñ − n = 2^{α−4}(16 + i) − (2^{α−4}(16 + i) − 1) = 1. This shows n + 1 ≤ ñ. □
Let us focus on the flops we need for 𝒮 again, and have a look at the example n = 2^k − 1. The original parameters (see equation (2.1)) for 𝒮 are m = 32 and recursion depth k − 5. Accordingly we define F(k) := F_𝒮(32, k − 5). Because 2^k − 1 ≤ 2^k we can add 1 zero row and column and use the lemma from the worst case. Now we get the parameters m = 8 and depth k − 3 and define F̃(k) := F_𝒮(8, k − 3). To analyze F and F̃ relative to each other we consider

q(k) := F(k)/F̃(k) = (4^k · 16807 − 7^k · 11776) / (4^k · 16807 − 7^k · 10976).

Note that q : {5, 6, …} → ℝ is monotonically decreasing in k and has its maximum at k = 5. We get

q(5) = 336/311 ≈ 1.08039   and   lim_{k→∞} q(k) = 11776/10976 ≈ 1.07289.

Therefore we can say: in the best case the parameters of Strassen are approx. 8 % worse than the optimal parameters from the lemma in the worst case.
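The value q(5) = 336/311 can be checked exactly. A minimal sketch, assuming the same closed-form flop count F_S(m, k) = (2m^3 + 5m^2) · 7^k − 6m^2 · 4^k, consistent with the ratio q(k) above:

```python
from fractions import Fraction

# Sketch checking the best-case ratio q(k) = F_S(32, k-5) / F_S(8, k-3).
def F_S(m, k):
    # closed-form flop count of Strassen's algorithm, as used in this section
    return (2 * m**3 + 5 * m**2) * 7**k - 6 * m**2 * 4**k

def q(k):
    # n = 2^k - 1: Strassen's parameters give F_S(32, k - 5); one extra
    # zero row and column allows the worst-case-optimal F_S(8, k - 3)
    return Fraction(F_S(32, k - 5), F_S(8, k - 3))

print(q(5))  # 336/311, approx. 1.08039
```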
2.3. Average case analysis. With Eñ we denote the expected value of ñ. We search for a relationship like ñ ≈ γn for γ ∈ ℝ, that means E[ñ/n] = γ.

Theorem 2.4. For the parameters (2.1) of Strassen and m2^k = ñ we have

Eñ = (49/48) n.

Proof. First we focus only on I^α. We write E_α := E|_{I^α} and E_i^α := E|_{I_i^α} for the expected value on I^α and I_i^α, resp. We have

E_α ñ = (1/16) Σ_{i=1}^{16} E_i^α ñ = (1/16) Σ_{i=1}^{16} ñ_i^α = (1/16) Σ_{i=1}^{16} (i + 16) 2^{α−4} = 2^{α−5} · 49.

Together with E_α n = (1/2)[(2^{α+1} − 1) + 2^α] = 2^α + 2^{α−1} − 1/2, we get

g(α) := E_α[ñ/n] = (E_α ñ)/(E_α n) = (2^{α−5} · 49)/(2^α + 2^{α−1} − 1/2).

Now we want to calculate E_s := E|_{I(s)}[ñ/n], where I(s) := ∪_{j=0}^{s} I^{4+j}, by using the values g(α). Because of |I^5| = 2|I^4| and |I^4 ∪ I^5| = 3|I^4| we have E_1 = (1/3) g(4) + (2/3) g(5). With the same argument we get

E_s = Σ_{α=4}^{4+s} β_α g(α),   where   β_α = 2^{α−4}/(2^{s+1} − 1).
Finally we have

E[ñ/n] = lim_{s→∞} E_s = lim_{s→∞} Σ_{α=4}^{4+s} β_α g(α)
= lim_{s→∞} [1/(2^{s+1} − 1)] Σ_{α=4}^{4+s} (49 · 2^{2α−9})/(2^α + 2^{α−1} − 1/2) = 49/48,

which is what we intended to show. □

Compared to the worst case (ñ ≤ (17/16) n, 17/16 = 1 + 1/16), note that 49/48 = 1 + 1/48 = 1 + (1/3) · (1/16).
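The block averages g(α) can also be evaluated directly. A minimal sketch, again assuming the reconstructed parameters k = floor(log2 n) − 4 and m = floor(n/2^k) + 1 for (2.1):

```python
from fractions import Fraction

# Sketch of the averaging argument.
# Assumed reconstruction of (2.1): k = floor(log2 n) - 4, m = floor(n / 2^k) + 1.
def n_tilde(n):
    k = n.bit_length() - 5
    return (n // 2**k + 1) * 2**k

def g(alpha):
    # block average E_alpha[n~] / E_alpha[n] over I^alpha = {2^alpha, ..., 2^(alpha+1) - 1}
    ns = range(2**alpha, 2**(alpha + 1))
    return Fraction(sum(n_tilde(n) for n in ns), sum(ns))

for alpha in (4, 6, 8, 10, 12):
    print(float(g(alpha)))  # decreases towards 49/48 = 1.020833...
```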
3. Conclusion
Strassen used the parameters m and k in the form (2.1) to show that his matrix multiplication algorithm needs less than 4.7 · n^{log_2 7} flops. We have shown that these parameters cause additional work of approx. 20 % in the worst case in comparison to the optimal strategy for the worst case. This is the main reason for the search for better parameters, as in [11].
References
[1] P. D'Alberto and A. Nicolau, Adaptive Winograd's Matrix Multiplications, ACM Trans. Math. Softw. 36, 1, Article 3 (March 2009).
[2] D. H. Bailey, Extra High Speed Matrix Multiplication on the Cray-2, SIAM J. Sci. Stat. Comput., Vol. 9, No. 3, 603–607, 1988.
[3] B. Boyer, J.-G. Dumas, C. Pernet, and W. Zhou, Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, International Symposium on Symbolic and Algebraic Computation 2009, Séoul.
[4] H. Cohn and C. Umans, A Group-theoretic Approach to Fast Matrix Multiplication, Proceedings of the 44th Annual Symposium on Foundations of Computer Science, 11–14 October 2003, Cambridge, MA, IEEE Computer Society, pp. 438–449.
[5] H. Cohn, R. Kleinberg, B. Szegedy and C. Umans, Group-theoretic Algorithms for Matrix Multiplication, Proceedings of the 46th Annual Symposium on Foundations of Computer Science, 23–25 October 2005, Pittsburgh, PA, IEEE Computer Society, pp. 379–388.
[6] D. Coppersmith and S. Winograd, Matrix Multiplication via Arithmetic Progressions, STOC '87: Proceedings of the nineteenth annual ACM symposium on Theory of Computing, 1987.
[7] C. C. Douglas, M. Heroux, G. Slishman and R. M. Smith, GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm, J. of Comp. Physics, 110:1–10, 1994.
[8] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed., The Johns Hopkins University Press, Baltimore, MD, 1996.
[9] S. Huss-Lederman, E. M. Jacobson, J. R. Johnson, A. Tsao, and T. Turnbull, Implementation of Strassen's Algorithm for Matrix Multiplication, IEEE, 1996. ISBN 0-89791-854-1.
[10] IBM Engineering and Scientific Subroutine Library Guide and Reference, 1992. Order No. SC23-0526.
[11] P. C. Fischer and R. L. Probert, Efficient Procedures for using Matrix Algorithms, Automata, Languages and Programming, 2nd Colloquium, University of Saarbrücken, Lecture Notes in Computer Science, 1974.
[12] R. L. Probert, On the additive complexity of matrix multiplication, SIAM J. Comput., 5:187–203, 1976.
[13] S. Robinson, Toward an Optimal Algorithm for Matrix Multiplication, SIAM News, Volume 38, Number 9, November 2005.
[14] V. Strassen, Gaussian Elimination is not Optimal, Numer. Math., 13:354–356, 1969.
[15] S. Winograd, On Multiplication of 2 × 2 Matrices, Linear Algebra and its Applications, 4:381–388, 1971.
Mathematical Institute, University of Jena, D-07737 Jena, Germany
E-mail address: Ivo.Hedtke@uni-jena.de