Complex constant number serial multipliers
K.Z. Pekmestzi, P. Kalivas, N. Moshopoulos and J. Sifnaios
Abstract: An efficient implementation of a serial complex number multiplier, for the case where one factor
is constant, is presented. The real and imaginary parts of the constant number are represented in
canonic signed digit (CSD) form. The corresponding parts of the non-constant factor are
represented in two’s complement form. The real and imaginary parts of the product are obtained in
two’s complement form. The CSD representation was chosen because it yields significant hardware
reduction. The proposed scheme operates with 100% hardware efficiency; namely, no sign
extension words between successive data words are required.
1 Introduction
An operation that is often encountered in digital signal processing
(DSP) algorithms, mainly in fast Fourier transform (FFT)
computations, is complex multiplication.
The FFT is used in several DSP and communication
applications. A representative example is 3G mobile
communications, where the FFT is used to implement the spread spectrum process and where low power
dissipation, and consequently a small amount of hardware,
is required.
In high radix FFT calculation, a large number of
multiplications use a small number of constant coefficients.
These multiplications can be hardwired, resulting in
hardware reduction, whereas, for the rest, normal multipliers can be used. In addition, the high-speed operation of
hardwired multipliers allows the use of serial architectures
for further reduction of the power dissipation.
The conventional implementations of complex multipliers
are based on the relation
$$(A + jB)(C + jD) = AC - BD + j(AD + BC) \tag{1}$$
It requires four multiplications and two additions.
Obviously, its implementation requires a significant amount
of hardware. A different approach, which considers the
complex multiplication as one operation and handles it at
the bit-level is presented in [1] for parallel and serial
implementations. This results in a significant reduction of
the hardware and the circuit layout complexity. Another
approach for the parallel implementation of the complex
multiplication is given in [2]. The hardware efficiency is
improved by combining the computations of the imaginary
and real part in the same Wallace tree.
In this paper, a new scheme is proposed for complex
multiplication when one factor is constant. This is the
case for the FFT computation where the constant factors
are the quantities exp[(2πk/n)j], with k = 0 … n − 1. The
representation of the real and imaginary parts of the
constant coefficient in CSD form [3, 4] is suggested, because
it minimises the required hardware. The other factor is in
two’s complement form. A new algorithm for such a
complex multiplication is also developed and presented in
this paper.
© IEE, 2003
IEE Proceedings online no. 20030270
doi:10.1049/ip-cds:20030270
Paper first received 5th May 2000 and in revised form 26th April 2001. Online
publishing date: 19 August 2003
The authors are with the Department of Electrical and Computer Engineering,
National Technical University of Athens, 157 73 Zographou, Athens, Greece
IEE Proc.-Circuits Devices Syst., Vol. 150, No. 5, October 2003
For the implementation of this algorithm the serial/
parallel approach is applied, because it yields more efficient
circuits from the aspect of hardware complexity. The
disadvantages of this approach are that the resulting circuit
is not systolic and operates with 50% efficiency. Therefore,
a technique, referred to in [5] and used in [6] for
non-constant binary multipliers, is properly modified and
used to allow the proposed scheme to operate with 100%
efficiency, with negligible hardware and combinational
delay overhead.
2 Description of the algorithm
Let us consider the multiplication M = AX, where X is a
complex number X = Xr + jXi with Xr and Xi in two’s
complement form

$$X_r = -x_{r,n-1} 2^{n-1} + \sum_{l=0}^{n-2} x_{r,l} 2^l \tag{2}$$

$$X_i = -x_{i,n-1} 2^{n-1} + \sum_{l=0}^{n-2} x_{i,l} 2^l \tag{3}$$

and A is the constant complex coefficient A = Ar + jAi with
Ar and Ai in CSD form

$$A_r = a_{r,m-1} 2^{m-1} + \sum_{k=0}^{m-2} a_{r,k} 2^k, \quad a_{r,k} \in \{0, \pm 1\} \tag{4}$$

$$A_i = a_{i,m-1} 2^{m-1} + \sum_{k=0}^{m-2} a_{i,k} 2^k, \quad a_{i,k} \in \{0, \pm 1\} \tag{5}$$

Thus, (1) can be written as follows:

$$M = M_r + jM_i = (A_r X_r - A_i X_i) + j(A_r X_i + A_i X_r) \tag{6}$$

We have four multiplications of constant numbers in CSD
form with numbers in two’s complement form. First, an
algorithm for such a multiplication is presented. Next, this
algorithm is extended for the complex multiplication
according to (6).
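
To make the CSD form of the constant concrete, the following Python sketch recodes an integer into signed digits in {−1, 0, 1} with no two adjacent non-zero digits, and multiplies by shift-and-add. It is only an illustration under stated assumptions: the function names `to_csd` and `csd_multiply` are hypothetical, digits are stored LSB first, and the values 19 and −36 are the integers that correspond to the example coefficient digits labelled in Fig. 1.

```python
def to_csd(value: int, m: int) -> list[int]:
    """Return m CSD digits of value, LSB first, each in {-1, 0, 1}.

    Illustrative helper (not from the paper). An odd residue v gets the
    digit +1 if v = 1 (mod 4) and -1 if v = 3 (mod 4); after subtracting
    the digit, v is divisible by 4, so the next digit is always 0.
    """
    digits = []
    v = value
    while v != 0:
        if v % 2 == 0:
            d = 0
        else:
            d = 2 - (v % 4)  # +1 or -1
            v -= d
        digits.append(d)
        v //= 2
    if len(digits) > m:
        raise ValueError("value does not fit in m CSD digits")
    return digits + [0] * (m - len(digits))


def csd_multiply(digits: list[int], y: int) -> int:
    """Multiply y by the CSD constant using one add/subtract per non-zero digit."""
    return sum(d * (y << k) for k, d in enumerate(digits))


if __name__ == "__main__":
    a_r = to_csd(19, 6)    # MSB first: 0 1 0 1 0 -1
    a_i = to_csd(-36, 6)   # MSB first: -1 0 0 -1 0 0
    print(a_r[::-1], a_i[::-1])
    print(csd_multiply(a_r, -7))
```

Only the non-zero digits cost an adder, which is why a representation with few non-zero digits reduces the hardware of a hardwired multiplier.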
We start from the following relation

$$P = bY = \sum_{k=0}^{m-1} P_k 2^k = \sum_{k=0}^{m-1} b_k Y 2^k \tag{7}$$

where Y is a two’s complement number

$$Y = -y_{n-1} 2^{n-1} + \sum_{l=0}^{n-2} y_l 2^l \tag{8}$$

and b is a constant coefficient in CSD form

$$b = \sum_{k=0}^{m-1} b_k 2^k, \quad b_k \in \{0, \pm 1\} \tag{9}$$

By applying the relation $y_l = 1 - \bar{y}_l$ in (7), where $\bar{y}_l$ is
the inversion of $y_l$, the partial product $P_k = b_k Y$ is written
as follows:

$$P_k = \begin{cases}
-2^{n-1} + y_{n-1} 2^{n-1} + \sum_{l=0}^{n-2} \bar{y}_l 2^l + 1, & \text{for } b_k = -1 \\
-2^{n-1} + \bar{y}_{n-1} 2^{n-1} + \sum_{l=0}^{n-2} y_l 2^l, & \text{for } b_k = 1 \\
0, & \text{for } b_k = 0
\end{cases} \tag{10}$$

From the above equations, it is clear that the partial
product $P_k$ is the binary number $\bar{y}_{n-1} y_{n-2} \ldots y_1 y_0$ for
positive $b_k$, or $y_{n-1} \bar{y}_{n-2} \ldots \bar{y}_1 \bar{y}_0$ for negative $b_k$, plus a ‘−1’
at the most significant position. Also, for negative $b_k$, a ‘1’ at
the least significant position must be added. Equation (10)
can be written more concisely as follows:

$$P_k = -|b_k| 2^{n-1} + \tilde{Y}_k + \nu_k \tag{11}$$

where

$$\tilde{Y}_k = \begin{cases}
y_{n-1} 2^{n-1} + \sum_{l=0}^{n-2} \bar{y}_l 2^l, & \text{for } b_k = -1 \\
\bar{y}_{n-1} 2^{n-1} + \sum_{l=0}^{n-2} y_l 2^l, & \text{for } b_k = 1 \\
0, & \text{for } b_k = 0
\end{cases} \tag{12}$$

$\nu_k$ is ‘1’ for negative $b_k$ and zero otherwise, and $|b_k|$ is the
absolute value of $b_k$. Thus, the final product is

$$P = \sum_{k=0}^{m-1} P_k 2^k = -\sum_{k=0}^{m-1} |b_k| 2^{n-1+k} + \sum_{k=0}^{m-1} \tilde{Y}_k 2^k + \sum_{k=0}^{m-1} \nu_k 2^k \tag{13}$$

According to (13), the following constant term Ct is
included in the product.

$$Ct = -\sum_{k=0}^{m-1} |b_k| 2^{n-1+k} + \sum_{k=0}^{m-1} \nu_k 2^k \tag{14}$$

For the result to be obtained in two’s complement form, the
sum $-\sum_{k=0}^{m-1} |b_k| 2^{n-1+k}$ in (14) must be converted to the
same form. Thus, with $\overline{|b_k|} = 1 - |b_k|$, the constant term Ct is written as

$$Ct = -2^{m+n-1} + \sum_{k=0}^{m-1} 2^{n-1+k} \overline{|b_k|} + 2^{n-1} + \sum_{k=0}^{m-1} \nu_k 2^k \tag{15}$$

By extending the above algorithm for the complex multiplication based on (6), a total constant term is deduced from
the corresponding terms of the two added products for
the imaginary and real part of the result. This term for the
imaginary part Mi is

$$Ct_i = -2^{m+n} + \sum_{k=0}^{m-1} \left( \overline{|a_{r,k}|} + \overline{|a_{i,k}|} \right) 2^{n-1+k} + \sum_{k=0}^{m-1} (\nu_{r,k} + \nu_{i,k}) 2^k + 2^n \tag{16}$$

where $\nu_{r,k}$ and $\nu_{i,k}$ are ‘1’ for $a_{r,k} = -1$ and $a_{i,k} = -1$
respectively and zero otherwise.
By applying the relation $-2^{m+n-1} + 2^{n-1} = -\sum_{k=0}^{m-1} 2^{n-1+k}$
in (16), the following equation is obtained:

$$Ct_i = -2^{m+n-1} + \sum_{k=0}^{m-1} \left( \overline{|a_{r,k}|} + \overline{|a_{i,k}|} - 1 \right) 2^{n-1+k} + \sum_{k=0}^{m-1} (\nu_{r,k} + \nu_{i,k}) 2^k + 2^{n-1} \tag{17}$$

The corresponding constant term for the real part of the
result Mr is

$$Ct_r = -2^{m+n-1} + \sum_{k=0}^{m-1} \left( \overline{|a_{r,k}|} + \overline{|a_{i,k}|} - 1 \right) 2^{n-1+k} + \sum_{k=0}^{m-1} (\nu_{r,k} + \pi_{i,k}) 2^k + 2^{n-1} \tag{18}$$

where $\pi_{i,k}$ is ‘1’ only when $a_{i,k} = 1$ and zero otherwise.

3 Circuit implementation
The proposed scheme is based on the serial–parallel
multiplier [7]. An example for $a_r = 01010\bar{1}$ and $a_i = \bar{1}00\bar{1}00$
(where $\bar{1}$ denotes the digit −1) is given in Fig. 1. The data enter the circuit through
the two inputs Xr and Xi. The imaginary and real parts of
the result are obtained from the outputs Mi and Mr
respectively. Two bits, one from Ar and the other from
Ai, correspond to each cell. If these bits are both non-zero,
the corresponding cell includes two cascaded full-adders in each
of the real and imaginary parts (a 4-full-adder cell). If
only one of these bits is zero, one full-adder is included in each part,
which is fed from Xr or Xi depending on which of the bits
ar,k or ai,k is non-zero. If bits ar,k and ai,k are both zero, no
full-adder is included, and the corresponding cell is empty.
The empty cells are also shown in Fig. 2 as dashed boxes.
According to (12), the inputs Xr and Xi, for the upper
(imaginary) part of the multiplier, are inverted if the
corresponding bits ai,k and ar,k are negative. Also, the inputs
Xr and Xi in the lower (real) part are inverted if the bits ar,k
and ai,k are negative and positive, respectively. The control
signal R2 is activated when the MSB of Xr and Xi enters
the circuit and, through the XOR gates, inverts
the data inputs. The control signal R1 is activated when the
LSB of Xr and Xi enters the circuit and initialises the carry
and sum delay elements.
Each of the constant terms Cti and Ctr is incorporated in
the above scheme as two separate parts. The low order
parts are

$$Ct_{i,L} = \sum_{k=0}^{m-1} (\nu_{r,k} + \nu_{i,k}) 2^k \tag{19}$$

$$Ct_{r,L} = \sum_{k=0}^{m-1} (\nu_{r,k} + \pi_{i,k}) 2^k \tag{20}$$
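
Returning to the single real product of Section 2, the decomposition in (10) to (15) can be checked numerically. The Python sketch below is illustrative only, under stated assumptions: the function names are hypothetical, the CSD digits are stored LSB first, and the constant 19 is the integer value of the real-part digits of the example in Fig. 1. It forms the modified partial products of (12) as unsigned bit patterns, adds the constant term of (15), and compares the result with an ordinary multiplication.

```python
def modified_partial(b_k: int, y: int, n: int) -> int:
    """Modified partial product of (12), read as an unsigned n-bit number."""
    if b_k == 0:
        return 0
    u = y & ((1 << n) - 1)     # the n two's complement bits of y
    msb = 1 << (n - 1)
    if b_k == 1:
        return u ^ msb         # invert only the sign bit
    return u ^ (msb - 1)       # b_k = -1: invert all bits below the sign bit


def constant_term(b: list[int], n: int) -> int:
    """Constant term Ct of (15) for CSD digits b (LSB first)."""
    m = len(b)
    high = sum((1 - abs(bk)) << (n - 1 + k) for k, bk in enumerate(b))
    low = sum((1 if bk == -1 else 0) << k for k, bk in enumerate(b))
    return -(1 << (m + n - 1)) + high + (1 << (n - 1)) + low


def product(b: list[int], y: int, n: int) -> int:
    """P of (13): shifted modified partial products plus the constant term."""
    partials = sum(modified_partial(bk, y, n) << k for k, bk in enumerate(b))
    return partials + constant_term(b, n)


if __name__ == "__main__":
    b = [-1, 0, 1, 0, 1, 0]    # CSD digits of 19, LSB first
    n = 5
    ok = all(product(b, y, n) == 19 * y for y in range(-16, 16))
    print("decomposition matches ordinary product:", ok)
```

The point of the decomposition is that every partial product becomes a non-negative bit pattern, so no sign extension is needed, and all sign handling is collected in one data-independent constant.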
Fig. 1 Complex multiplier for $a_r = 01010\bar{1}$ and $a_i = \bar{1}00\bar{1}00$ (where $\bar{1}$ denotes the digit −1)
Fig. 2 Complex multiplier for $a_r = 01010\bar{1}$ and $a_i = \bar{1}00\bar{1}00$ operating with 100% efficiency
The high order parts are

$$Ct_H = Ct_{i,H} = Ct_{r,H} = -2^{m+n-1} + \sum_{k=0}^{m-1} \left( \overline{|a_{r,k}|} + \overline{|a_{i,k}|} - 1 \right) 2^{n-1+k} + 2^{n-1} \tag{21}$$
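
The split of the constant terms into the low order parts (19), (20) and the common high order part (21) can be checked against the complex product (6). The following Python sketch is a behavioural check only, not a model of the circuit; the function names are hypothetical, digits are LSB first, and the constants 19 and −36 are the integer values of the example digits of Fig. 1.

```python
def ytilde(d: int, x: int, n: int) -> int:
    """Modified partial product of (12) for CSD digit d and n-bit two's complement x."""
    if d == 0:
        return 0
    u = x & ((1 << n) - 1)
    msb = 1 << (n - 1)
    return u ^ msb if d == 1 else u ^ (msb - 1)


def complex_const_mult(ar, ai, xr, xi, n):
    """Return (Mr, Mi) built from modified partial products and Ct_H, Ct_L."""
    m = len(ar)
    # Imaginary part: Ar*Xi + Ai*Xr. Real part: Ar*Xr - Ai*Xi (digit -ai,k acts on Xi).
    sum_i = sum((ytilde(ar[k], xi, n) + ytilde(ai[k], xr, n)) << k for k in range(m))
    sum_r = sum((ytilde(ar[k], xr, n) + ytilde(-ai[k], xi, n)) << k for k in range(m))
    # High order part (21): digit is 1 - |ar,k| - |ai,k|, i.e. +1, 0 or -1 per cell.
    ct_h = (-(1 << (m + n - 1))
            + sum((1 - abs(ar[k]) - abs(ai[k])) << (n - 1 + k) for k in range(m))
            + (1 << (n - 1)))
    # Low order parts (19) and (20).
    ct_il = sum(((ar[k] == -1) + (ai[k] == -1)) << k for k in range(m))
    ct_rl = sum(((ar[k] == -1) + (ai[k] == 1)) << k for k in range(m))
    return sum_r + ct_h + ct_rl, sum_i + ct_h + ct_il


if __name__ == "__main__":
    ar = [-1, 0, 1, 0, 1, 0]    # 19
    ai = [0, 0, -1, 0, 0, -1]   # -36
    n = 4
    ok = all(
        complex_const_mult(ar, ai, xr, xi, n) == (19 * xr + 36 * xi, 19 * xi - 36 * xr)
        for xr in range(-8, 8) for xi in range(-8, 8)
    )
    print("matches (6) for all 4-bit inputs:", ok)
```

Note that the high order digit per cell is +1 for an empty cell, 0 for a cell with one non-zero coefficient bit, and −1 when both bits are non-zero.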
The addition of Ctr,L and Cti,L is performed by using the
control signal R1 to initialise the carry delay elements
of the full-adders appropriately at the first clock cycle of the multiplication. According to (19) and (20), when the corresponding coefficient bits are both zero, the corresponding bits of
Ctr,L and Cti,L are also zero, and no initialisation is
required. When they are both non-zero, the carries of both
cascaded full-adders are initialised. The parts CtH can be
added serially from the free sum inputs of the leftmost
cell as shown in Fig. 1. Obviously, significant extra
hardware is required in this case for storing and shifting
these terms.
The above scheme operates with 50% efficiency, because
zero words must be inserted between the successive data
words that enter the circuit through the lines Xr and Xi. To
achieve 100% operational efficiency, the following
technique [5, 6] is used. When the MSB of Xr and Xi enters
the circuit, the output of the least significant part of the
result ML = Mr,L + jMi,L is being completed. At the same
time, the most significant part MH = Mr,H + jMi,H is
already stored in carry–save form in the sum and carry
delay elements of each full-adder. At the next cycle, the real
and imaginary parts are loaded into two double shift
registers placed at the lower and upper parts of the
multiplier. We name the two shift registers of each double
shift register ‘sum’ and ‘carry’ depending on the data they
store.
At the same time as the loading, the full-adder rows
begin the next multiplication. The
content of each double shift register is shifted and added
through a serial adder, which converts the most significant
part of the results to conventional binary form. Thus, the
circuit operates in a pipelined fashion and 100% operational
efficiency is achieved. By applying the above technique to
the circuit of Fig. 1 we obtain the scheme shown in Fig. 2.
The low order parts of the result are obtained from the
outputs Mr,L and Mi,L, and the high order part from the
outputs Mr,H and Mi,H. The control signal R is activated
when the LSB of the new Xr and Xi enters the circuit, and
loads the carries and sums of the current multiplication into
the double shift registers.
The incorporation of Ctr,L and Cti,L is implemented by
initialising the carries of the full-adders. The bits of Ctr,H
and Cti,H are loaded into the carry shift register at the same
time as the sum and carry loading. According to (21), a
‘1’ from Ctr,H and Cti,H corresponds to every empty cell,
and a zero to the cells that include one full-adder for the real
and imaginary part. A ‘−1’ corresponds to the 4-full-adder
cells. For numbers in CSD form, at least one zero digit always
exists between non-zero digits. Thus, the adjacent cells of a
4-full-adder cell are always empty cells. In the left adjacent
empty cell the corresponding digit of Ctr,H and Cti,H is ‘1’.
Combining the ‘1’ of this cell with the ‘−1’ of the 4-full-adder cell, we have a ‘1’ in the 4-full-adder cell and a zero in
the left adjacent cell.
To add the above ‘1’s, we exploit the delay elements of
the carry shift registers that are not used for carry loading.
Specifically, for every sequence of k empty cells the
corresponding delay elements of the carry shift register are
not used for loading except the rightmost, where the carry c
of the adjacent right non-empty cell is loaded. The k
corresponding bits of Cti,H and Ctr,H, which are ‘1’, are
added with c using the expression
$$\sum_{i=0}^{k-1} 2^i + c = c\, 2^k + \sum_{i=0}^{k-1} \bar{c}\, 2^i \tag{22}$$
According to (22), the carry c must be loaded and
propagated leftwards through all the cells of the sequence
until the first non-empty cell. In every cell, c is loaded
inverted except the last cell.
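
Identity (22) is easy to confirm by enumeration. The short Python sketch below is illustrative only (the function names are hypothetical): it evaluates both sides for a single carry bit c and a run of k empty cells.

```python
def lhs(k: int, c: int) -> int:
    """Left side of (22): k ones (one per empty cell) plus the carry c."""
    return sum(1 << i for i in range(k)) + c


def rhs(k: int, c: int) -> int:
    """Right side of (22): c at weight 2^k and the inverted carry in the k lower cells."""
    c_bar = 1 - c
    return (c << k) + sum(c_bar << i for i in range(k))


if __name__ == "__main__":
    ok = all(lhs(k, c) == rhs(k, c) for k in range(1, 17) for c in (0, 1))
    print("identity (22) holds for k = 1..16:", ok)
```

For c = 0 both sides equal 2^k − 1, and for c = 1 both sides equal 2^k, which is why loading the inverted carry into the empty cells and the true carry into the last position adds the run of ones without any extra adder.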
In the 4-full-adder cells, where a ‘1’ from Cti,H and Ctr,H
corresponds, the carry that comes from the right adjacent
empty cell continues to be propagated leftwards according
to (22). In the left adjacent empty cell, it is added with the
carry of the 4-full-adder cell through the half-adder, as
shown in Fig. 2. The carry of this half-adder continues to be
propagated leftwards.
The terms $2^{n-1}$ in Cti,H and Ctr,H are added through the
free sum inputs of the leftmost cell when the LSB of X
enters the circuit, namely, one clock cycle after the
activation of R. If the length n of X is greater
than the length m of A, this term must be added n − m + 1
clocks after the activation of R. In this case, an appropriate
number of delay elements must be inserted in the sum
inputs of the leftmost cell. If n < m, then the terms $2^{n-1}$ must
be added through the sum input switches of the cell that is
located at the (n − m)th position from the left. The terms
$-2^{n+m-1}$ are added by entering serially a ‘1’ into the left end
of each carry shift register.
The combinational delay of the above circuit is twice
that of a simple serial-parallel
adder, because of the cascaded full-adders in the 4-full-adder cells. To avoid this, we apply a delay rearrangement
[8–10] based on a graph property, which transforms these
cells into pipelined form. According to this property, if we
consider all the lines that are intersected by a cut across a
graph, we can remove one delay element from all the lines
that have the same direction and insert a new one into the
remaining lines with the opposite direction. The resulting
circuit is shown in Fig. 3. The cut in the double cell is also
shown in this Figure.
As a result of the rearrangement, one delay element is
removed from the sum lines and the double shift registers, and a
new one is inserted into the lines Xr, Xi, R, the sum lines
connecting the cascaded full-adders and the lines that
propagate the carry leftwards. The switches that correspond
to the removed delay elements are merged with the switches
of the next delay elements. The result is that these switches
are activated for two clock cycles and consequently the sum
and carry loading in the 4-full-adder cells is extended over two
cycles. During the second cycle, the carry of the upper full-adder is loaded into the shift register. This requires an extra
clock cycle at the end, namely, a zero bit between the data
words.
Another consequence of the above transformation is the
decrease of the broadcasting of the lines Xr, Xi and R.
Further reduction of this broadcasting can be achieved by
applying the previously mentioned graph property to the
empty cells. In this case, six delay elements are removed
from the sum lines and shift registers and five new ones are inserted
into the lines Xr, Xi, R and the lines that propagate the carry
leftwards. Obviously, this transformation cannot be applied
to the right adjacent cell of each 4-full-adder cell, because it
increases the combinational delay. The critical path delay of
the above scheme is equal to the delay of one XOR gate and
one full-adder.
In Table 1, the proposed scheme is compared from the
aspect of hardware complexity with a conventional scheme,
which consists of four multipliers, where the constant term
is represented in two’s complement form. In this case the
number of non-zero bits is m/2, where m is the length of the
constant term. For the estimation of the hardware
complexity of the proposed complex multiplier, we have
assumed the average case where the number of the zero bits
in the CSD representation is 2m/3. Two coefficient bits
correspond to every cell. Consequently, the proposed
scheme consists of 4m/9 empty cells, m/9 4-full-adder cells
and 4m/9 cells with two full-adders. Also, m/9 of the empty
cells are right adjacent to 4-full-adder cells and cannot be
transformed. The hardware estimation for each type of cell
is given in Table 2.
Fig. 3 Final form of the proposed complex multiplier for $a_r = 01010\bar{1}$ and $a_i = \bar{1}00\bar{1}00$ (where $\bar{1}$ denotes the digit −1)
Table 1: Comparison of the proposed design with a four multipliers scheme
Multiplication scheme              Hardware complexity per coefficient bit (transistors)    Efficiency
Four serial/parallel multipliers   2FA + 6DE = 96                                            50%
Proposed scheme                    (4m/3)FA + (2m/9)HA + (64m/9)DE + (26m/9)SW = 108         100%
FA: full-adder (24 transistors); HA: half-adder (10 transistors); DE: dynamic delay element (8 transistors); SW: switch (6 transistors) [11]
Table 2: Hardware complexity for each type of cell of the proposed complex multiplier
Cell type                     No. of cells    Hardware complexity
4-full-adder cells            m/9             4FA + 2HA + 11DE + 4SW
Untransformed empty cells     m/9             6DE + 6SW
Transformed empty cells       3m/9            5DE
Cells with two full-adders    4m/9            2FA + 8DE + 4SW
Total hardware complexity                     (4m/3)FA + (2m/9)HA + (64m/9)DE + (26m/9)SW
FA: full-adder (24 transistors); HA: half-adder (10 transistors); DE: dynamic delay element (8 transistors); SW: switch (6 transistors) [11]
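
The totals in Tables 1 and 2 follow from the per-cell counts and the stated transistor costs. The Python sketch below simply redoes that arithmetic with exact fractions; the dictionary and function names are hypothetical, and the cell shares are the average-case values m/9, m/9, 3m/9 and 4m/9 taken from Table 2, expressed per coefficient bit.

```python
from fractions import Fraction

# Transistor cost of each component, as given in the table footnotes.
COST = {"FA": 24, "HA": 10, "DE": 8, "SW": 6}

# (share of cells per coefficient bit, components per cell), from Table 2.
CELLS = {
    "4-full-adder": (Fraction(1, 9), {"FA": 4, "HA": 2, "DE": 11, "SW": 4}),
    "untransformed empty": (Fraction(1, 9), {"DE": 6, "SW": 6}),
    "transformed empty": (Fraction(3, 9), {"DE": 5}),
    "two-full-adder": (Fraction(4, 9), {"FA": 2, "DE": 8, "SW": 4}),
}


def components_per_bit() -> dict:
    """Average number of each component per coefficient bit."""
    totals = {name: Fraction(0) for name in COST}
    for share, parts in CELLS.values():
        for name, count in parts.items():
            totals[name] += share * count
    return totals


def transistors_per_bit() -> Fraction:
    """Average transistor count per coefficient bit of the proposed scheme."""
    return sum(COST[name] * qty for name, qty in components_per_bit().items())


def conventional_per_bit() -> int:
    """Four serial/parallel multipliers: 2FA + 6DE per coefficient bit."""
    return 2 * COST["FA"] + 6 * COST["DE"]


if __name__ == "__main__":
    print(components_per_bit())
    print(float(transistors_per_bit()), conventional_per_bit())
```

The exact average for the proposed scheme is 976/9, which rounds to the 108 transistors listed in Table 1, against 96 for the four-multiplier scheme at 50% efficiency.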
According to Table 1, the proposed design has lower
hardware complexity than a scheme that consists of
four multipliers, taking into account that two circuits of the
latter are required to achieve 100% operational efficiency.
Evidently, a significant part of the hardware
efficiency is due to the CSD representation. The proposed
scheme is not fully systolic. A delay element is inserted in the Xr,
Xi and R lines every three cells on average, which reduces the
broadcasting of these signals.
4 Conclusions
In this paper, a serial complex multiplier for a constant factor
in CSD form has been presented. An algorithm for the
multiplication of a constant number in CSD form with a
number in two’s complement form has been introduced.
This algorithm significantly reduces the hardware complexity. Moreover, the proposed circuit operates with 100%
efficiency. The above algorithm can also be applied to the
implementation of a parallel complex constant number
multiplier. All circuits presented in this paper have been extensively
verified through simulation.
5 References
1 Pekmestzi, K.Z.: ‘Complex number multipliers’, IEE Proc. E, Comput.
Digit. Tech., 1989, 136, (1), pp. 70–75
2 Oklobdzija, V.G., Villeger, D., and Soulass, T.: ‘An integrated
multiplier for complex numbers’, J. VLSI Signal Process., 1994, 7,
pp. 213–221
3 Reitwiesner, G.W.: ‘Binary Arithmetic’, in ‘Advances in computers’
(Academic Press, New York, 1966), Vol. 1, pp. 261–265
4 Peled, A.: ‘On the hardware implementation of digital signal
processors’, IEEE Trans. Acoust. Speech Signal Process., 1976, 24,
pp. 76–86
5 Dadda, L., and Breveglieri, L.: ‘A modular bit serial convolver’.
Proceedings of 3rd IFIP, Wafer scale integration III, (North Holland,
Amsterdam, 1990), pp. 279–289
6 Even, G.: ‘Two’s complement pipeline multipliers’, Integr., VLSI J.,
1997, 22, pp. 23–38
7 Dadda, L.: ‘On serial-input multipliers for two’s complement
numbers’, IEEE Trans. Comput., 1989, 38, (9), pp. 1341–1345
8 Kung, S.Y.: ‘On supercomputing with systolic/wavefront arrays’,
Proc. IEEE, 1984, 72, (7), pp. 867–884
9 Caraiscos, C., and Pekmestzi, K.Z.: ‘Low-latency bit-parallel systolic
VLSI implementation of FIR digital filters’, IEEE Trans. Circuits
Syst. II, Analog Digit. Signal Process., 1996, 43, (7), pp. 529–534
10 Caraiscos, C., and Pekmestzi, K.Z.: ‘A class of systolic bit-serial
multipliers’, Int. J. Electron., 1994, 76, (3), pp. 463–468
11 Weste, N., and Eshraghian, K.: ‘Principles of CMOS VLSI design’
(Addison-Wesley, Reading, 1994)
IEE Proc.-Circuits Devices Syst., Vol. 150, No. 5, October 2003
K.Z. Pekmestzi, P. Kalivas, N. Moshopoulos and J. Sifnaios
Abstract: An efficient implementation of a complex number serial multiplier, when the one factor
is constant, is presented. The real and imaginary parts of the constant number are represented in
canonic signed digit (CSD) form. The corresponding parts of the non-constant factor are
represented in two’s complement form. The real and imaginary parts of the product are obtained in
two’s complement form. The CSD representation was chosen because it yields significant hardware
reduction. The proposed scheme operates with 100% hardware efficiency; namely, no sign
extension words between successive data words are required.
1
Introduction
An operation that is often met in digital signal processing
(DSP) algorithms, mainly in fast fourier transform (FFT)
computations, is the complex multiplication operation.
The FFT is used in several DSP and communication
applications. A representative example is in 3G mobile
communications, where the FFT is used for the implementation of the spread spectrum process, where low power
dissipation and consequently a small amount of hardware
are required.
In high radix FFT calculation, a large number of
multiplications use a small number of constant coefficients.
These multiplications can be hardwired, resulting in
hardware reduction, whereas, for the rest, normal multipliers can be used. In addition, the high-speed operation of
hardwired multipliers allows the use of serial architectures
for further reduction of the power dissipation.
The conventional implementations of complex multipliers
are based on the relation
ðA þ jBÞðC þ jDÞ ¼ AC BD þ jðAD þ BCÞ
ð1Þ
It requires four multiplications and two additions.
Obviously, its implementation requires a significant amount
of hardware. A different approach, which considers the
complex multiplication as one operation and handles it at
the bit-level is presented in [1] for parallel and serial
implementations. This results in a significant reduction of
the hardware and the circuit layout complexity. Another
approach for the parallel implementation of the complex
multiplication is given in [2]. The hardware efficiency is
improved by combining the computations of the imaginary
and real part in the same Wallace tree.
In this paper, a new scheme is proposed for complex
multiplication when the one factor is constant. This is the
case for the FFT computation where the constant factors
are the quantities exp[(2pk/n)j], with k ¼ 0 . . . n 1. The
representation of the real and imaginary parts of the
constant coefficient in CSD form [3, 4] is suggested, because
r IEE, 2003
IEE Proceedings online no. 20030270
doi:10.1049/ip-cds:20030270
Paper first received 5th May 2000 and in revised form 26th April 2001. Online
publishing date: 19 August 2003
The authors are with the Department of Electrical and Computer Engineering,
National Technical University of Athens, 157 73 Zographou, Athens, Greece
IEE Proc.-Circuits Devices Syst., Vol. 150, No. 5, October 2003
it minimises the required hardware. The other factor is in
two’s complement form. A new algorithm for such a
complex multiplication is also developed and presented in
this paper.
For the implementation of this algorithm the serial/
parallel approach is applied, because it yields more efficient
circuits from the aspect of hardware complexity. The
disadvantages of this approach are that the resulting circuit
is not systolic and operates with 50% efficiency. Therefore,
a technique, referred to in [5] and used in [6] for
non-constant binary multipliers, is properly modified and
used to allow the proposed scheme to operate with 100%
efficiency, with negligible hardware and combinational
delay overhead.
2
Description of the algorithm
Let us consider the multiplication M ¼ AX, where X is a
complex number X ¼ Xr+jXi with Xr and Xi in two’s
complement form
Xr ¼ xr;n1 2n1 þ
n2
X
xr;l 2l
ð2Þ
n2
X
xi;l 2l
ð3Þ
l¼0
Xi ¼ xi;n1 2n1 þ
l¼0
and A is the constant complex coefficient A ¼ Ar+jAi with
Ar and Ai in CSD form
m2
X
ar;k 2k ;
ar;k ¼ 0; 1
ð4Þ
m2
X
ai;k 2k ;
ai;k ¼ 0; 1
ð5Þ
M ¼ Mr þ jMi ¼ ðAr Xr Ai Xi Þ þ jðAr Xi þ Ai Xr Þ
ð6Þ
Ar ¼ ar;m1 2m1 þ
k¼0
Ai ¼ ai;m1 2m1 þ
k¼0
Thus, (1) can be written as following
We have four multiplications of constant numbers in CSD
form with numbers in two’s complement form. First, an
algorithm for such a multiplication is presented. Next, this
algorithm is extended for the complex multiplication
according to (6).
405
imaginary part Mi is
We start from the following relation
P ¼bY ¼
m1
X
Pk 2k ¼
m1
X
bk Y 2k
Cti ¼ 2mþn þ
ð7Þ
k¼0
k¼0
k¼0
where Y is a two’s complement number
Y ¼ yn1 2n1 þ
n2
X
þ
y l 2l
ð8Þ
and b is a constant coefficient in CSD form
m1
X
m1
X
bk 2k ; bk ¼ 0; 1
ð9Þ
By applying the relation yl ¼ 1 þ y l in (7), where y l is
the inversion of yl, the partial product Pk ¼ bkY is written
as following
8
n2
P
>
>
y l 2l þ 1; for bk ¼ 1
> 2n1 þ yn1 2n1 þ
>
<
l¼0
nP
2
Pk ¼
n1
n1
>
2
þ y n1 2
þ
yl 2l ; for bk ¼ 1
>
>
>
l¼0
:
0; for bk ¼ 0
Cti ¼ 2mþn1 þ
From the above equations, it is clear that the partial
product Pk is the binary number y n1 yn1 y1 y0 for
positive bk , or yn1 y n1 y 1 y 0 for negative bk plus a ‘1’
at the most significant position. Also, for negative bk, a ‘1’ at
the least significant position must be added. Equation (10)
can be written more concisely as follows:
þ Y þ nk
ð11Þ
where
8
nP
2
>
>
y l 2l ;
> yn1 2n1 þ
>
<
l¼0
nP
2
Y ¼
n1
>
y n1 2
þ
yl 2l ;
>
>
>
l¼0
:
0; for bk ¼ 0
for bk ¼ 1
ð12Þ
for bk ¼ 1
nk is ‘1’ for negative bk and zero otherwise, and jbk j is the
absolute value of bk. Thus, the final product is
P¼
m1
X
k¼0
Pk ¼
m1
X
jbk j2n1þk þ
m1
X
Y 2k þ
n k 2k
k¼0
k¼0
k¼0
m1
X
ð13Þ
According to (13), the following constant term Ct is
included in the product.
Ct ¼
m1
X
k¼0
jbk j2n1þi þ
m1
X
nk 2k
ð14Þ
k¼0
For the result to be obtained in two’s complement form, the
Pm1
sum i¼0
jbk j2n1þi in (14) must be converted to the
same form. Thus, the constant term Ct is written as
Ct ¼ 2mþn1 þ
m1
X
k¼0
2n1þk jbk j þ 2n1 þ
m1
X
nk 2k
ð15Þ
þ
ðnr;k þ ni;k Þ2mþn1 þ 2n1
ð17Þ
k¼0
The corresponding constant term for the real part of the
result Mr is
Ctr ¼ 2mþn1 þ
m1
X
ar;k þ ai;k 1 2n1þk
k¼0
þ
m1
X
ðnr;k þ pi;k Þ2mþn1 þ 2n1
ð18Þ
k¼0
where pi,k is ‘1’ only when ai,k ¼ 1 and zero otherwise.
3
Circuit implementation
The proposed scheme is based on the serial–parallel
multiplier [7]. An example for ar ¼ 010101 and ai ¼
100100 is given in Fig. 1. The data enter the circuit through
the two inputs Xr and Xi. The imaginary and real parts of
the result are obtained from the outputs Mi and Mr
respectively. Two bits, the one from Ar and the other from
Ai, correspond to each cell. If these bits are both non-zero,
the corresponding cell includes two cascaded full-adders. If
only one of these bits is zero, one full-adder is included,
which is fed from Xr or Xi depending on which of the bits
ar,k or ai,k is non-zero. If bits ar,k and ai,k are both zero, no
full-adder is included, and the corresponding cell is empty.
The empty cells are also shown in Fig. 2 as dashed boxes.
According to (12), the inputs Xr and Xi, for the upper
(imaginary) part of the multiplier, are inverted if the
corresponding bits ai,k and ar,k are negative. Also, the inputs
Xr and Xi in the lower (real) part are inverted if the bits ar,k
and ai,k are negative and positive, respectively. The control
signal R2 is activated at the same time with the entrance of
the MSB of Xr and Xi, and through the XOR gates inverts
the data inputs. The control signal R1 is activated when the
LSB of Xr and Xi enters the circuit and initialises the carry
and sum delay elements.
Each of the constant terms Cti and Ctr is incorporated in
the above scheme as two separated parts. The low order
parts are
Cti;L ¼
k¼0
By extending the above algorithm for the complex multiplication based on (6), a total constant term is deduced from
the corresponding terms of the two added products for
the imaginary and real part of the result. This term for the
406
m1
X
ar;k þ ai;k 1 2n1þk
k¼0
m1
X
ð10Þ
Pk ¼ jbk j2
ð16Þ
where nr;k and ni;k are ‘1’ for ar;k ¼ 1 and ai;k ¼ 1
respectively and zero otherwise.
By
applying
the
relation
2mþn1 þ 2n1 ¼
Pm1 n1þk
in (16), the following equation is obtained:
k¼0 2
k¼0
n1
ðnr;k þ ni;k Þ2k þ 2n
k¼0
l¼0
b¼
m1
X
ar;k þ ai;k 2n1þk
m1
X
ðnr;k þ ni;k Þ2k
ð19Þ
m1
X
ðnr;k þ pi;k Þ2k
ð20Þ
k¼0
Ctr;L ¼
k¼0
IEE Proc.-Circuits Devices Syst., Vol. 150, No. 5, October 2003
r,5 = 0
i,5 = −1
r,4 = 1
i,4 = 0
r,2 = 1
i,2 = −1
r,3 = 0
i,3 = 0
R1
R1
R1
0
0
D
0
C
FA S D
CtH
r,0 = −1
i,0 = 0
R1
1
0
D
D
R
r,1 = 0
i,1 = 0
C
FA S D
D
0
D
R1
D
R1 C
C
FA S
0
D
0
R1
R1
D
C
FA S
0
0
Mr
R1
S
FA
R2
Xr
Xi
FA
S
1
R
0 1
R
0 1
CtH
FA S D
C
D
1
Fig. 1
R1
R1
R1
0
FA S D
C
0
D
1
D
Mi
FA S
C
D
R1
R1
R1
2-input
multiplexer
delay
element
D
0
R1
D
D
R1
D
0
FA S D
C
0
R2
Complex multiplier for ar ¼ 010101 and ai ¼ 100100
r,4 = 1
i,4 = 0
r,5 = 0
i,5 = −1
r,3 = 0
i,3 = 0
R
R
D
r,0 = −1
i,0 = 0 R
0
D
C
R
FA S Mr,H
D
R
D
D
D
S
R
R
C HA
D
D
1
D
R
C
FA S
D
0
1
D
D
R
R
R
R0
0
C
0
FA S D
D
D
D
R
R
R
0
D
R
R
R
D
0
r,1 = 0
i,1 = 0
r,2 = 1
i,2 = −1
D
0
R
R
D
C
0
FA S D
0
R
C
D
R
D
C
FA S
0
Mr,L
R
S
FA
R
Xr
Xi
FA
S
1
0R
0R
R
D
FA S D
1
C
D
0
R
HA
DS
D
D
R
R
D
Fig. 2
D
C
R
delay
element
R
D
1
Mi,L
FA S
C
D
R
R
R
D
1
R
FA S D
0
C
D
R
R
0
0
0
R
R
D
R
D
0
D
FA S D
C
R
R
R
R
D
D
D
D
R
2-input
multiplexer
D
D
0
R
FA S Mi,H
C
D
R
Complex multiplier for ar ¼ 010101 and ai ¼ 100100 operating with 100% efficiency
The high order parts are
CtH ¼ Cti;H ¼ Ctr;H
¼ 2mþn1 þ
m1
X
ar;k þ ai;k 1 2n1þk þ 2n1
k¼0
ð21Þ
IEE Proc.-Circuits Devices Syst., Vol. 150, No. 5, October 2003
The addition of Ctr,L and Cti,L is performed by initialising
properly with the control signal R1, the carry delay elements
of the full-adders at the first clock cycle of the multiplication. According to (19) and (20), when the corresponding coefficient bits are both zero, the corresponding bits of
Ctr,L and Cti,L are also zero, and no initialisation is
required. When they are both non-zero, the carries of both
407
cascaded full-adders are initialised. The parts CtH can be
added serially from the free sum inputs of the leftmost
cell as shown in Fig. 1. Obviously, significant extra
hardware is required in this case for storing and shifting
these terms.
The above scheme operates with 50% efficiency, because
zero words must be inserted between the successive data
words that enter the circuit through the lines Xr and Xi. To
achieve 100% operational efficiency, the following
technique [5, 6] is used. When the MSB of Xr and Xi enters
the circuit, the output of the least significant part of the
result $M_L = M_{r,L} + jM_{i,L}$ is being completed. At the same
time, the most significant part $M_H = M_{r,H} + jM_{i,H}$ is
already stored in carry–save form in the sum and carry
delay elements of each full-adder. At the next cycle, the real
and imaginary parts are loaded into two double shift
registers placed at the lower and upper parts of the
multiplier. We name the two shift registers of each double
shift register ‘sum’ and ‘carry’ depending on the data they
store.
While this loading takes place, the full-adder rows already
begin the next multiplication. The
content of each double shift register is shifted and added
through a serial adder, which converts the most significant
part of the result to conventional binary form. Thus, the
circuit operates in a pipelined manner and 100% operational
efficiency is achieved. Applying the above technique to
the circuit of Fig. 1, we obtain the scheme shown in Fig. 2.
The low order parts of the result are obtained from the
outputs Mr,L and Mi,L, and the high order part from the
outputs Mr,H and Mi,H. The control signal R is activated
when the LSB of the new Xr and Xi enters the circuit, and
loads the carries and sums of the current multiplication into
the double shift registers.
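The conversion performed by the serial adder can be illustrated with a small behavioural model. The sketch below is not the authors' circuit: it assumes that the 'sum' and 'carry' registers hold two equal-length words with aligned bit weights, shifted out LSB first, and it models the serial adder as one full-adder with a one-bit carry store. The helper names `to_bits`, `from_bits` and `serial_add` are hypothetical.

```python
def to_bits(value, width):
    """Return `value` as a list of `width` bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an unsigned integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(sum_bits, carry_bits):
    """Add two LSB-first bit streams with one full-adder and a carry store.

    One loop iteration models one clock cycle. The final carry is dropped,
    so the result is taken modulo 2**len(sum_bits).
    """
    carry = 0
    out = []
    for s, c in zip(sum_bits, carry_bits):
        total = s + c + carry
        out.append(total & 1)   # sum output of the full-adder
        carry = total >> 1      # carry kept for the next cycle
    return out


# Assumed example: a 6-bit carry-save pair holding 11 and 6.
result = from_bits(serial_add(to_bits(11, 6), to_bits(6, 6)))
print(result)
```

Because only one full-adder is used, the conversion takes as many clock cycles as there are bits, which is why it can overlap with the next multiplication in the full-adder rows.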
Ctr,L and Cti,L are incorporated by initialising the carries
of the full-adders. The bits of Ctr,H and Cti,H are loaded
into the carry shift register at the same time as the sums
and carries. According to (21), a ‘1’ from Ctr,H and Cti,H
corresponds to every empty cell, and a zero to the cells that
include one full-adder for the real and one for the imaginary
part. A ‘−1’ corresponds to the 4-full-adder cells. For numbers
in CSD form, at least one zero digit always exists between
non-zero digits. Thus, the adjacent cells of a 4-full-adder cell
are always empty cells. In the left adjacent empty cell the
corresponding digit of Ctr,H and Cti,H is ‘1’. Combining the
‘1’ of this cell with the ‘−1’ of the 4-full-adder cell gives a
‘1’ in the 4-full-adder cell and a zero in the left adjacent
cell, since $2^{k+1} - 2^{k} = 2^{k}$.
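The CSD property used here can be checked on the constant of the figures. The sketch below is an illustration under assumptions: the conversion routine is the standard non-adjacent-form recoding, not a procedure given in the paper, the value $A = 19 - 36j$ is read off the digit labels of the figures, and the names `to_csd` and `cell_types` are hypothetical.

```python
def to_csd(value, m):
    """Return the m CSD digits of `value` (each -1, 0 or 1), LSB first."""
    digits = []
    n = value
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # n mod 4 == 1 -> +1, n mod 4 == 3 -> -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    if len(digits) > m:
        raise ValueError("value does not fit in m CSD digits")
    return digits + [0] * (m - len(digits))


def cell_types(ar, ai):
    """Classify each bit position by how many of its two digits are non-zero."""
    names = {0: "empty", 1: "2-full-adder", 2: "4-full-adder"}
    return [names[abs(r) + abs(i)] for r, i in zip(ar, ai)]


# Constant assumed from the figure labels: A = 19 - 36j, m = 6.
ar = to_csd(19, 6)
ai = to_csd(-36, 6)
cells = cell_types(ar, ai)

# CSD property: no two adjacent non-zero digits.
assert all(ar[k] * ar[k + 1] == 0 for k in range(5))
assert all(ai[k] * ai[k + 1] == 0 for k in range(5))

# Consequently both neighbours of a 4-full-adder cell are empty.
for k, kind in enumerate(cells):
    if kind == "4-full-adder":
        assert all(cells[j] == "empty" for j in (k - 1, k + 1) if 0 <= j < 6)

print(ar, ai, cells)
```

For this constant the only 4-full-adder cell is at position 2, and positions 1 and 3 are empty, which is the situation the merging of the ‘1’ and ‘−1’ digits relies on.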
To add the above ‘1’s, we exploit the delay elements of
the carry shift registers that are not used for carry loading.
Specifically, for every sequence of k empty cells the
corresponding delay elements of the carry shift register are
not used for loading except the rightmost, where the carry c
of the adjacent right non-empty cell is loaded. The k
corresponding bits of Cti,H and Ctr,H, which are ‘1’, are
added with c using the expression
$$\sum_{i=0}^{k-1} 2^{i} + c = c\,2^{k} + \sum_{i=0}^{k-1} \bar{c}\,2^{i} \qquad (22)$$
According to (22), the carry c must be loaded and
propagated leftwards through all the cells of the sequence
until the first non-empty cell. In every cell, c is loaded
inverted except the last cell.
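Identity (22) can be confirmed by exhaustive evaluation for small run lengths. The following sketch only checks the arithmetic; the function names `lhs` and `rhs` are hypothetical, and `c_inv` stands for the inverted carry.

```python
def lhs(k, c):
    """k ones (the constant bits of k empty cells) plus the loaded carry c."""
    return sum(2**i for i in range(k)) + c


def rhs(k, c):
    """Carry c forwarded to weight 2**k, inverted carry left in the k cells."""
    c_inv = 1 - c
    return c * 2**k + sum(c_inv * 2**i for i in range(k))


for k in range(1, 9):
    for c in (0, 1):
        assert lhs(k, c) == rhs(k, c)
```

With c = 0 both sides equal $2^{k} - 1$, and with c = 1 both equal $2^{k}$, so adding the run of ones costs no adder: it only requires loading the inverted carry into the cells of the run and the true carry into the cell beyond it.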
In the 4-full-adder cells, to which a ‘1’ of Cti,H and Ctr,H
corresponds, the carry arriving from the right adjacent
empty cell continues to propagate leftwards according
to (22). In the left adjacent empty cell, it is added to the
carry of the 4-full-adder cell through the half-adder, as
shown in Fig. 2. The carry of this half-adder continues to
propagate leftwards.
The terms $2^{n-1}$ in Cti,H and Ctr,H are added through the
free sum inputs of the leftmost cell when the LSB of X
enters the circuit, namely, one clock cycle after the
activation of R. Clearly, if the length n of X is greater
than the length m of A, this term must be added $n-m+1$
clocks after the activation of R. In this case, an appropriate
number of delay elements must be inserted in the sum
inputs of the leftmost cell. If $n < m$, then the terms $2^{n-1}$ must
be added through the sum input switches of the cell
located at the $(n-m)$th position from the left. The terms
$2^{n+m-1}$ are added by serially entering a ‘1’ into the left end
of each carry shift register.
The combinational delay of the above circuit is twice
that of a simple serial-parallel adder, because of the
cascaded full-adders in the 4-full-adder cells. To avoid this,
we apply a delay rearrangement [8–10] based on a graph
property, which transforms these cells into pipelined form.
According to this property, if we consider all the lines
intersected by a cut across a graph, we can remove one delay
element from every line that crosses the cut in one direction
and insert a new one into every line that crosses it in the
opposite direction. The resulting circuit is shown in Fig. 3.
The cut in the double cell is also shown in this figure.
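The cut rule can be sketched on an abstract graph. The example below is a minimal illustration and not the netlist of Fig. 3: the graph is a hypothetical three-node loop, edges are stored as a dictionary mapping (source, destination) to a delay count, and the names `retime_cut` and `loop_delay` are assumptions.

```python
def retime_cut(edges, inside):
    """Apply the cut rule to a graph given as {(src, dst): delay_count}.

    `inside` is the set of nodes on one side of the cut. Edges leaving that
    side lose one delay element; edges entering it gain one.
    """
    retimed = {}
    for (src, dst), delays in edges.items():
        if src in inside and dst not in inside:
            if delays == 0:
                raise ValueError(f"no delay to remove on edge {src}->{dst}")
            delays -= 1
        elif src not in inside and dst in inside:
            delays += 1
        retimed[(src, dst)] = delays
    return retimed


def loop_delay(edges, cycle):
    """Total number of delay elements around a closed path of nodes."""
    hops = zip(cycle, cycle[1:] + cycle[:1])
    return sum(edges[hop] for hop in hops)


# Hypothetical three-node loop, not taken from the paper's figures.
edges = {("a", "b"): 1, ("b", "c"): 1, ("c", "a"): 0}
retimed = retime_cut(edges, inside={"a"})

# The delay count around the loop is unchanged by the rearrangement.
assert loop_delay(edges, ["a", "b", "c"]) == loop_delay(retimed, ["a", "b", "c"])
print(retimed)
```

A closed path crosses any cut the same number of times in each direction, so the number of delay elements around every loop is preserved while delays are moved onto the lines that previously formed a long combinational path.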
As a result of the rearrangement, one delay element is
removed from the sum lines and the double shift registers,
and a new one is inserted into the lines Xr, Xi and R, the sum
lines connecting the cascaded full-adders and the lines that
propagate the carry leftwards. The switches that correspond
to the removed delay elements are merged with the switches
of the next delay elements. Consequently, these switches
are activated for two clock cycles and the sum and carry
loading in the 4-full-adder cells is spread over two cycles.
During the second cycle, the carry of the upper full-adder is
loaded into the shift register. This requires an extra
clock cycle at the end, namely, a zero bit between the data
words.
Another consequence of the above transformation is the
decrease of the broadcasting of the lines Xr, Xi and R.
Further reduction of this broadcasting can be achieved by
applying the previously mentioned graph property to the
empty cells. In this case, six delay elements are removed
from the sum lines and shift registers and five new ones are inserted
into the lines Xr, Xi, R and the lines that propagate the carry
leftwards. Obviously, this transformation cannot be applied
to the right adjacent cell of each 4-full-adder cell, because it
increases the combinational delay. The critical path delay of
the above scheme is equal to the delay of one XOR gate and
one full-adder.
In Table 1, the proposed scheme is compared, in terms of
hardware complexity, with a conventional scheme consisting
of four multipliers in which the constant term is
represented in two’s complement form. In that case the
number of non-zero bits is m/2 on average, where m is the
length of the constant term. To estimate the hardware
complexity of the proposed complex multiplier, we have
assumed the average case where the number of the zero bits
in the CSD representation is 2m/3. Two coefficient bits
correspond to every cell. Consequently, the proposed
scheme consists of 4m/9 empty cells, m/9 4-full-adder cells
and 4m/9 cells with two full-adders. Also, m/9 of the empty
cells are right adjacent to 4-full-adder cells and cannot be
transformed. The hardware estimation for each type of cell
is given in Table 2.
Fig. 3 Final form of the proposed complex multiplier for $a_r = 01010\bar{1}$ and $a_i = \bar{1}00\bar{1}00$
[Circuit diagram. CSD digits labelled in the figure: $a_{r,5},\dots,a_{r,0} = 0, 1, 0, 1, 0, -1$ and $a_{i,5},\dots,a_{i,0} = -1, 0, 0, -1, 0, 0$. Inputs: Xr, Xi; outputs: Mr,L, Mr,H, Mi,L, Mi,H; control signals: R1, R2 and R2,1. Legend: D, delay element; 2-input multiplexer; $R_{i,j}$: $R_i$, $R_j$.]
Table 1: Comparison of the proposed design with a four multipliers scheme

| Multiplication scheme | Hardware complexity per coefficient bit (transistors) | Efficiency |
|---|---|---|
| Four serial/parallel multipliers | 2FA + 6DE = 96 | 50% |
| Proposed scheme | (4m/3)FA + (2m/9)HA + (64m/9)DE + (26m/9)SW = 108 | 100% |

FA: full-adder (24 transistors); HA: half-adder (10 transistors); DE: dynamic delay element (8 transistors); SW: switch (6 transistors) [11]
Table 2: Hardware complexity for each type of cell of the proposed complex multiplier

| Cell type | No. of cells | Hardware complexity |
|---|---|---|
| 4-full-adder cells | m/9 | 4FA + 2HA + 11DE + 4SW |
| Untransformed empty cells | m/9 | 6DE + 6SW |
| Transformed empty cells | 3m/9 | 5DE |
| Cells with two full-adders | 4m/9 | 2FA + 8DE + 4SW |
| Total hardware complexity | | (4m/3)FA + (2m/9)HA + (64m/9)DE + (26m/9)SW |

FA: full-adder (24 transistors); HA: half-adder (10 transistors); DE: dynamic delay element (8 transistors); SW: switch (6 transistors) [11]
According to Table 1, the proposed design has lower
hardware complexity than a scheme consisting of four
multipliers, given that two circuits of the latter are
required to achieve 100% operational efficiency. A
significant part of this hardware saving is due to the CSD
representation. The proposed scheme is not fully systolic.
A delay element is inserted in the Xr, Xi and R lines every
three cells on average, which reduces the broadcasting of
these signals.
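The totals quoted in Tables 1 and 2 can be reproduced from the per-cell figures. The sketch below is a bookkeeping check only: it takes the cell shares and component counts of Table 2 and the transistor counts of the table footnote as given, and the names `CELLS`, `per_bit_components` and `transistor_count` are hypothetical.

```python
from fractions import Fraction

# Transistors per component, from the table footnote.
TRANSISTORS = {"FA": 24, "HA": 10, "DE": 8, "SW": 6}

# (cells per coefficient bit, components per cell), as listed in Table 2.
CELLS = {
    "4-full-adder":        (Fraction(1, 9), {"FA": 4, "HA": 2, "DE": 11, "SW": 4}),
    "untransformed empty": (Fraction(1, 9), {"DE": 6, "SW": 6}),
    "transformed empty":   (Fraction(3, 9), {"DE": 5}),
    "two full-adders":     (Fraction(4, 9), {"FA": 2, "DE": 8, "SW": 4}),
}


def per_bit_components(cells):
    """Average number of each component per coefficient bit."""
    totals = {name: Fraction(0) for name in TRANSISTORS}
    for share, parts in cells.values():
        for name, count in parts.items():
            totals[name] += share * count
    return totals


def transistor_count(components):
    """Weight a component tally by the transistor count of each component."""
    return sum(TRANSISTORS[name] * amount for name, amount in components.items())


proposed = per_bit_components(CELLS)
proposed_transistors = transistor_count(proposed)
conventional_transistors = transistor_count({"FA": 2, "DE": 6})

print(proposed)
print(float(proposed_transistors), conventional_transistors)
```

The tally gives 4/3 FA, 2/9 HA, 64/9 DE and 26/9 SW per coefficient bit, i.e. 976/9 (about 108) transistors, against 96 for one four-multiplier circuit and 192 for the two such circuits needed for 100% efficiency.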
4
Conclusions
In this paper, a serial complex multiplier for a constant factor
in CSD form has been presented. An algorithm for the
multiplication of a constant number in CSD form by a
number in two’s complement form has been introduced.
This algorithm significantly reduces the hardware
complexity. Moreover, the proposed circuit operates with 100%
efficiency. The algorithm can also be applied to the
implementation of a parallel complex constant number
multiplier. All circuits presented in this paper have been
extensively verified through simulation.
5
References
1 Pekmestzi, K.Z.: ‘Complex number multipliers’, IEE Proc. E, Comput.
Digit. Tech., 1989, 136, (1), pp. 70–75
2 Oklobdzija, V.G., Villeger, D., and Soulass, T.: ‘An integrated
multiplier for complex numbers’, J. VLSI Signal Process., 1994, 7,
pp. 213–221
3 Reitwiesner, G.W.: ‘Binary Arithmetic’, in ‘Advances in computers’
(Academic Press, New York, 1966), Vol. 1, pp. 261–265
4 Peled, A.: ‘On the hardware implementation of digital signal
processors’, IEEE Trans. Acoust. Speech Signal Process., 1976, 24,
pp. 76–86
5 Dadda, L., and Breveglieri, L.: ‘A modular bit serial convolver’.
Proceedings of 3rd IFIP, Wafer scale integration III, (North Holland,
Amsterdam, 1990), pp. 279–289
6 Even, G.: ‘Two’s complement pipeline multipliers’, Integr., VLSI J.,
1997, 22, pp. 23–38
7 Dadda, L.: ‘On serial-input multipliers for two’s complement
numbers’, IEEE Trans. Comput., 1989, 38, (9), pp. 1341–1345
8 Kung, S.Y.: ‘On supercomputing with systolic/wavefront arrays’,
Proc. IEEE, 1984, 72, (7), pp. 867–884
9 Caraiscos, C., and Pekmestzi, K.Z.: ‘Low-latency bit-parallel systolic
VLSI implementation of FIR digital filters’, IEEE Trans. Circuits
Syst. II, Analog Digit. Signal Process., 1996, 43, (7), pp. 529–534
10 Caraiscos, C., and Pekmestzi, K.Z.: ‘A class of systolic bit-serial
multipliers’, Int. J. Electron., 1994, 76, (3), pp. 463–468
11 Weste, N., and Eshraghian, K.: ‘Principles of CMOS VLSI design’
(Addison-Wesley, Reading, 1994)