6.7 THE PROCESS OF QUANTIZATION AND ERROR CHARACTERIZATIONS
From the discussion of number representations in the previous section, it should be clear that a general infinite-precision real number must be assigned to one of the finite representable numbers, given a specific structure for the finite-length register (that is, the arithmetic as well as the format). In practice, there are usually two different operations by which this assignment is made to the nearest number or level: the truncation operation and the rounding operation. These operations affect the accuracy as well as the general characteristics of digital filters and DSP operations.
We assume, without loss of generality, that there are $B + 1$ bits in the fixed-point (fractional) arithmetic or in the mantissa of floating-point arithmetic, including the sign bit. Then the resolution ($\Delta$) is given by

$$\Delta = 2^{-B} \quad \begin{cases} \text{absolute in the case of fixed-point arithmetic} \\ \text{relative in the case of floating-point arithmetic} \end{cases} \tag{6.45}$$

Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s).
6.7.1 FIXED-POINT ARITHMETIC
The quantizer block diagram in this case is given by

x (infinite-precision) −→ Quantizer Q[·] −→ Q[x] (finite-precision)

where $B$, the number of fractional bits, and $\Delta$, the resolution, are the parameters of the quantizer. We will denote the finite word-length number, after quantization, by $Q[x]$ for an input number $x$. Let the quantization error be given by

$$e \triangleq Q[x] - x \tag{6.46}$$

We will analyze this error for both the truncation and the rounding operations.

Truncation operation In this operation, the number $x$ is truncated beyond $B$ significant bits (that is, the rest of the bits are eliminated) to obtain $Q_T[x]$. In MATLAB, to obtain a B-bit truncation, we first scale the number x upward by $2^B$, then use the fix function on the scaled number, and finally scale the result down by $2^{-B}$. Thus the MATLAB statement xhat = fix(x*2^B)/2^B; implements the desired operation. We will now consider each of the three formats.
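Before turning to the individual formats, the scale, fix, and rescale steps can be illustrated on two numbers. This is only a minimal sketch: the sample values 0.7 and -0.7 and the choice B = 2 are arbitrary assumptions made here for illustration, not values taken from the text.

```matlab
B = 2;                     % number of fractional bits (assumed for illustration)
x = [0.7 -0.7];            % arbitrary sample values (assumed)
xhat = fix(x*2^B)/2^B;     % scale up by 2^B, drop the fractional part, scale down
e = xhat - x;              % quantization error e = Q[x] - x
disp([x; xhat; e]);        % rows: x, truncated value, error
```

Because fix rounds toward zero, the positive sample is pulled down to 0.5 and the negative sample is pulled up to -0.5, so the two errors have opposite signs.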
Sign-magnitude format If the number $x$ is positive, then after truncation $Q_T[x] \le x$, since some value in $x$ is lost. Hence the quantizer error for truncation, denoted by $e_T$, is less than or equal to 0, or $e_T \le 0$. However, since there are $B$ bits in the quantizer, the maximum error in terms of magnitude is

$$|e_T|_{\max} = 0{\scriptstyle\blacktriangle}\underbrace{00\cdots0}_{B\text{ bits}}111\cdots = 2^{-B} \text{ (decimal)} \tag{6.47}$$

or

$$-2^{-B} \le e_T \le 0, \quad \text{for } x \ge 0 \tag{6.48}$$

Similarly, if $x < 0$, then after truncation $Q_T[x] \ge x$, since $Q_T[x]$ is less negative, or $e_T \ge 0$. The largest magnitude of this error is again $2^{-B}$, or

$$0 \le e_T \le 2^{-B}, \quad \text{for } x < 0 \tag{6.49}$$
Chapter 6
IMPLEMENTATION OF DISCRETE-TIME FILTERS
FIGURE 6.25 Truncation error characteristics in the sign-magnitude format
EXAMPLE 6.20 Let $-1 < x < 1$ and $B = 2$. Using MATLAB, verify the truncation error characteristics.

Solution The resolution is $\Delta = 2^{-2} = 0.25$. Using the following MATLAB script, we can verify the truncation error $e_T$ relations given in (6.48) and (6.49).

x = [-1+2^(-10):2^(-10):1-2^(-10)];     % Sign-Mag numbers between -1 and 1
B = 2;                                  % Number of bits for Truncation
xhat = fix(x*2^B)/2^B;                  % Truncation
plot(x,x,'g',x,xhat,'r','linewidth',1); % Plot

The resulting plots of $x$ and $\hat{x}$ are shown in Figure 6.25. Note that the plot of $\hat{x}$ has a staircase shape and that it satisfies (6.48) and (6.49).
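The plot gives a visual check; the same relations can also be tested numerically. The following sketch reuses the grid of the example and evaluates the error bounds directly. The variable names eT, pos, okPos, and okNeg are introduced here for illustration only.

```matlab
B = 2;                                   % number of fractional bits
x = -1+2^(-10):2^(-10):1-2^(-10);        % same grid as in the example
xhat = fix(x*2^B)/2^B;                   % sign-magnitude truncation
eT = xhat - x;                           % truncation error e_T = Q_T[x] - x
pos = (x >= 0);                          % logical index of nonnegative samples
okPos = all(eT(pos) <= 0 & eT(pos) >= -2^(-B));    % relation (6.48)
okNeg = all(eT(~pos) >= 0 & eT(~pos) <= 2^(-B));   % relation (6.49)
fprintf('(6.48) holds: %d, (6.49) holds: %d\n', okPos, okNeg);
```

Since every grid point is an exact multiple of $2^{-10}$, the comparisons are not affected by floating-point roundoff.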
One’s-complement format For $x \ge 0$, we have the same characteristics for $e_T$ as in the sign-magnitude format, that is,

$$-2^{-B} \le e_T \le 0, \quad \text{for } x \ge 0 \tag{6.50}$$

For $x < 0$, the representation is obtained by complementing all bits, including the sign bit. To compute the maximum error, let

$$x = 1{\scriptstyle\blacktriangle}b_1 b_2 \cdots b_B 000\cdots = -\{{\scriptstyle\blacktriangle}(1-b_1)(1-b_2)\cdots(1-b_B)111\cdots\}$$

After truncation, we obtain

$$Q_T[x] = 1{\scriptstyle\blacktriangle}b_1 b_2 \cdots b_B = -\{{\scriptstyle\blacktriangle}(1-b_1)(1-b_2)\cdots(1-b_B)\}$$
Clearly, $x$ is more negative than $Q_T[x]$, that is, $x \le Q_T[x]$, or $e_T \ge 0$. In fact, the maximum truncation error is

$$e_{T\max} = 0{\scriptstyle\blacktriangle}00\cdots0111\cdots = 2^{-B} \text{ (decimal)} \tag{6.51}$$

FIGURE 6.26 Truncation error characteristics in the one’s-complement format

EXAMPLE 6.21 Again let $-1 < x < 1$ and $B = 2$ with the resolution $\Delta = 2^{-2} = 0.25$. Using a MATLAB script, verify the truncation error $e_T$ relations given in (6.50) and (6.51).
Solution The MATLAB script uses functions sm2oc and oc2sm, which are explored in Problem P6.25.

x = [-1+2^(-10):2^(-10):1-2^(-10)];     % Sign-Magnitude numbers between -1 and 1
B = 2;                                  % Select bits for Truncation
y = sm2oc(x,B);                         % Sign-Mag to One's Complement
yhat = fix(y*2^B)/2^B;                  % Truncation
xhat = oc2sm(yhat,B);                   % One's-Complement to Sign-Mag
plot(x,x,'g',x,xhat,'r','linewidth',1); % Plot

The resulting plots of $x$ and $\hat{x}$ are shown in Figure 6.26. Note that the plot of $\hat{x}$ is identical to the plot in Figure 6.25 and that it satisfies (6.50) and (6.51).
Two’s-complement format Once again, for $x \ge 0$, we have

$$-2^{-B} \le e_T \le 0, \quad \text{for } x \ge 0 \tag{6.52}$$

For $x < 0$, the representation is given by $2 - |x|$, where $|x|$ is the magnitude. Hence the magnitude of $x$ is given by

$$|x| = 2 - x \tag{6.53}$$

with $x = 1{\scriptstyle\blacktriangle}b_1 b_2 \cdots b_B b_{B+1}\cdots$. After truncation to $B$ bits, we obtain $Q_T[x] = 1{\scriptstyle\blacktriangle}b_1 b_2 \cdots b_B$, the magnitude of which is

$$|Q_T[x]| = 2 - Q_T[x] \tag{6.54}$$

From (6.53) and (6.54)

$$|Q_T[x]| - |x| = x - Q_T[x] = 1{\scriptstyle\blacktriangle}b_1 b_2 \cdots b_B b_{B+1}\cdots - 1{\scriptstyle\blacktriangle}b_1 b_2 \cdots b_B = 0{\scriptstyle\blacktriangle}00\cdots0\,b_{B+1}\cdots \tag{6.55}$$

The largest change in magnitude from (6.55) is

$$0{\scriptstyle\blacktriangle}00\cdots0111\cdots = 2^{-B} \text{ (decimal)} \tag{6.56}$$

Since the change in the magnitude is positive, after truncation $Q_T[x]$ becomes more negative, which means that $Q_T[x] \le x$. Hence

$$-2^{-B} \le e_T \le 0, \quad \text{for } x < 0 \tag{6.57}$$

EXAMPLE 6.22 Again consider $-1 < x < 1$ and $B = 2$ with the resolution $\Delta = 2^{-2} = 0.25$. Using MATLAB, verify the truncation error $e_T$ relations given in (6.52) and (6.57).
Solution The MATLAB script uses functions sm2tc and tc2sm, which are explored in Problem P9.4.

x = [-1+2^(-10):2^(-10):1-2^(-10)];     % Sign-Magnitude numbers between -1 and 1
B = 2;                                  % Select bits for Truncation
y = sm2tc(x);                           % Sign-Mag to Two's Complement
yhat = fix(y*2^B)/2^B;                  % Truncation
xhat = tc2sm(yhat);                     % Two's-Complement to Sign-Mag
plot(x,x,'g',x,xhat,'r','linewidth',1); % Plot

The resulting plots of $x$ and $\hat{x}$ are shown in Figure 6.27. Note that the plot of $\hat{x}$ is also a staircase graph but is below the $x$ graph and that it satisfies (6.52) and (6.57).
Collecting results (6.48)–(6.52) and (6.57), along with Figures 6.25–6.27, we conclude that the truncation characteristics for fixed-point arithmetic are the same for the sign-magnitude and the one’s-complement formats but are different for the two’s-complement format.
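This conclusion can also be explored without the format-conversion functions. The sketch below rests on an assumption that should be stated plainly: truncating a two’s-complement word is modeled by rounding toward minus infinity (floor), while sign-magnitude and one’s-complement truncation are both modeled by rounding toward zero (fix). These stand-ins are illustrative and are not the sm2tc/tc2sm approach used in the examples.

```matlab
B = 2;                                 % number of fractional bits
x = -1+2^(-10):2^(-10):1-2^(-10);      % same grid as in the examples
xhatSM = fix(x*2^B)/2^B;               % stand-in: sign-magnitude / one's-complement
xhatTC = floor(x*2^B)/2^B;             % stand-in: two's-complement (assumption)
eSM = xhatSM - x;                      % error under round-toward-zero
eTC = xhatTC - x;                      % error under round-toward-minus-infinity
neg = (x < 0);                         % logical index of negative samples
fprintf('x<0, fix:   error in [%g, %g]\n', min(eSM(neg)), max(eSM(neg)));
fprintf('x<0, floor: error in [%g, %g]\n', min(eTC(neg)), max(eTC(neg)));
```

For nonnegative inputs the two models coincide; they differ only for negative inputs, where the error is nonnegative under fix and nonpositive under floor, in line with (6.49) and (6.57).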
FIGURE 6.27 Truncation error characteristics in the two’s-complement format
Rounding operation In this operation, the real number $x$ is rounded to the nearest representable level, which we will refer to as $Q_R[x]$. In MATLAB, to obtain a B-bit rounding approximation, we first scale the number x up by $2^B$, then use the round function on the scaled number, and finally scale the result down by $2^{-B}$. Thus the MATLAB statement xhat = round(x*2^B)/2^B; implements the desired operation.

Since the quantization step or resolution is $\Delta = 2^{-B}$, the magnitude of the maximum error is

$$|e_R|_{\max} = \frac{\Delta}{2} = \frac{1}{2}2^{-B} \tag{6.58}$$

Hence for all three formats, the quantizer error due to rounding, denoted by $e_R$, satisfies

$$-\frac{1}{2}2^{-B} \le e_R \le \frac{1}{2}2^{-B} \tag{6.59}$$

EXAMPLE 6.23 Demonstrate the rounding operations and the corresponding error characteristics on the signal of Examples 6.20–6.22 using the three formats.

Solution Since the rounding operation can assign values that are larger than the unquantized values, which can create problems for the two’s- and one’s-complement formats, we will restrict the signal to the interval $[-1, 1 - 2^{-B-1}]$. The following MATLAB script shows the two’s-complement format rounding; the scripts for the other formats are similar (readers are encouraged to verify).
FIGURE 6.28 Rounding error characteristics in the fixed-point representation (panels: sign-magnitude, one’s-complement, and two’s-complement formats; each panel plots x and xhat against x)
B = 2;                          % Select bits for Rounding
x = [-1:2^(-10):1-2^(-B-1)];    % Sign-Magnitude numbers between -1 and 1
y = sm2tc(x);                   % Sign-Mag to Two's Complement
yq = round(y*2^B)/2^B;          % Rounding
xq = tc2sm(yq);                 % Two's-Complement to Sign-Mag

The resulting plots for the sign-magnitude, one’s-complement, and two’s-complement formats are shown in Figure 6.28. These plots do satisfy (6.59).
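The bound (6.59) can also be checked numerically. The sketch below applies round directly to the signal, with no format conversion, which is a simplification made here for illustration. The names eR and okR are introduced for this check only.

```matlab
B = 2;                             % number of fractional bits
x = -1:2^(-10):1-2^(-B-1);         % interval used in the example
xhat = round(x*2^B)/2^B;           % rounding to the nearest level
eR = xhat - x;                     % rounding error e_R = Q_R[x] - x
okR = all(abs(eR) <= 2^(-B)/2);    % relation (6.59)
fprintf('max |e_R| = %g, bound = %g, (6.59) holds: %d\n', ...
        max(abs(eR)), 2^(-B)/2, okR);
```

MATLAB's round resolves ties away from zero, so inputs that lie exactly halfway between two levels attain the bound with equality.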
Comparing the error characteristics of the truncation and rounding operations given in Figures 6.25 through 6.28, it is clear that the rounding operation yields the better quantization error characteristics. This is because the error is symmetric with respect to zero (that is, it is equally distributed between positive and negative values) and because the error is the same across all three formats. Hence we will mostly consider the rounding operation for the floating-point arithmetic as well as for further analysis.
6.7.2 FLOATING-POINT ARITHMETIC
In this arithmetic, the quantizer affects only the mantissa $M$. However, the number $x$ is represented by $M \times 2^E$, where $E$ is the exponent. Hence the quantizer errors are multiplicative and depend on the magnitude of $x$. Therefore the more appropriate measure of error is the relative error rather than the absolute error, $Q[x] - x$. Let us define the relative error, $\varepsilon$, as

$$\varepsilon \triangleq \frac{Q[x] - x}{x} \tag{6.60}$$

Then the quantized value $Q[x]$ can be written as

$$Q[x] = x + \varepsilon x = x(1 + \varepsilon) \tag{6.61}$$
When $Q[x]$ is due to the rounding operation, the error in the mantissa lies in the interval $\left[-\frac{1}{2}2^{-B}, \frac{1}{2}2^{-B}\right]$. In this case we will denote the relative error by $\varepsilon_R$. Then from (6.43), the absolute error, $Q_R[x] - x = \varepsilon_R x$, is between

$$\left(-\frac{1}{2}2^{-B}\right)2^{E} \le \varepsilon_R x \le \left(\frac{1}{2}2^{-B}\right)2^{E} \tag{6.62}$$

Now for a given $E$, and since the mantissa satisfies $\frac{1}{2} \le M < 1$ (this is not the IEEE-754 model), the number $x$ is between

$$2^{E-1} \le x < 2^{E} \tag{6.63}$$

Hence from (6.62) and using the smallest value in (6.63), we obtain

$$-2^{-B} \le \varepsilon_R \le 2^{-B} \tag{6.64}$$

This relative error relation, (6.64), will be used in subsequent analysis.
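As a rough numerical illustration of (6.64), the following sketch rounds only the mantissa of a few positive numbers and measures the resulting relative error. It relies on the two-output form of MATLAB's log2, which returns a mantissa in $[\frac{1}{2}, 1)$ and an exponent, matching the mantissa model above. The value B = 4 and the test values in x are arbitrary assumptions made here for illustration.

```matlab
B = 4;                               % mantissa fractional bits (assumed)
x = [0.3 1.7 5.25 100.1 0.0123];     % arbitrary positive test values (assumed)
[M, E] = log2(x);                    % x = M.*2.^E with 0.5 <= M < 1
Mq = round(M*2^B)/2^B;               % round the mantissa to B bits
xq = Mq.*2.^E;                       % quantized value Q_R[x]
epsR = (xq - x)./x;                  % relative error
okRel = all(abs(epsR) <= 2^(-B));    % relation (6.64)
disp([x; xq; epsR]);                 % rows: x, quantized value, relative error
```

Note that the absolute error xq - x grows with the exponent while the relative error stays within the same bound, which is why the relative error is the natural measure here.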