Measures of Spread

2.3.2 Measures of Spread

The measures of spread (or dispersion) indicate how concentrated a data distribution is. The most common measures of spread are presented next.

Commands 2.8. SPSS, STATISTICA, MATLAB and R commands used to obtain measures of spread and shape.

SPSS: Analyze; Descriptive Statistics

STATISTICA: Statistics; Basic Statistics/Tables; Descriptive Statistics

MATLAB: iqr(x) ; range(x) ; std(x) ; var(x) ; skewness(x) ; kurtosis(x)

R: IQR(x) ; range(x) ; sd(x) ; var(x) ; skewness(x) ; kurtosis(x)

2.3.2.1 Range

The range of a dataset is the difference between its maximum and its minimum, i.e.:

$R = x_{\max} - x_{\min}$.

The basic disadvantage of using the range as a measure of spread is that it depends on the extreme cases of the dataset. It also tends to increase with the sample size, which is an additional disadvantage.
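Both drawbacks of the range can be seen in a small Python sketch. Python is not one of the packages covered in Commands 2.8, and the helper name `data_range` and the data values are assumptions made up for illustration.

```python
import random

def data_range(values):
    """Range R = max - min of a non-empty sequence."""
    return max(values) - min(values)

# Hypothetical measurements (not taken from the book's datasets).
sample = [12.1, 9.8, 11.4, 10.7, 10.2, 11.9]
print(data_range(sample))            # range of the original sample
print(data_range(sample + [25.0]))   # a single extreme case inflates the range

# As more cases are added to the same sample, the range can only
# stay the same or grow; it never shrinks.
random.seed(0)
draws = [random.gauss(10, 1) for _ in range(1000)]
for n in (10, 100, 1000):
    print(n, round(data_range(draws[:n]), 2))
```

The last loop reuses nested prefixes of one simulated sample, so the printed ranges are non-decreasing by construction.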

2.3.2.2 Inter-quartile range

The inter-quartile range is defined as (see also section 2.2.4):

$\mathrm{IQR} = x_{0.75} - x_{0.25}$. (2.11)

The IQR is less influenced than the range by outliers and extreme cases. It also tends to be less influenced by the sample size (and can either increase or decrease as the sample grows).
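The robustness of the IQR can be sketched in Python with the standard library. The data are hypothetical, and the quartile rule used here (`statistics.quantiles` with `method="inclusive"`) is an assumption; statistical packages use slightly different quantile definitions, so their values may differ from the definition in section 2.2.4.

```python
import statistics

def iqr(values, method="inclusive"):
    """IQR = x_0.75 - x_0.25, with quartiles from statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method=method)
    return q3 - q1

# Hypothetical sample, then the same sample with one extreme case added.
sample = [9.8, 10.2, 10.7, 11.4, 11.9, 12.1, 12.6, 13.0]
with_outlier = sample + [40.0]

for name, data in (("original", sample), ("with outlier", with_outlier)):
    print(name, "range =", max(data) - min(data), "IQR =", iqr(data))
```

The extreme case changes the range by more than 25 units, while the IQR moves by a fraction of a unit.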

2.3.2.3 Variance

The variance of a dataset x 1 , …, x n (sample variance) is defined as:

$v \equiv s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$. (2.12)

The sample variance is the point estimate of the variance of the associated random variable (see Appendices B and C). It can be interpreted as the mean square deviation (or mean square error, MSE) of the sample values from their mean. The use of the n – 1 factor, instead of n as in the usual computation of a mean, is explained in C.2. Notice also that, given $\bar{x}$, only n – 1 cases can vary independently in order to achieve the same variance. We say that the variance has df = n – 1 degrees of freedom. The mean, on the other hand, has n degrees of freedom.
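As an illustration, the sample variance can be computed directly from its definition and compared with Python's standard library. The function name `sample_variance` and the data are assumptions chosen so that the arithmetic is easy to follow by hand (mean 5, sum of squared deviations 32).

```python
import statistics

def sample_variance(values):
    """v = sum((x_i - mean)^2) / (n - 1); needs at least two cases."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two cases")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Hypothetical data.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
v = sample_variance(sample)
print(v)                              # divisor n - 1
print(statistics.variance(sample))    # standard library, also divisor n - 1
print(statistics.pvariance(sample))   # divisor n, shown for comparison

# Degrees of freedom: the deviations from the mean sum to zero, so once
# n - 1 of them are known the remaining one is determined.
mean = sum(sample) / len(sample)
deviations = [x - mean for x in sample]
print(sum(deviations))
```

Here the n – 1 divisor gives 32/7, whereas dividing by n would give 4.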

2.3.2.4 Standard Deviation

The standard deviation of a dataset is the square root of its variance. It is, therefore, a root mean square error (RMSE):

$s \equiv \sqrt{v} = \left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1) \right]^{1/2}$.

The standard deviation is preferable to the variance as a measure of spread, since it is expressed in the same units as the original data. Furthermore, many interesting results about the spread of a distribution are expressed in terms of the standard deviation. For instance, for any random variable X with mean μ and standard deviation σ, the Chebyshev Theorem tells us that, for any k > 0 (see A.6.3):

$P(|X - \mu| \ge k\sigma) \le 1/k^2$.

Using s as a point estimate of σ and taking k = 2, we can then expect that, for any dataset distribution, at least 75 % of the cases lie within 2 standard deviations of the mean.
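The 75 % bound can be checked empirically. The following Python sketch uses two simulated datasets of different shape; the function name `fraction_within`, the seed and the simulated distributions are assumptions for illustration only.

```python
import random
import statistics

def fraction_within(values, k):
    """Fraction of cases strictly within k sample standard deviations of the mean."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # square root of the n - 1 variance
    return sum(abs(x - mean) < k * s for x in values) / len(values)

random.seed(1)
# Two hypothetical datasets with very different shapes.
symmetric = [random.gauss(0, 1) for _ in range(500)]
skewed = [random.expovariate(1.0) for _ in range(500)]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(name, round(fraction_within(data, 2), 3), "(bound: 0.75)")
```

The bound is conservative: it holds whatever the shape of the distribution, so the observed fractions are typically well above 0.75.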

Example 2.6

Q: Consider the Cork Stoppers’ dataset. Determine the measures of spread of the variable PRT. Imagine that we had a new variable, PRT1, obtained by the following linear transformation of PRT: PRT1 = 0.2 PRT + 5. Determine the variance of PRT1.

A: Table 2.7 shows measures of spread of the variable PRT. The sample variance enjoys the same linear transformation property as the true variance (see A.6.1). For the PRT1 variable we have:

variance(PRT1) = $(0.2)^2$ variance(PRT) = 5219.

Note that the addition of a constant to PRT (i.e., a scale translation) has no effect on the variance.
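The linear transformation property, variance(aX + b) = a² variance(X), can be verified numerically. The Python sketch below uses made-up values standing in for a variable such as PRT; they are not the actual cork stopper data, and the helper `linear_transform` is hypothetical.

```python
import statistics

def linear_transform(values, a, b):
    """Apply y = a * x + b to every case."""
    return [a * x + b for x in values]

# Hypothetical values, not the real PRT measurements.
x = [310.0, 452.0, 528.0, 601.0, 745.0, 880.0]
a, b = 0.2, 5.0
y = linear_transform(x, a, b)

var_x = statistics.variance(x)
var_y = statistics.variance(y)
print(var_y, a ** 2 * var_x)   # agree up to floating-point rounding

# A pure translation (a = 1) leaves the variance unchanged.
shifted = linear_transform(x, 1.0, b)
print(statistics.variance(shifted), var_x)
```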

Table 2.7. Spread measures (computed with STATISTICA) for variable PRT of the cork stopper dataset (150 cases).

Range Inter-quartile range Variance Standard Deviation