Measures of Location
2.3.1 Measures of Location
Measures of location are used to determine where the data distribution is concentrated. The most common measures of location are presented next.
Commands 2.7. SPSS, STATISTICA, MATLAB and R commands used to obtain measures of location.
SPSS: Analyze; Descriptive Statistics
STATISTICA: Statistics; Basic Statistics/Tables; Descriptive Statistics
MATLAB: mean(x) ; trimmean(x,p) ; median(x) ; prctile(x,p)
R: mean(x, trim) ; median(x) ; summary(x) ; quantile(x,seq(...))
2.3.1.1 Arithmetic Mean
Let $x_1, \ldots, x_n$ be the data. The arithmetic mean (or simply mean) is:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \tag{2.5}$$
The arithmetic mean is the sample estimate of the mean of the associated random variable (see Appendices B and C). If one has a tally sheet of discrete-type data, one can also compute the mean using the absolute frequencies (counts), $n_k$, of each distinct value $x_k$:
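$$\bar{x} = \frac{1}{n} \sum_{k=1}^{m} n_k x_k \quad \text{with} \quad n = \sum_{k=1}^{m} n_k , \tag{2.6}$$
where $m$ denotes the number of distinct values.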
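As an illustration, the following minimal Python sketch computes the mean both directly from the data and from a tally of counts. The data values and the function name `mean_from_counts` are hypothetical, chosen only for this example.

```python
from collections import Counter


def mean_from_counts(counts):
    """Mean from a tally {distinct value x_k: absolute frequency n_k}."""
    n = sum(counts.values())  # total number of cases
    return sum(n_k * x_k for x_k, n_k in counts.items()) / n


data = [2, 3, 3, 5, 5, 5, 7]           # hypothetical discrete data
tally = Counter(data)                   # counts of each distinct value
direct = sum(data) / len(data)          # mean computed from the raw data
from_tally = mean_from_counts(tally)    # mean computed from the tally sheet
print(direct, from_tally)
```

Both computations add up the same total (30 for these seven cases), so they yield the same mean.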
If one has a frequency table of continuous-type data (also known in some literature as grouped data), with r bins, one can obtain an estimate of $\bar{x}$ using the relative frequencies $f_j$ of the bins and the mid-bin values, $\dot{x}_j$, as follows:
$$\bar{x} \approx \sum_{j=1}^{r} f_j \dot{x}_j . \tag{2.7}$$
This mean estimate used to be presented as an expeditious way of calculating the arithmetic mean for long tables of data. With the advent of statistical software, the usefulness of such a method is at least questionable. We will proceed no further with such a “grouped data” approach.
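For completeness only, the grouped-data estimate can be sketched in a few lines of Python. The bin edges, the counts and the function name `grouped_mean` are hypothetical; the sketch assumes that the frequencies used in the sum are relative frequencies (counts divided by the total number of cases).

```python
def grouped_mean(edges, counts):
    """Estimate the mean from bin edges and the bin counts."""
    n = sum(counts)
    # Mid-bin value of each of the r bins.
    mids = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
    # Weight each mid-bin value by the relative frequency of its bin.
    return sum((c / n) * m for c, m in zip(counts, mids))


edges = [0, 10, 20, 30]   # hypothetical bins [0,10), [10,20), [20,30]
counts = [2, 5, 3]        # hypothetical number of cases in each bin
estimate = grouped_mean(edges, counts)
print(estimate)
```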
Sometimes, when a dataset exhibits outliers and extreme cases (see 2.2.4) that can be suspected to be the result of gross measurement errors, one can use a trimmed mean, which neglects a certain percentage of the tail cases (e.g., 5%).
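A trimmed mean can be sketched as follows. This is a minimal Python illustration with hypothetical data; it assumes that the stated fraction is removed from each tail. Conventions differ between software packages on whether the percentage refers to each tail or to both tails together, so the documentation of the command being used should be checked.

```python
def trimmed_mean(data, fraction):
    """Mean after discarding `fraction` of the cases from EACH tail."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must lie in [0, 0.5)")
    xs = sorted(data)
    k = int(len(xs) * fraction)      # number of cases dropped per tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


# Hypothetical measurements with one suspicious extreme case (95).
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.1)    # drops the lowest and the highest case
print(plain, trimmed)
```

In this example the single extreme case pulls the plain mean well above the bulk of the data, whereas the trimmed mean stays close to it.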
The arithmetic mean is a point estimate of the expected value (true mean) of the random variable associated with the data and has the same properties as the true mean (see A.6.1). Note that the expected value can be interpreted as the center of gravity of a weightless rod with probability mass-points, in the case of discrete variables, or of a rod whose mass-density corresponds to the probability density function, in the case of continuous variables.
2.3.1.2 Median
The median of a dataset is the value of the data below which lie 50% of the cases. It is an estimate of the median, med(X), of the random variable, X, associated with the data, defined as:
$$F_X(x) = \frac{1}{2} \;\Rightarrow\; \operatorname{med}(X) = x , \tag{2.8}$$
where $F_X(x)$ is the distribution function of X. Note that, using the previous rod analogy for the continuous variable case, the median divides the rod into equal mass halves corresponding to equal areas under the density curve:
$$\int_{-\infty}^{\operatorname{med}(X)} f_X(x)\,dx = \int_{\operatorname{med}(X)}^{\infty} f_X(x)\,dx = \frac{1}{2} .$$
The median satisfies the same linear property as the mean (see A.6.1), but not the other properties (e.g. additivity). Compared to the mean, the median has the advantage of being quite insensitive to outliers and extreme cases.
Notice that, if we sort the dataset, the sample median is the central value when the number of data values is odd; when it is even, the median is computed as the average of the two most central values.
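The odd/even rule just described can be written as a short Python sketch. The function name `sample_median` and the data are hypothetical.

```python
def sample_median(data):
    """Central value (odd n) or average of the two central values (even n)."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2


odd_case = sample_median([5, 1, 3])            # central value of 1, 3, 5
even_case = sample_median([4, 1, 3, 2])        # average of 2 and 3
with_outlier = sample_median([1, 2, 3, 4, 100])
print(odd_case, even_case, with_outlier)
```

The last call illustrates the insensitivity to extreme cases: replacing 100 by any other value larger than 3 leaves the median unchanged.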
2.3.1.3 Quantiles
The quantile of order α (0 < α < 1) of a random variable distribution $F_X(x)$ is defined as the root of the equation (see A.5.2):
$$F_X(x) = \alpha . \tag{2.9}$$
We denote the root as $x_\alpha$. Likewise, we compute the quantile of order α of a dataset as the value below which lies a fraction α of the cases of the dataset. The median is therefore the 50% quantile, or $x_{0.5}$. Often used quantiles are:
– Quartiles, corresponding to multiples of 25% of the cases. The box plot mentioned in 2.2.4 uses the quartiles and the inter-quartile range (IQR) in order to determine the outliers of the dataset distribution.
– Deciles, corresponding to multiples of 10% of the cases.
– Percentiles, corresponding to multiples of 1% of the cases. We will often use the percentile p = 2.5% and its complement p = 97.5%.
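A dataset quantile can be sketched in Python as follows. The function name `empirical_quantile` and the data are hypothetical, and the rule used (take the smallest data value such that at least a fraction α of the cases lie at or below it) is only one of several conventions. Statistical packages usually interpolate between neighbouring values, so their results may differ slightly from this sketch; in particular, for an even number of cases the 50% quantile obtained here is the lower of the two central values rather than their average.

```python
import math


def empirical_quantile(data, alpha):
    """Smallest data value with at least a fraction alpha of cases at or below it."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    xs = sorted(data)
    k = math.ceil(alpha * len(xs))   # number of cases at or below the quantile
    return xs[max(k, 1) - 1]


data = list(range(1, 21))            # hypothetical dataset: 1, 2, ..., 20
quartiles = [empirical_quantile(data, a) for a in (0.25, 0.5, 0.75)]
low = empirical_quantile(data, 0.025)
high = empirical_quantile(data, 0.975)
print(quartiles, low, high)
```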
2.3.1.4 Mode
The mode of a dataset is its most frequently occurring value. It is an estimate of the value at which the probability or density function attains its maximum. For continuous-type data one should determine the midpoint of the modal bin of the data grouped into an appropriate number of bins. When a data distribution exhibits several relative maxima of almost equal value, we say that it is a multimodal distribution.
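Both situations can be sketched in Python. For discrete data the mode is read from a tally as the most frequent value; for continuous data the sketch groups the values into equal-width bins and returns the midpoint of the bin with the highest count. The function name `modal_bin_midpoint`, the data and the choice of four bins are hypothetical; choosing an appropriate number of bins remains a matter of judgement.

```python
from collections import Counter


def modal_bin_midpoint(data, n_bins):
    """Midpoint of the most populated of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must not be constant")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value is assigned to the last bin.
        j = min(int((x - lo) / width), n_bins - 1)
        counts[j] += 1
    j_max = counts.index(max(counts))    # first bin with the highest count
    return lo + (j_max + 0.5) * width


discrete = [2, 3, 3, 5, 5, 5, 7]                         # hypothetical data
discrete_mode = Counter(discrete).most_common(1)[0][0]   # most frequent value

continuous = [1.0, 2.1, 2.4, 2.6, 2.9, 3.5, 4.2, 5.0]    # hypothetical data
mode_estimate = modal_bin_midpoint(continuous, 4)
print(discrete_mode, mode_estimate)
```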
Example 2.5
Q: Consider the Cork Stoppers’ dataset. Determine the measures of location of the variable PRT. Comment on the results. Imagine that we had a new variable, PRT1, obtained by the following linear transformation of PRT: PRT1 = 0.2 PRT + 5. Determine the mean and median of PRT1.
A: Table 2.6 shows some measures of location of the variable PRT. Notice that as a mode estimate we can use the midpoint of the bin [355.3, 606.7], as shown in Figure 2.17, i.e., 481. Notice also the values of the lower and upper quartiles delimiting 50% of the cases. The large deviation of the 95th percentile from the upper quartile, when compared to the deviation of the 5th percentile from the lower quartile, is evidence of a right-skewed asymmetrical distribution.
By the linear properties of the mean and the median, we have:
Mean(PRT1) = 0.2 Mean(PRT) + 5 = 147; Median(PRT1) = 0.2 Median(PRT) + 5 = 131.
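The linear property used here can be checked numerically. The following Python sketch uses a small set of hypothetical values (not the cork stopper data) and the same transformation coefficients, 0.2 and 5.

```python
from statistics import mean, median

x = [120, 300, 450, 800, 1500]     # hypothetical values, NOT the PRT data
a, b = 0.2, 5                      # coefficients of the linear transformation
y = [a * v + b for v in x]         # transformed variable

mean_direct = mean(y)              # mean computed on the transformed data
mean_linear = a * mean(x) + b      # mean obtained from the linear property
median_direct = median(y)
median_linear = a * median(x) + b
print(mean_direct, mean_linear, median_direct, median_linear)
```

Since a is positive, the transformation preserves the ordering of the cases, which is why the central case of x maps onto the central case of y.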
Table 2.6. Location measures (computed with STATISTICA) for variable PRT of the cork stopper dataset (150 cases).
Mean | Median | Lower quartile | Upper quartile | Percentile 5% | Percentile 95%
An important aspect to be considered, when using values computed with statistical software, is the precision of the results expressed by the number of significant digits. Almost every software product will produce results with a large number of digits, independent of whether or not they mean something. For instance, in the case of the PRT variable (Table 2.6) it would be foolish to publish that the mean of the total perimeter of the defects of the cork stoppers is 710.3867. First of all, the least significant digit is, in this case, the unit (no perimeter can be measured in fractions of the pixel unit; see Appendix E). Thus, one would have to publish a value rounded to the units, in this case 710. Second, there are omnipresent measurement errors that must be accounted for. Assuming that the perimeter measurement error³ is of one unit, then the mean is 710 ± 1. As a matter of fact, even this one-unit precision for the mean is somewhat misleading, as we will see in the following chapter. From now on the published results will take this issue into consideration and may, therefore, appropriately round the results obtained with the software products.
The R functions also provide a large number of digits, as when calculating the mean of PRT:
> mean(PRT)
[1] 710.3867
However, the summary function provides a reasonable rounding:
> summary(PRT)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  104.0   412.0   629.0   710.4   968.5  1612.0
³ Denoting by $\Delta x$ a single data measurement error, the mean of n measurements has an error of $\pm (n\,|\Delta x|)/n = \pm |\Delta x|$ in the worst case.
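The worst-case bound stated above can be made explicit. The short derivation below is a sketch under the assumption that every individual measurement error is bounded in magnitude by $|\Delta x|$; the symbol $\Delta x_i$, for the error of the i-th measurement, is introduced here only for illustration.

```latex
\begin{aligned}
\Delta\bar{x} &= \frac{1}{n}\sum_{i=1}^{n}\,(x_i + \Delta x_i) \;-\; \frac{1}{n}\sum_{i=1}^{n} x_i
               \;=\; \frac{1}{n}\sum_{i=1}^{n} \Delta x_i , \\
|\Delta\bar{x}| &\le \frac{1}{n}\sum_{i=1}^{n} |\Delta x_i|
               \;\le\; \frac{1}{n}\, n\, |\Delta x| \;=\; |\Delta x| .
\end{aligned}
```

Equality holds only when all the individual errors have the maximum magnitude and the same sign, which is why this is a worst-case figure.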