2.3.6 Measures of Association for Nominal Variables
Assume we have a multivariate dataset whose variables are of nominal type and we intend to measure their level of association. In this case, the correlation coefficient approach cannot be applied, since covariance and standard deviations are not applicable to nominal data. We need another approach that uses the contingency table information, much as we did when computing the gamma coefficient for ordinal data.
Commands 2.11. SPSS, STATISTICA, MATLAB and R commands used to obtain measures of association for nominal variables.

SPSS: Analyze; Descriptive Statistics; Crosstabs
STATISTICA: Statistics; Basic Statistics/Tables; Tables and Banners; Options
MATLAB: kappa(x,alpha)
R: kappa(x,alpha)
Measures of association for nominal variables are obtained in SPSS and STATISTICA as a result of applying contingency table analysis (see Commands 5.7).
The kappa statistic can be computed with SPSS only when the values of the first variable match the values of the second variable. STATISTICA does not provide the kappa statistic.
The MATLAB Statistics toolbox and the R stats package do not provide a function for computing the kappa statistic. We provide, however, MATLAB and R functions for that purpose on the book CD (see Appendix F).
2.3.6.1 The Phi Coefficient
Let us first consider a bivariate dataset with nominal variables that only have two values (dichotomous variables), as in the case of the 2 × 2 contingency table shown in Table 2.11.
In the case of a full association of both variables one would obtain a 100% frequency for the values along the main diagonal of the table, and 0% otherwise. Based on this observation, the following index of association, φ (phi coefficient), is defined:
$$\varphi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} . \qquad 2.26$$
Note that the denominator of φ ensures a value in the interval [−1, 1], as with the correlation coefficient, with +1 representing a perfect positive association and −1 a perfect negative association. As a matter of fact, the phi coefficient is a special case of the Pearson correlation.
Table 2.11. A general cross table for the bivariate dichotomous case.

            y1         y2         Total
x1          a          b          a + b
x2          c          d          c + d
Total       a + c      b + d      a + b + c + d
Example 2.9
Q: Consider the 2 × 2 contingency table for the variables SEX and INIT of the Freshmen dataset, shown in Table 2.12. Compute their phi coefficient.
A: The value of phi computed with formula 2.26 is 0.15, suggesting a very low degree of association. The significance of phi values will be discussed in Chapter 5.
Table 2.12. Cross table (obtained with SPSS) of variables SEX and INIT of the freshmen dataset.

                                  INIT
                            yes       no        Total
SEX    male    Count        91        5         96
               % of Total   69.5%     3.8%      73.3%
       female  Count        30        5         35
               % of Total   22.9%     3.8%      26.7%
Total          Count        121       10        131
               % of Total   92.4%     7.6%      100.0%
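As an illustration, formula 2.26 can be checked against Example 2.9 using the counts of Table 2.12. The following R sketch is ours (phi is an illustrative name, not a library function):

```r
# Phi coefficient of a 2 x 2 contingency table (formula 2.26).
# Illustrative sketch; 'phi' is not a standard R function.
phi <- function(a, b, c, d) {
  (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
}

phi(91, 5, 30, 5)   # counts of Table 2.12; yields approximately 0.15
```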
2.3.6.2 The Lambda Statistic
Another useful measure of association, for multivariate nominal data, attempts to evaluate how well one of the variables predicts the outcome of the other variable. This measure is applicable to any nominal variables, either dichotomous or not. We will explain it using Table 2.4, by attempting to estimate the contribution of variable SEX in lowering the prediction error of Q4 (“liking to be initiated”). For that purpose, we first note that if nothing is known about the sex, the best prediction of the Q4 outcome is the “agree” category, the so-called modal category,
with the highest frequency of occurrence (37.9%). In choosing this modal category, we expect to be in error 62.1% of the time. On the other hand, if we know the sex (i.e., we know the full table), we would choose as prediction outcome the “agree” category if it is a male (then expecting 73.5 − 28 = 45.5% errors), and the “fully agree” category if it is a female (then expecting 26.5 − 11.4 = 15.1% errors).
Let us denote:

i. $Pe_c$ ≡ percentage of errors using only the columns = 100 − percentage of the modal column category;

ii. $Pe_{cr}$ ≡ percentage of errors using also the rows = sum along the rows of (percentage of the row − percentage of the modal column category in that row).
The λ measure (Goodman and Kruskal lambda) of proportional reduction of error, when predicting the columns from the rows, is defined as:

$$\lambda_{cr} = \frac{Pe_c - Pe_{cr}}{Pe_c} . \qquad 2.27$$
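With the Table 2.4 figures quoted above, $Pe_c = 100 - 37.9 = 62.1$ and $Pe_{cr} = 45.5 + 15.1 = 60.6$, so that

$$\lambda_{cr} = \frac{62.1 - 60.6}{62.1} \approx 0.024 ,$$

anticipating the value found in Example 2.10 below.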
Similarly, for the prediction of the rows from the columns, we have:

$$\lambda_{rc} = \frac{Pe_r - Pe_{rc}}{Pe_r} . \qquad 2.28$$
The coefficient of mutual association (also called symmetric lambda) is a weighted average of both lambdas, defined as:

$$\lambda = \frac{\text{average reduction in errors}}{\text{average number of errors}} = \frac{(Pe_c - Pe_{cr}) + (Pe_r - Pe_{rc})}{Pe_c + Pe_r} . \qquad 2.29$$
The lambda measure always ranges between 0 and 1, with 0 meaning that the independent variable is of no help in predicting the dependent variable and 1 meaning that the independent variable perfectly specifies the categories of the dependent variable.
Example 2.10
Q: Compute the lambda statistics for Table 2.4.
A: Using formula 2.27 we find $\lambda_{cr}$ = 0.024, suggesting that knowing the sex is of almost no help in predicting the outcome of Q4. We also find $\lambda_{rc}$ = 0 and $\lambda$ = 0.017. The significance of the lambda statistic will be discussed in Chapter 5.
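The count version of formula 2.27 is easy to sketch in R; gk_lambda below is our illustrative name, not a standard R function, and works on any contingency table of counts (the percentage formulation above is recovered by dividing every count by the grand total):

```r
# Goodman-Kruskal lambda (formula 2.27) from a table of counts:
# prediction errors for the column category, without and with
# knowledge of the row category. Illustrative sketch only.
gk_lambda <- function(t) {
  pe_c  <- sum(t) - max(colSums(t))            # errors using only the modal column
  pe_cr <- sum(rowSums(t) - apply(t, 1, max))  # errors using each row's modal cell
  (pe_c - pe_cr) / pe_c
}

# Hypothetical 2 x 3 table of counts:
t <- matrix(c(10, 20, 5, 8, 4, 12), nrow = 2, byrow = TRUE)
gk_lambda(t)   # about 0.229
```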
2.3.6.3 The Kappa Statistic
The kappa statistic is used to measure the degree of agreement for categorical variables. Consider the cross table shown in Figure 2.19 where the r rows are
objects to be assigned to one of c categories (columns). Furthermore, assume that k judges assigned the objects to the categories, with $n_{ij}$ representing the number of judges that assigned object i to category j.

The counts along each row sum to k. Let $c_j$ denote the sum of the counts along column j. If all the judges were in perfect agreement, one column would be filled in with k and the others with zeros, i.e., one of the $c_j$ would be rk and the others zero. The proportion of objects assigned to the j-th category is:

$$p_j = c_j / (rk) .$$
If the judges make their assignments at random, the expected proportion of agreement for each category is $p_j^2$ and the total expected agreement over all categories is:

$$P(E) = \sum_{j=1}^{c} p_j^2 . \qquad 2.30$$
The extent of agreement, $s_i$, concerning the i-th object, is the ratio of the number of pairs of judges in agreement to the total number of possible pairs:

$$s_i = \sum_{j=1}^{c} \binom{n_{ij}}{2} \Bigg/ \binom{k}{2} .$$
The total proportion of agreement is the average of these proportions across all objects:

$$P(A) = \frac{1}{r} \sum_{i=1}^{r} s_i . \qquad 2.31$$
The κ (kappa) statistic, based on the formulas 2.30 and 2.31, is defined as:
P ()() A − P E
1 − P () E
If there is complete agreement among the judges, then κ = 1 (P(A) = 1, P(E) = 0). If there is no agreement among the judges other than what would be expected by chance, then κ = 0 (P(A) = P(E)).
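A minimal R sketch of this computation follows; kappa_stat is our illustrative name and is not the kappa(x,alpha) function supplied on the book CD (which additionally takes a significance level alpha). Each row of x holds the counts $n_{ij}$ for one object, so every row sums to k:

```r
# Kappa statistic of formulas 2.30 and 2.31 for a matrix x of
# counts n_ij (objects in rows, categories in columns).
kappa_stat <- function(x) {
  r <- nrow(x)
  k <- sum(x[1, ])                     # number of judges per object
  p_j <- colSums(x) / (r * k)          # proportion assigned to each category
  P_E <- sum(p_j^2)                    # chance agreement, formula 2.30
  s_i <- rowSums(x * (x - 1) / 2) / choose(k, 2)  # x(x-1)/2 = C(n_ij, 2)
  P_A <- mean(s_i)                     # observed agreement, formula 2.31
  (P_A - P_E) / (1 - P_E)
}
```

Applied to the N-S-P matrix of Table 2.13, it should reproduce the κ value obtained in Example 2.11 below.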
Example 2.11
Q: Consider the FHR dataset, which includes 51 foetal heart rate cases, classified by three human experts (E1C, E2C, E3C) and an automatic diagnostic system (SPC) into three categories: normal (0), suspect (1) and pathologic (2). Determine the degree of agreement among all 4 classifiers (experts and automatic system).
A: We use the N, S and P variables, which contain the data in the appropriate contingency table format, shown in Table 2.13. For instance, object #1 was classified N by one of the classifiers (judges) and S by three of the classifiers.
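For this object, with k = 4 judges, the $s_i$ formula gives

$$s_1 = \left[ \binom{1}{2} + \binom{3}{2} + \binom{0}{2} \right] \Bigg/ \binom{4}{2} = \frac{0 + 3 + 0}{6} = 0.5 .$$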
Running the function kappa(x,0.05) in MATLAB or R, where x is the data matrix corresponding to the N-S-P columns of Table 2.13, we obtain κ = 0.213, which suggests some agreement among all 4 classifiers. The significance of the kappa values will be discussed in Chapter 5.
Table 2.13. Contingency table for the N, S and P categories of the FHR dataset. Object #