The Chi-Square Test of Independence
5.2.3 The Chi-Square Test of Independence
When testing hypotheses, one often has to decide whether two or more variables pertaining to the same population can be considered independent. To assess the independence of two variables we again use the contingency table formalism, now applied to a single population whose variables can be categorised into two or more categories. The variables can be either discrete (nominal or ordinal) or continuous. In the latter case, one must choose suitable categorisations for the continuous variables.
The r × c contingency table for this situation is the same as shown in Figure 5.4. The only difference is that, whereas in the previous section the rows represented different populations and the row totals were assumed to be fixed, the rows now represent categories of a second variable and the row totals can vary arbitrarily, constrained only by the fact that their sum is the total number of cases.
The test is formalised as:
H0: The event “an observation is in row i” is independent of the event “the same observation is in column j”, i.e.:
P(row i, column j) = P(row i) × P(column j), ∀ i, j.
H1: The events “an observation is in row i” and “the same observation is in column j” are dependent, i.e.:
∃ i, j: P(row i, column j) ≠ P(row i) × P(column j).
Let $r_i$ denote the row totals (and $c_j$ the column totals) as in Figure 2.18, such that:
$$r_i = \sum_{j=1}^{c} O_{ij} \quad \text{and} \quad n = r_1 + r_2 + \dots + r_r = c_1 + c_2 + \dots + c_c .$$
As before, we use the test statistic:
$$T = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad \text{with } E_{ij} = \frac{r_i c_j}{n}, \tag{5.24}$$
which asymptotically follows the chi-square distribution with df = (r – 1)(c – 1) degrees of freedom. Note, however, that since the row totals can vary in this situation, the exact probability associated with a given value of T is even more difficult to compute than before, because a greater number of possible tables yield the same T.
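The computation of T and of its degrees of freedom can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_independence` and the 2 × 2 table used to exercise it are hypothetical and do not come from the text.

```python
def chi_square_independence(observed):
    """Return (T, df, expected) for an r x c table of observed counts O_ij."""
    row_totals = [sum(row) for row in observed]         # r_i
    col_totals = [sum(col) for col in zip(*observed)]   # c_j
    n = sum(row_totals)                                 # total number of cases

    # Expected counts under H0: E_ij = r_i * c_j / n
    expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

    # T = sum over all cells of (O_ij - E_ij)^2 / E_ij
    t = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Degrees of freedom of the asymptotic chi-square distribution
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return t, df, expected


# Hypothetical 2 x 2 table, for illustration only (not from the text).
t, df, expected = chi_square_independence([[10, 20], [20, 10]])
print(round(t, 3), df)  # 6.667 1
```

In the illustrative table every expected count equals 15, so each of the four cells contributes 25/15 to T.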
Example 5.12
Q: Consider the Programming dataset, containing the results of pedagogical enquiries, made during the period 1986-1988, of freshmen attending the course “Programming and Computers” in the Electrotechnical Engineering Department of Porto University. Based on the evidence provided by the respective samples, is it possible to conclude that the performance obtained by the students at the final examination is independent of their previous knowledge of programming?
A: Note that we have a single population with two attributes: “previous knowledge of programming” (variable PROG), and “final examination score” (variable SCORE). In order to test the independence hypothesis of these two attributes, we first categorise the SCORE variable into four categories: “Poor”, corresponding to a final examination score below 10; “Fair”, corresponding to a score between 10 and 13; “Good”, corresponding to a score between 14 and 16; and “Very Good”, corresponding to a score above 16. Let us call this new categorised variable PERF (performance).
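The categorisation of SCORE into PERF might be sketched as follows. The function name `categorise_score` and the sample scores are hypothetical, and the sketch assumes integer-valued scores; how a fractional score such as 13.5 should be classified is not stated in the text, so the cut-offs below are an assumption in that case.

```python
def categorise_score(score):
    """Map a final examination score to a PERF category.

    Assumes integer scores; the treatment of fractional scores lying
    between two categories (e.g. 13.5) is an assumption of this sketch.
    """
    if score < 10:
        return "Poor"
    if score <= 13:
        return "Fair"
    if score <= 16:
        return "Good"
    return "Very Good"


# Hypothetical scores, for illustration only (not taken from the dataset).
scores = [7, 10, 13, 14, 16, 17]
perf = [categorise_score(s) for s in scores]
print(perf)
```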
The 3 × 4 contingency table, using variables PROG and PERF, is shown in Table 5.13. Only two (16.7%) of the twelve cells have expected counts below 5; therefore, the recommended conditions for using the asymptotic distribution of T, mentioned in the previous section, are met.
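The count of cells with small expected values can be checked directly from the observed counts of Table 5.13. The following is a minimal sketch; the variable names are illustrative, and the expected counts are computed from the table margins as row total times column total divided by n.

```python
# Observed counts from Table 5.13 (rows: PROG = 0, 1, 2;
# columns: Poor, Fair, Good, Very Good).
observed = [
    [76, 78, 16, 7],
    [19, 29, 10, 13],
    [2, 6, 7, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n
expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

# Cells whose expected count is below 5
small_cells = [e for row in expected for e in row if e < 5]
n_cells = len(row_totals) * len(col_totals)
share = len(small_cells) / n_cells

print(len(small_cells), n_cells, f"{100 * share:.1f}%")  # 2 12 16.7%
```

Both small expected counts occur in the PROG = 2 row, which has only 23 cases.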
The value of T is 43.044. The asymptotic chi-square distribution of T has (3 – 1)(4 – 1) = 6 degrees of freedom. At a 5% level, the critical region is above 12.59 and therefore the null hypothesis is rejected at that level. As a matter of fact, the observed significance of T is p ≈ 0.
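These figures can be reproduced from the observed counts of Table 5.13 with a short sketch in plain Python. The helper `chi2_sf_even_df` is a hypothetical name; it relies on the closed form of the chi-square upper-tail probability, which holds only for an even number of degrees of freedom (as is the case here, with df = 6). The critical value 12.59 is taken from the text rather than computed.

```python
import math

# Observed counts from Table 5.13 (rows: PROG = 0, 1, 2).
observed = [
    [76, 78, 16, 7],
    [19, 29, 10, 13],
    [2, 6, 7, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# T = sum of (O_ij - E_ij)^2 / E_ij, with E_ij = r_i * c_j / n
t = 0.0
for i, ri in enumerate(row_totals):
    for j, cj in enumerate(col_totals):
        e = ri * cj / n
        t += (observed[i][j] - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an EVEN number of df.

    Uses the closed form exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form is valid only for even, positive df")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


p = chi2_sf_even_df(t, df)
reject_at_5_percent = t > 12.59  # critical value quoted in the text

print(f"T = {t:.3f}, df = {df}, p = {p:.2e}, reject H0: {reject_at_5_percent}")
```

The resulting p is of the order of 1e-7, which is consistent with the reported p ≈ 0.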
Table 5.13. The 3 × 4 contingency table obtained with SPSS for the independence test of Example 5.12.

                                    PERF
PROG                     Poor    Fair    Good    Very Good    Total
0       Count              76      78      16         7         177
        Expected Count   63.4    73.8    21.6      18.3       177.0
1       Count              19      29      10        13          71
        Expected Count   25.4    29.6     8.6       7.3        71.0
2       Count               2       6       7         8          23
        Expected Count    8.2     9.6     2.8       2.4        23.0
Total   Count              97     113      33        28         271
        Expected Count   97.0   113.0    33.0      28.0       271.0
The chi-square test of independence can also be applied to assess whether two or more groups of data are independent or can be considered as sampled from the same population. For instance, the results obtained for Example 5.7 can also be interpreted as supporting, at a 5% level, that the male and female groups are not independent for variable Q7; they can be considered samples from the same population.