Test for Independence (Categorical Data)

10.12 Test for Independence (Categorical Data)

The chi-squared test procedure discussed in Section 10.11 can also be used to test the hypothesis of independence of two variables of classification. Suppose that we wish to determine whether the opinions of the voting residents of the state of Illinois concerning a new tax reform are independent of their levels of income. Members of a random sample of 1000 registered voters from the state of Illinois are classified as to whether they are in a low, medium, or high income bracket and whether or not they favor the tax reform. The observed frequencies are presented in Table 10.6, which is known as a contingency table.

Table 10.6: 2 × 3 Contingency Table

                      Income Level
Tax Reform    Low    Medium    High    Total
For
Against
Total

374 Chapter 10 One- and Two-Sample Tests of Hypotheses

A contingency table with r rows and c columns is referred to as an r × c table (“r × c” is read “r by c”). The row and column totals in Table 10.6 are called marginal frequencies. Our decision to accept or reject the null hypothesis, $H_0$, of independence between a voter’s opinion concerning the tax reform and his or her level of income is based upon how good a fit we have between the observed frequencies in each of the 6 cells of Table 10.6 and the frequencies that we would expect for each cell under the assumption that $H_0$ is true. To find these expected frequencies, let us define the following events:

L: A person selected is in the low-income level.

M: A person selected is in the medium-income level.

H: A person selected is in the high-income level.

F: A person selected is for the tax reform.

A: A person selected is against the tax reform.

By using the marginal frequencies, we can estimate the probability of each of these events as the corresponding marginal total divided by the grand total of 1000.

Now, if $H_0$ is true and the two variables are independent, we should have

$$
\begin{aligned}
P(L \cap F) &= P(L)P(F), \\
P(L \cap A) &= P(L)P(A), \\
P(M \cap F) &= P(M)P(F), \\
P(M \cap A) &= P(M)P(A), \\
P(H \cap F) &= P(H)P(F), \\
P(H \cap A) &= P(H)P(A).
\end{aligned}
$$

The expected frequencies are obtained by multiplying each cell probability by the total number of observations. As before, we round these frequencies to one decimal. Thus, the expected number of low-income voters in our sample who favor the tax reform is estimated to be $1000 \, P(L)P(F)$ when $H_0$ is true. The general rule for obtaining the expected frequency of any cell is given by the following formula:

$$
\text{expected frequency} = \frac{(\text{column total}) \times (\text{row total})}{\text{grand total}}.
$$
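The rule can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequencies` is our own, and the counts are hypothetical values chosen for easy arithmetic, not the entries of Table 10.6.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence:
    (row total) x (column total) / (grand total) for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts (NOT the data of Table 10.6).
observed = [[30, 50, 20],
            [20, 30, 50]]

expected = expected_frequencies(observed)
for row in expected:
    # Round to one decimal, as the text does.
    print([round(e, 1) for e in row])
```

Each row and column of `expected` sums to the same marginal total as the observed table, which is a convenient check on the arithmetic.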

The expected frequency for each cell is recorded in parentheses beside the actual observed value in Table 10.7. Note that the expected frequencies in any row or column add up to the appropriate marginal total. In our example, we need to compute only two expected frequencies in the top row of Table 10.7 and then find the others by subtraction. The number of degrees of freedom associated with the chi-squared test used here is equal to the number of cell frequencies that may be filled in freely when we are given the marginal totals and the grand total, and in this illustration that number is 2. A simple formula providing the correct number of degrees of freedom is

v = (r − 1)(c − 1).
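To see why only (r − 1)(c − 1) cells are free, the following sketch fills in a table by subtraction once those cells and the marginal totals are given. The helper names and the numbers are hypothetical and serve only as an illustration.

```python
def degrees_of_freedom(r, c):
    """Number of cells that may be filled in freely: (r - 1)(c - 1)."""
    return (r - 1) * (c - 1)


def complete_by_subtraction(free_cells, row_totals, col_totals):
    """Given the first c - 1 entries of the first r - 1 rows, recover the
    remaining cells from the marginal totals by subtraction."""
    # The last entry of each of the first r - 1 rows follows from its row total.
    rows = [list(row) + [rt - sum(row)]
            for row, rt in zip(free_cells, row_totals)]
    # The last row follows from the column totals.
    last_row = [ct - sum(col) for ct, col in zip(col_totals, zip(*rows))]
    return rows + [last_row]


# Hypothetical marginal totals for a 2 x 3 table (not those of Table 10.7).
row_totals = [100, 100]
col_totals = [50, 80, 70]

v = degrees_of_freedom(2, 3)   # two cells are free
table = complete_by_subtraction([[25.0, 40.0]], row_totals, col_totals)
print(v, table)
```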

Table 10.7: Observed and Expected Frequencies

                      Income Level
Tax Reform    Low    Medium    High    Total
For
Against
Total                          313     1000

Hence, for our example, v = (2 − 1)(3 − 1) = 2 degrees of freedom. To test the null hypothesis of independence, we use the following decision criterion.

Test for Independence: Calculate

$$
\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i},
$$

where the summation extends over all rc cells in the r × c contingency table. If $\chi^2 > \chi^2_\alpha$ with v = (r − 1)(c − 1) degrees of freedom, reject the null hypothesis of independence at the α-level of significance; otherwise, fail to reject the null hypothesis.
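The decision criterion can be sketched as follows. The counts are hypothetical, not those of Table 10.6, and the function names are our own. The critical value is computed from a closed form that holds only for v = 2, where the upper-tail area of the chi-squared distribution beyond x is exp(−x/2); for other degrees of freedom a table such as Table A.5 or a statistical library would be needed.

```python
import math


def chi_squared_statistic(observed):
    """Sum of (o - e)^2 / e over all cells, with e computed from the
    marginal totals under the hypothesis of independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            stat += (o - e) ** 2 / e
    return stat


def critical_value_2df(alpha):
    """Upper critical value for v = 2 ONLY: solves exp(-x / 2) = alpha."""
    return -2.0 * math.log(alpha)


# Hypothetical 2 x 3 table, so v = (2 - 1)(3 - 1) = 2.
observed = [[30, 50, 20],
            [20, 30, 50]]

stat = chi_squared_statistic(observed)
critical = critical_value_2df(0.05)
reject = stat > critical
print(round(stat, 2), round(critical, 3), reject)
```

For α = 0.05 the closed form gives approximately 5.991, which agrees with the tabulated value for 2 degrees of freedom.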

Applying this criterion to our example, we find that

From Table A.5 we find that $\chi^2_{0.05} = 5.991$ for v = (2 − 1)(3 − 1) = 2 degrees of freedom. The null hypothesis is rejected and we conclude that a voter’s opinion concerning the tax reform and his or her level of income are not independent.

It is important to remember that the statistic on which we base our decision has a distribution that is only approximated by the chi-squared distribution. The computed $\chi^2$-values depend on the cell frequencies and consequently are discrete. The continuous chi-squared distribution seems to approximate the discrete sampling distribution of $\chi^2$ very well, provided that the number of degrees of freedom is greater than 1. In a 2 × 2 contingency table, where we have only 1 degree of freedom, a correction called Yates’ correction for continuity is applied. The corrected formula then becomes

$$
\chi^2(\text{corrected}) = \sum_i \frac{(|o_i - e_i| - 0.5)^2}{e_i}.
$$
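As an illustration, the sketch below computes both the uncorrected and the corrected statistic for a hypothetical 2 × 2 table. The counts and the function name are assumptions made for this example and do not come from the text.

```python
def chi_squared_2x2(observed, yates=False):
    """Chi-squared statistic for a 2 x 2 table. With yates=True, each
    |o - e| is reduced by 0.5 before squaring (Yates' correction)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            diff = abs(o - e)
            if yates:
                diff -= 0.5
            stat += diff ** 2 / e
    return stat


# Hypothetical 2 x 2 table; its expected frequencies are 9, 11, 9, and 11.
observed = [[12, 8],
            [6, 14]]

uncorrected = chi_squared_2x2(observed)
corrected = chi_squared_2x2(observed, yates=True)
print(round(uncorrected, 3), round(corrected, 3))
```

The corrected value is smaller than the uncorrected one here, so applying the correction makes the test more conservative.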

If the expected cell frequencies are large, the corrected and uncorrected results are almost the same. When the expected frequencies are between 5 and 10, Yates’ correction should be applied. For expected frequencies less than 5, the Fisher-Irwin exact test should be used. A discussion of this test may be found in Basic Concepts of Probability and Statistics by Hodges and Lehmann (2005; see the Bibliography). The Fisher-Irwin test may be avoided, however, by choosing a larger sample.