Pearson-like tests

3.2 Pearson-like tests

Pearson’s chi-squared test, applied to contingency tables, is often used in IRT modelling, and it is taken for granted that the test statistic is asymptotically chi-squared distributed. The applicability of this test to complicated models, however, is not trivial, and uncritical use may lead to serious errors. A theoretically satisfactory solution was presented by Glas and Verhelst (1989, 1995), who defined a broad class of Pearson-like tests that are asymptotically chi-squared distributed. Unfortunately, the computation of the test statistics is rather complicated; see Verhelst and Glas (1995) for a detailed account. In this chapter, only a brief account is given of an item-orientated test statistic in the Rasch model, labelled $S_i$. To remain in the general framework of Pearson-like tests, a $(k - 1) \times 2$ table is considered, the rows indicating the scores on the test and the columns indicating the quality of the answer to some item $i$: 1 for a correct answer and 0 for a wrong answer. Zero scores and perfect scores are omitted. See Table 9.1, where O indicates observed frequencies and E expected frequencies. Define $p_{i|s}$ as the proportion of correct answers to item $i$ in the group of students with score $s$. Similarly, define $\pi_{i|s}$ as the theoretical conditional probability (under the model) of a correct response, given that the score equals $s$. Clearly then, one can write

$$O_{s1} = n_s\, p_{i|s} \quad \text{and} \quad E_{s1} = n_s\, \pi_{i|s},$$

where $n_s$ is the number of students with score $s$. Using these definitions, the well-known expression for Pearson’s chi-squared statistic can be written as

$$X^2 = \sum_s \left[ \frac{n_s^2 (p_{i|s} - \pi_{i|s})^2}{n_s \pi_{i|s}} + \frac{n_s^2 (p_{i|s} - \pi_{i|s})^2}{n_s (1 - \pi_{i|s})} \right] = \sum_s \frac{n_s^2 (p_{i|s} - \pi_{i|s})^2}{n_s \pi_{i|s} (1 - \pi_{i|s})}.$$
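As a numerical check of this simplification, the following minimal sketch computes Pearson’s statistic once cell by cell, as the sum of (O − E)²/E over both columns of the table, and once with the compact form. The counts and probabilities are invented for illustration and the function names are hypothetical. The values of $\pi_{i|s}$ are treated as known, so no correction for parameter estimation is involved.

```python
import math

def pearson_by_cells(n, p, pi):
    """Sum (O - E)^2 / E over both columns (correct, wrong) of every score row."""
    total = 0.0
    for n_s, p_s, pi_s in zip(n, p, pi):
        o1, e1 = n_s * p_s, n_s * pi_s              # correct answers
        o0, e0 = n_s * (1 - p_s), n_s * (1 - pi_s)  # wrong answers
        total += (o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
    return total

def pearson_compact(n, p, pi):
    """Compact form: sum over scores of n_s (p - pi)^2 / (pi (1 - pi))."""
    return sum(n_s * (p_s - pi_s) ** 2 / (pi_s * (1 - pi_s))
               for n_s, p_s, pi_s in zip(n, p, pi))

# Hypothetical data for scores s = 1..4 (not taken from the chapter).
n = [40, 60, 60, 40]              # students per score group
p = [0.20, 0.45, 0.60, 0.85]      # observed proportions correct
pi = [0.25, 0.40, 0.60, 0.80]     # model probabilities, assumed known

x2_cells = pearson_by_cells(n, p, pi)
x2_compact = pearson_compact(n, p, pi)
print(round(x2_cells, 4), round(x2_compact, 4))
```

Both functions return the same value up to rounding error, because the two column terms of each row share the numerator $n_s^2 (p_{i|s} - \pi_{i|s})^2$ and their denominators combine into $n_s \pi_{i|s}(1 - \pi_{i|s})$.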

200 Different methodological orientations

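Table 9.1 Bivariate frequency table for item i

Score     Item response 1          Item response 0          Total
1         $O_{11}$ ($E_{11}$)      $O_{10}$ ($E_{10}$)      $n_1$
...       ...                      ...                      ...
s         $O_{s1}$ ($E_{s1}$)      $O_{s0}$ ($E_{s0}$)      $n_s$
...       ...                      ...                      ...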

If the theoretical probabilities $\pi_{i|s}$ were known exactly, the test statistic would be asymptotically chi-squared distributed with $k - 1$ degrees of freedom. But we only have estimates, and the problem arises because the estimate of $\pi_{i|s}$ depends on all item parameters: if we subtracted a degree of freedom for each estimated parameter, we would end up with zero degrees of freedom. This shows that the problem is not simple; indeed, it is technically quite involved. Generally speaking, the solution consists in applying a certain correction to the test statistic, which takes into account that the parameters have been estimated from the data. Details can be found in Verhelst and Glas (1995). The corrected statistics (indicated as $S_i$) are computed for the Rasch model and OPLM in the OPLM software package.

Apart from the theoretical burden of showing that the chi-squared distribution is correct, there is also a practical problem. The theoretical chi-squared distribution is only an approximation to the true distribution of $S_i$, and it is known that the approximation improves as the sample size increases. The practical problem is to know when the approximation is good enough to be useful with finite sample sizes. From research in statistics, it is known that Pearson’s statistic can give unreliable results if expected frequencies in the table become very small. To avoid such a situation, Table 9.1 may be condensed by combining adjacent score groups (the observed and expected frequencies in a number of adjacent rows are simply summed). In such a case the scores are grouped into $Q$ groups, $G_1, \ldots, G_q, \ldots, G_Q$. For example, the lowest score group $G_1 = \{1, 2, 3, 4\}$ means that the scores 1 to 4 are taken together to form one single score group. The expression for the approximate statistic $S_i^{*}$ is then given by

$$S_i^{*} = \sum_{q=1}^{Q} \frac{\left[ \sum_{s \in G_q} n_s (p_{i|s} - \pi_{i|s}) \right]^2}{\sum_{s \in G_q} n_s \pi_{i|s} (1 - \pi_{i|s})}.$$

If the necessary correction for the estimation of the item parameters is applied, the resulting statistic is (asymptotically) chi-squared distributed with Q – 1 degrees of freedom.
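The grouped sum of squares can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are invented, the function name `grouped_statistic` is hypothetical, and the correction for the estimation of the item parameters is not applied, so the returned value should not be referred to a chi-squared distribution with $Q - 1$ degrees of freedom as it stands.

```python
def grouped_statistic(n, p, pi, groups):
    """Uncorrected grouped statistic: for each score group, the squared sum of
    n_s (p - pi) divided by the sum of n_s pi (1 - pi), added over groups."""
    total = 0.0
    for group in groups:
        numerator = sum(n[s] * (p[s] - pi[s]) for s in group)
        denominator = sum(n[s] * pi[s] * (1 - pi[s]) for s in group)
        total += numerator ** 2 / denominator
    return total

# Hypothetical data keyed by score s (not taken from the chapter).
n = {1: 10, 2: 30, 3: 50, 4: 50, 5: 30, 6: 10}
p = {1: 0.10, 2: 0.30, 3: 0.50, 4: 0.60, 5: 0.80, 6: 0.90}
pi = {1: 0.20, 2: 0.30, 3: 0.45, 4: 0.60, 5: 0.75, 6: 0.90}

groups = [[1, 2], [3, 4], [5, 6]]          # Q = 3 groups of adjacent scores
s_star = grouped_statistic(n, p, pi, groups)

# With every score in its own group, the formula reduces to the ungrouped sum.
singletons = grouped_statistic(n, p, pi, [[s] for s in n])
print(round(s_star, 4), round(singletons, 4))
```

Note that the deviations are summed within a group before squaring, so positive and negative deviations in adjacent rows can cancel; this is why the grouped value differs from the ungrouped one.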

In the program package OPLM, groups of scores are formed such that the expected number of correct answers and the expected number of incorrect answers are both at least five in each group. Extensive simulation studies have shown that the distribution of the $S_i$ statistics is very well approximated by the chi-squared distribution.
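One simple way to satisfy such a minimum-frequency rule is a greedy pass over the scores. The sketch below is an assumption made for illustration, not a description of the algorithm actually used in OPLM, which is not given here; the function name `group_scores` and the data are hypothetical.

```python
def group_scores(n, pi, minimum=5.0):
    """Greedily merge adjacent scores until each group has at least `minimum`
    expected correct answers and `minimum` expected incorrect answers."""
    groups, current = [], []
    expected_correct = expected_wrong = 0.0
    for s in sorted(n):
        current.append(s)
        expected_correct += n[s] * pi[s]
        expected_wrong += n[s] * (1 - pi[s])
        if expected_correct >= minimum and expected_wrong >= minimum:
            groups.append(current)
            current, expected_correct, expected_wrong = [], 0.0, 0.0
    if current:                       # leftover scores at the top end
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)    # too little data for even one full group
    return groups

# Hypothetical data keyed by score s (not taken from the chapter).
n = {1: 10, 2: 30, 3: 50, 4: 50, 5: 30, 6: 10}
pi = {1: 0.20, 2: 0.30, 3: 0.45, 4: 0.60, 5: 0.75, 6: 0.90}

groups = group_scores(n, pi)
print(groups)
```

In this example the extreme scores do not reach five expected answers in both columns on their own, so they are merged with their neighbours, while the well-populated middle scores each form a group by themselves.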

An application: differential item functioning (DIF)

Applying an IRT model in an empirical population assumes that the model is valid in the same way in every sub-population. It may happen, however, that some items function differently in different sub-populations (see Holland and Wainer 1993, for an extensive discussion). Formally, item $i$ is said to show differential item functioning with respect to two populations, $P_1$ and $P_2$, say, if for some ability value $\theta$ it holds that

$$P(X_i = 1 \mid \theta, P_1) \neq P(X_i = 1 \mid \theta, P_2). \tag{9}$$

For applications in EER it is important to look for items that show DIF with respect to important variables, such as gender, SES or method of instruction (Kyriakides and Antoniou 2009). When longitudinal studies are conducted in order to measure the long-term effect of teachers and schools, it is important to look for DIF at different moments of time. If part of the test material has become known between the first and second measurement moment, these items might show DIF in favour of the second measurement moment. If this is not recognized, and the analyses are carried out as if the measurement were valid, the estimate of the trend will be biased and the long-term effect of schools may be underestimated.
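The definition in (9) can be made concrete with a small numerical sketch. It assumes the standard Rasch response function and two invented difficulty values for the same item in the two populations; the function and variable names are hypothetical and the numbers are not taken from any data set.

```python
import math

def rasch_probability(theta, beta):
    """Rasch model: probability of a correct answer for ability theta
    on an item with difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# Hypothetical difficulties of the same item in populations P1 and P2.
beta_p1, beta_p2 = 0.0, 0.5

for theta in (-1.0, 0.0, 1.0):
    prob_p1 = rasch_probability(theta, beta_p1)
    prob_p2 = rasch_probability(theta, beta_p2)
    print(f"theta={theta:+.1f}  P1={prob_p1:.3f}  P2={prob_p2:.3f}  "
          f"difference={prob_p1 - prob_p2:+.3f}")
```

At every ability value the item is easier in the first population than in the second, so students of equal ability have unequal probabilities of success, which is exactly the situation described by (9).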

The way DIF is detected in the OPLM package is fairly simple. Suppose item $i$ has been administered in two cycles of a survey. Then (implicitly) two tables such as Table 9.1 are built, and the sums of squares given in (8) for the two cycles are simply added. If the correction due to the estimation of the parameters is applied properly, then the resulting statistic is asymptotically chi-squared distributed with degrees of freedom equal to the total number of score groups (in the two cycles jointly) minus 1. In Figure 9.6 a graphical display is given of the results of such an analysis in the PISA project. The item is a mathematics item administered in the cycles of 2000 and 2003. The results apply to one of the participating countries. The $S_i$ statistic for this item is 42.92 with 14 degrees of freedom, and is highly significant.
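To see how extreme a value of 42.92 is under a chi-squared distribution with 14 degrees of freedom, the upper-tail probability can be computed directly. The sketch below uses only the standard library and relies on the closed-form expression for the chi-squared tail that holds when the number of degrees of freedom is even; the function name `chi2_sf_even_df` is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with an
    even number of degrees of freedom, using the closed form
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = total = 1.0                 # j = 0 term
    for j in range(1, df // 2):
        term *= half / j               # (x/2)^j / j!, built up recursively
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(42.92, 14)
print(f"p = {p_value:.2e}")
```

The resulting probability is well below 0.001, in line with the description of the result as highly significant.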

The horizontal axes in both panels of Figure 9.6 are to be read as ordinal axes. The symbols in the panels (crosses or bullets) indicate the proportion of correct responses in each of the score groups. The middle smooth line represents the predicted proportion (the points are connected by a smoothed line), and the two outer smoothed lines represent an approximate 95 per cent confidence envelope. If the model is true, then the observed proportions should fall (in 95 per cent of the cases) within this envelope. In both panels one can see that this is the case. On the other hand, there is a systematic difference between the two panels: in the 2000 cycle students perform better than predicted by the model, while in the 2003 cycle the performance is worse than predicted. This systematic difference is detected by the formal statistical test, which gives a highly significant result.

Figure 9.6 DIF of a PISA mathematics item (DIF is with respect to cycle). Panels: Cycle 1: 2000; Cycle 2: 2003.