
3.3 Three classes of statistical tests

To exemplify the three classes of model tests, we take the preceding example of DIF as a starting point. Suppose the item displayed in Figure 9.6, henceforth called the target item, is the only item for which there is DIF, and that for the other items the model as specified is valid. As the data have been analysed with OPLM, we also assume that the discrimination parameters of all items are well specified, and that for the target item the discrimination parameter is valid in both cycles. This means that in the specification of the model there was only one error: the difficulty of the target item is different in the two cycles. One could in principle cope with this situation by treating the target item conceptually as two different items, one in each cycle (with possibly two different difficulty parameters); the model would then be a correct description of reality. The model that has been applied, however, represents a restriction on the parameters of this general model, in that it requires that the difficulty parameter of these two conceptual items be equal in the two cycles. So the model as applied imposes a restriction on the parameter space of the general or encompassing model.

In statistical terms, the null hypothesis of the statistical test is the restricted model, while the encompassing model is the alternative hypothesis. There are three ways of testing such hypotheses, which are asymptotically equivalent but which imply different procedures: likelihood ratio tests, Wald-type tests and Lagrange multiplier tests. These tests are discussed below, and also in relation to Structural Equation Modelling in Chapter 12.

Likelihood ratio tests

In this class of tests, the parameters are estimated (by maximum likelihood) under both the general model and the restricted model, and the maximum values of the likelihood function under the two models are compared. The test statistic is

$$ LR = -2 \ln \frac{L_r^*}{L_g^*}, \qquad (10) $$

where the '*' indicates that the value of the likelihood function is taken at its maximum. The subscript g stands for the general model and the subscript r for the restricted model. LR is asymptotically chi-squared distributed, and the number of degrees of freedom equals the number of restrictions that were imposed to specify the restricted model.
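As a minimal numerical sketch of equation (10), the fragment below computes LR from the two maximized log-likelihoods and the corresponding p-value. The log-likelihood values are hypothetical and are only meant to illustrate the computation; SciPy is assumed to be available.

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods (illustrative values only):
# the restricted model treats the target item as identical in both cycles,
# the general model gives it a separate difficulty parameter per cycle.
loglik_restricted = -10452.7   # ln L*_r
loglik_general = -10449.3      # ln L*_g

# LR = -2 ln(L*_r / L*_g) = -2 (ln L*_r - ln L*_g)
LR = -2.0 * (loglik_restricted - loglik_general)

# One restriction was imposed (one difficulty parameter set equal across
# cycles), so the reference distribution is chi-squared with 1 df.
df = 1
p_value = chi2.sf(LR, df)

print(f"LR = {LR:.2f}, df = {df}, p = {p_value:.4f}")
```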

In the example of the DIF item, this would mean that we would have to estimate the parameters twice: once in a model where the target item is treated as identical in the two cycles and once where it is treated as two different items. The LR-test would give a test statistic with one degree of freedom. However, if this procedure has to be applied for each item, then the number of estimation procedures would be one plus the number of common items in both cycles.

Wald-type tests

To apply this class of statistical tests, the parameters of the model have to be estimated under the general model. In the DIF example this means that the difficulty parameter of the target item has to be estimated as a different parameter in each of the two cycles. Denote these parameters as $\beta_{i1}$ and $\beta_{i2}$ respectively. Then the restricted model, which is the null hypothesis, states that

$$ H_0:\; \beta_{i1} - \beta_{i2} = 0, \qquad (11) $$

and if this hypothesis is true, it may be expected that the estimates of the two parameters are reasonably close to each other. The test statistic is just the squared difference between the two estimates divided by the (estimated) variance of that difference, that is:

$$ W_i = \frac{(\hat{\beta}_{i1} - \hat{\beta}_{i2})^2}{SE^2(\hat{\beta}_{i1}) + SE^2(\hat{\beta}_{i2}) - 2\,\mathrm{Cov}(\hat{\beta}_{i1}, \hat{\beta}_{i2})}. \qquad (12) $$

The denominator of equation (12) makes clear that the estimates of the two parameters are in general correlated, and that one has to take the covariance of the estimates into account when computing the test statistic. $W_i$ is asymptotically chi-squared distributed with one degree of freedom.
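The sketch below evaluates equation (12) for a single target item, using hypothetical estimates, standard errors and a hypothetical covariance; it simply shows how the pieces of the formula fit together.

```python
from scipy.stats import chi2

# Hypothetical estimates for the target item under the general model
# (the values are illustrative, not taken from the text).
beta_hat_1, se_1 = 0.35, 0.08   # cycle 1: estimate and standard error
beta_hat_2, se_2 = 0.52, 0.09   # cycle 2: estimate and standard error
cov_12 = 0.001                  # covariance of the two estimates

# Wald statistic of equation (12): squared difference divided by the
# (estimated) variance of the difference.
var_diff = se_1**2 + se_2**2 - 2 * cov_12
W_i = (beta_hat_1 - beta_hat_2)**2 / var_diff

p_value = chi2.sf(W_i, df=1)
print(f"W_i = {W_i:.3f}, p = {p_value:.4f}")
```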

This example is a bit artificial because it reflects a procedure where one wants to test DIF only for a single item, while in the construction of a measurement model one would usually want to investigate DIF for all items. In such a case the general model states that in the two cycles all item parameters could possibly have different values, and estimating the difficulty parameters under this model amounts to estimating the parameters separately from the data of the two cycles. One can then test the null hypothesis (11) for each item in turn, and the test statistic is still given by (12), but in this case the covariance term vanishes because the item parameters have been estimated from independent samples. It is also possible to test all these hypotheses jointly. The test statistic in this case is

$$ W = (\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_2)'\,(\Sigma_1 + \Sigma_2)^{-1}\,(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}_2), \qquad (13) $$

where $\hat{\boldsymbol{\beta}}_j$ (j = 1, 2) denotes the vector of parameter estimates in cycle j and $\Sigma_j$ the (estimated) variance-covariance matrix of these estimates. The test statistic is asymptotically chi-squared distributed, and the degrees of freedom are equal to the number of restrictions implied by the null hypothesis.
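As a small sketch of how (13) could be evaluated, the fragment below computes the joint Wald statistic for hypothetical vectors of common-item estimates and hypothetical covariance matrices. It sets aside the normalization issue discussed in the next paragraph and simply counts one restriction per common item.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical difficulty estimates for three common items (illustrative only).
beta_1 = np.array([0.10, -0.40, 0.65])      # cycle 1
beta_2 = np.array([0.18, -0.35, 0.60])      # cycle 2
Sigma_1 = np.diag([0.006, 0.005, 0.007])    # covariance matrix, cycle 1
Sigma_2 = np.diag([0.007, 0.006, 0.008])    # covariance matrix, cycle 2

# Equation (13): the cycles are independent samples, so the covariance of
# the difference is Sigma_1 + Sigma_2.
diff = beta_1 - beta_2
W = diff @ np.linalg.solve(Sigma_1 + Sigma_2, diff)

df = len(diff)                              # one restriction per common item
print(f"W = {W:.3f}, df = {df}, p = {chi2.sf(W, df):.4f}")
```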

To make results of surveys comparable across cycles, the tests administered in both cycles must have some items in common, but usually they also contain unique material. Of course, the W-statistic to detect DIF can only be applied to the common items; suppose there are m of them. Furthermore, assume that the parameters have been estimated separately for the two cycles. This means, however, that the normalization is free in each of the two estimation procedures, and one can always choose two normalizations such that the W-statistic takes an arbitrarily large value, for example, by setting the average of the common-item parameters in the first cycle equal to zero and in the second cycle to an arbitrary non-zero value. Therefore one must choose a normalization such that the estimates are meaningfully comparable across the two cycles. A good way of accomplishing this is to make the sum of the parameters of the common items in both cycles equal to each other. The number of degrees of freedom for the test is then m − 1.
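The following sketch illustrates one way this normalization could be carried out in practice: the common-item estimates of each cycle are centred so that their sums are equal (both zero), and the Wald statistic (13) is then evaluated with a pseudo-inverse, because the centred difference vector has a singular covariance matrix. The centring map, the pseudo-inverse step and the function name are implementation choices of this sketch, not prescriptions from the text; the degrees of freedom are m − 1, as stated above.

```python
import numpy as np
from scipy.stats import chi2

def wald_dif_common_items(beta_1, Sigma_1, beta_2, Sigma_2):
    """DIF test over m common items estimated separately in two cycles.

    Both sets of estimates are re-normalized so that the sum of the
    common-item parameters is zero in each cycle, which makes them
    comparable; the resulting difference vector has a singular covariance
    matrix, so a pseudo-inverse is used and the test has m - 1 degrees
    of freedom.
    """
    m = len(beta_1)
    A = np.eye(m) - np.ones((m, m)) / m          # sum-to-zero centring map

    diff = A @ beta_1 - A @ beta_2               # normalized difference
    V = A @ (Sigma_1 + Sigma_2) @ A.T            # its (singular) covariance

    W = diff @ np.linalg.pinv(V) @ diff          # cf. equation (13)
    df = m - 1
    return W, df, chi2.sf(W, df)

# Example call with the hypothetical values used in the previous sketch:
W, df, p = wald_dif_common_items(
    np.array([0.10, -0.40, 0.65]), np.diag([0.006, 0.005, 0.007]),
    np.array([0.18, -0.35, 0.60]), np.diag([0.007, 0.006, 0.008]))
print(f"W = {W:.3f}, df = {df}, p = {p:.4f}")
```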

Lagrange multiplier tests

In this class of tests, parameters are estimated only under the restricted model (that is, assuming that the null hypothesis is true). In the DIF example, this means that item parameters are estimated jointly from the data of the two cycles. The idea behind the test procedure is that at the maximum of the likelihood function, the change of the function with respect to the unrestricted parameters will be small, and hence the partial derivatives of the (log-)likelihood function with respect to the unrestricted parameters will be close to zero. It has been shown that the Pearson-like tests (with a proper correction for the fact that the parameters are estimated from the data) are test procedures of this class. The advantage of the Pearson-like approach, however, is that one does not need to write down the likelihood function of the general model explicitly; it suffices to specify one or more contrast vectors. In the case of the DIF example, this amounts to specifying the target item and indicating for each observed response pattern in which cycle it has been observed. A more complicated example, testing the unidimensionality assumption in the Rasch model, is discussed in detail in Verhelst (2001).
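To make the general idea concrete, the sketch below implements the textbook form of the Lagrange multiplier (score) statistic, not the Pearson-like contrast-vector procedure referred to above. It assumes that the gradient of the general-model log-likelihood and its information matrix, both evaluated at the restricted estimates, are available; the numbers in the example call are purely illustrative.

```python
import numpy as np
from scipy.stats import chi2

def lm_test(score, information, n_restrictions):
    """Generic Lagrange multiplier (score) test.

    `score`: gradient of the general-model log-likelihood, evaluated at the
    restricted (null-hypothesis) estimates; components belonging to freely
    estimated parameters are zero at that point, so only the restricted
    directions contribute.  `information`: the (estimated) information
    matrix of the general model at the same point.  Under the null
    hypothesis the statistic is asymptotically chi-squared with as many
    degrees of freedom as there are restrictions.
    """
    LM = score @ np.linalg.solve(information, score)
    return LM, chi2.sf(LM, n_restrictions)

# Hypothetical DIF example with one restriction (illustrative numbers only):
# parameter order = (common difficulty, cycle-2 deviation of the target item).
score = np.array([0.0, 1.8])                 # zero for the free parameter
information = np.array([[4.0, 1.0],
                        [1.0, 2.5]])
LM, p = lm_test(score, information, n_restrictions=1)
print(f"LM = {LM:.3f}, p = {p:.4f}")
```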