2.3 Conditional maximum likelihood (CML) estimation

When one is interested primarily in the measurement model, the abilities of the students in the sample are a kind of nuisance, and therefore they are sometimes called ‘nuisance parameters’.

Assuming a certain distribution in the population and using MML estimation is one way of getting rid of these nuisance parameters. Another way, which can only be applied with exponential family models, is conditional maximum likelihood estimation. The principle is that one maximizes the probability of the observed data (the likelihood), conditional on sufficient statistics for the nuisance parameters.

To gain an impression of how this works, an example is given for the Rasch model with k = 3 and the score s = 2. It is easy to see that for a student with ability θ, the conditional probability of obtaining the response pattern (0, 1, 1) is given by

$$
P[(0,1,1) \mid \theta, s = 2] = \frac{P[(0,1,1) \mid \theta]}{P[(0,1,1) \mid \theta] + P[(1,0,1) \mid \theta] + P[(1,1,0) \mid \theta]}, \tag{5}
$$

where the denominator is the sum over all response patterns yielding a score of 2. To obtain a less complicated expression, a simple reparameterization of the model is introduced. We define $\varepsilon_i = \exp(-\beta_i)$. Using this, one can write

$$
P[(0,1,1) \mid \theta] = \frac{\exp(2\theta)}{\prod_{i=1}^{3} [1 + \varepsilon_i \exp(\theta)]} \times \varepsilon_2 \varepsilon_3,
$$

where the right-hand side consists of a fraction multiplied by a product of item parameters. It is easy to check that the same fraction will appear in all probabilities in the denominator of (5), so that it cancels, from which one obtains immediately the very important result:

$$
P[(0,1,1) \mid \theta, s = 2] = \frac{\varepsilon_2 \varepsilon_3}{\varepsilon_2 \varepsilon_3 + \varepsilon_1 \varepsilon_3 + \varepsilon_1 \varepsilon_2}, \tag{6}
$$

where the right-hand side is independent of θ and is only a function of the item parameters. Equation (6), considered as a function of the item parameters, is called the conditional likelihood of the response pattern (0, 1, 1). The conditional likelihood of a data set is just the product of the conditional likelihoods of all response patterns. The conditional maximum likelihood (CML) estimates are the values of the parameters that maximize this product (or its logarithm).
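The independence from θ can be checked numerically. The following is a minimal sketch, assuming the usual logistic form of the Rasch model and three hypothetical item difficulties (the values in `betas` are illustrative and not taken from the text). It computes the conditional probability by brute force as in (5) and compares it with the closed form in the item parameters only.

```python
import math
from itertools import product


def rasch_pattern_prob(pattern, theta, betas):
    """Probability of a response pattern under the Rasch model."""
    prob = 1.0
    for x, beta in zip(pattern, betas):
        p_correct = 1.0 / (1.0 + math.exp(-(theta - beta)))
        prob *= p_correct if x == 1 else 1.0 - p_correct
    return prob


def conditional_prob(pattern, theta, betas):
    """P(pattern | theta, score), summing over all patterns with the same score."""
    score = sum(pattern)
    same_score = [p for p in product((0, 1), repeat=len(betas)) if sum(p) == score]
    numerator = rasch_pattern_prob(pattern, theta, betas)
    denominator = sum(rasch_pattern_prob(p, theta, betas) for p in same_score)
    return numerator / denominator


betas = [-0.5, 0.2, 1.0]  # hypothetical item difficulties (assumption)
eps = [math.exp(-b) for b in betas]
# Closed form for pattern (0, 1, 1) given s = 2: a function of item parameters only.
closed_form = eps[1] * eps[2] / (eps[1] * eps[2] + eps[0] * eps[2] + eps[0] * eps[1])

for theta in (-2.0, 0.0, 1.5):
    # The brute-force value should agree with the closed form for every theta.
    print(theta, conditional_prob((0, 1, 1), theta, betas), closed_form)
```

Whatever ability is plugged in, the two printed values should agree up to rounding error, which is the cancellation argument made above.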

This method was proposed by Rasch (1960) and must be considered a great discovery. Note that the condition in (6) is the test score, and it is only by conditioning on the test score that the conditional likelihood is independent of θ. This result can be generalized, however, to other models: in the Rasch model, this independence is obtained because the test score is the sufficient statistic for θ. The generalization then amounts to the statement that by CML one can get rid of the nuisance parameters if they have sufficient statistics and if one conditions on these statistics.
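For more than a few items, enumerating all response patterns with a given score becomes impractical. A common device, not described in the text above and introduced here as an assumption, is to compute the denominator with the elementary symmetric functions of the reparameterized item parameters exp(–β). The sketch below evaluates the conditional log-likelihood of a small data set in this way; the function names and the example data are hypothetical.

```python
import math


def elementary_symmetric(eps):
    """Return [gamma_0, ..., gamma_k], the elementary symmetric functions of eps."""
    gamma = [1.0] + [0.0] * len(eps)
    for n, e in enumerate(eps, start=1):
        # Update from high order to low order so each item enters only once.
        for s in range(n, 0, -1):
            gamma[s] += e * gamma[s - 1]
    return gamma


def conditional_log_likelihood(data, betas):
    """Sum over response patterns of log P(pattern | score) under the Rasch model."""
    eps = [math.exp(-b) for b in betas]
    gamma = elementary_symmetric(eps)
    total = 0.0
    for pattern in data:
        score = sum(pattern)
        # Log of the product of eps_i over the items answered correctly.
        log_numerator = -sum(b for x, b in zip(pattern, betas) if x == 1)
        total += log_numerator - math.log(gamma[score])
    return total


betas = [-0.5, 0.2, 1.0]  # hypothetical item difficulties (assumption)
data = [(0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]  # made-up response patterns
print(conditional_log_likelihood(data, betas))
```

Two properties are worth noting. Patterns with a score of zero or a perfect score contribute nothing, because only one pattern yields each of those scores. Adding the same constant to every item parameter leaves the value unchanged, so a maximization routine built on this function would need a normalization such as fixing the sum of the item parameters.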

Andersen (1973) has shown that CML yields consistent estimates under very broad conditions. Software that allows one to implement this method for the Rasch model includes OPLM (Verhelst, Glas and Verstralen 1994) and the eRm package in R (Mair and Hatzinger 2007). The method is easily applicable in incomplete designs (Molenaar 1995). For the model to be identified, the design must be linked.

The important theoretical advantage of using CML is that the estimates are consistent independently of the way the sample has been drawn. There is no requirement whatsoever to draw representative samples, and the method is applicable under multiple stage sampling. Application in longitudinal studies is also perfectly possible, as the only assumption that is made is that the ability of the student is constant for all the item responses he or she has given. This means that in longitudinal studies, students having taken part in the study at two or more occasions are formally treated as different students at every testing occasion. This so-called sampling independence is an important theoretical advantage that is sometimes incorrectly used. Here are two comments on this:

• It should be clear that the advantages of the CML method only apply if the model is valid; they do not follow from the mechanical application of a computational routine. The validity of the model has to be tested carefully, and one has to be careful with generalizations. Suppose an achievement test has been validated using the Rasch model in some stable setting of the educational conditions (for example, in schools of a specific local educational authority, or schools that use a specific curriculum). If the curriculum changes drastically at some point, it does not follow that the test remains valid in the same way as before the reorganization. Whether it does or does not is an (important) empirical question, and an appeal to sampling independence cannot settle it.

• The principle of sampling independence does not imply, even if the model is valid, that all samples are equally well suited for estimation purposes. The accuracy of the estimates depends on the amount of statistical information that is collected, and this in turn depends on the sample size and on the match between student ability and item difficulty. Loosely speaking, this means that one collects the maximal information on an item parameter from a student’s response if the probability of a correct response is 50 per cent. Conversely, if an item is too difficult or too easy relative to the ability of the tested student, one collects little information, and the estimates will be less accurate than with a good match between difficulty and ability.

Of all the models introduced in Chapter 8, the Rasch model and the partial credit model are the only two models where CML estimation of the item parameters is possible. The Rasch model, however, is quite strict in its assumptions, and in empirical applications the requirement of equal item discriminations is often not attained, unless the development of the test is based on the assumption that each item should be able to discriminate between students. In the 2PLM, the weighted score, with the discrimination parameters as weights, is a sufficient statistic, but to condition on it, the weighted score must be known. If we treat the discrimination parameters as if they were known then we fix their values by hypothesis. Thus, in the 2PLM specialized to this particular case, CML is possible in principle. Since of the two parameters per item, one has been fixed, there remains only one parameter to be estimated for each item; hence the name One Parameter Logistic Model (OPLM; Verhelst and Eggen 1989; Verhelst and Glas 1995). Applying the same rationale to the generalized partial credit model (GPCM) also makes CML possible.
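The role of the weighted score can be illustrated in the same way as for the Rasch model. The sketch below assumes the common 2PLM parameterization in which the logit of a correct response is the discrimination times the difference between ability and difficulty; this form, the fixed discriminations (1, 2, 3) and the item difficulties are assumptions for illustration only. With these weights the patterns (0, 0, 1) and (1, 1, 0) share the weighted score 3, and the conditional probability of either pattern given that score should not depend on ability.

```python
import math
from itertools import product


def pattern_prob(pattern, theta, betas, discriminations):
    """Pattern probability under a 2PL-type model with known discriminations."""
    prob = 1.0
    for x, beta, a in zip(pattern, betas, discriminations):
        p_correct = 1.0 / (1.0 + math.exp(-a * (theta - beta)))
        prob *= p_correct if x == 1 else 1.0 - p_correct
    return prob


def weighted_score(pattern, discriminations):
    """Sum of the discriminations of the items answered correctly."""
    return sum(a * x for a, x in zip(discriminations, pattern))


def prob_given_weighted_score(pattern, theta, betas, discriminations):
    """P(pattern | theta, weighted score), by enumeration of all patterns."""
    target = weighted_score(pattern, discriminations)
    rivals = [p for p in product((0, 1), repeat=len(betas))
              if weighted_score(p, discriminations) == target]
    numerator = pattern_prob(pattern, theta, betas, discriminations)
    denominator = sum(pattern_prob(p, theta, betas, discriminations) for p in rivals)
    return numerator / denominator


betas = [-0.5, 0.2, 1.0]     # hypothetical item difficulties (assumption)
discriminations = (1, 2, 3)  # fixed by hypothesis, integer valued (assumption)

for theta in (-2.0, 0.0, 1.5):
    # The printed value should be the same for every theta.
    print(theta, prob_given_weighted_score((0, 0, 1), theta, betas, discriminations))
```

Because the discriminations are integers, the weighted scores can be compared exactly, which keeps the enumeration free of rounding issues.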

The existence of sufficient statistics is necessary for CML to be possible, but it is not the only condition that must be satisfied, as the following example shows. Assume k = 3, and the discrimination parameters are fixed at 1, π and e, respectively. For all students having two of the three items correct, their weighted score is (1 + π), (1 + e) or (π + e), and these three numbers are different from each other. An analogous result holds for students having zero, one or all three items correct. So, there is a one-to-one relation between response patterns and weighted score, meaning that from the weighted score one can deduce with certainty the response pattern, or that it holds that P(x|s) = 1, independently of the item parameters. More generally, the sufficient statistics do not lead to a reduction of the data: they can assume as many different values as there are response patterns, and therefore the conditional likelihood function is constant and has no maximum.

To ensure that there is sufficient reduction, in the software package OPLM the discrimination parameters must be fixed at integer values in the range [1, 15]. Years of experience with the program have shown that in most cases, unique estimates of the item parameters are obtained. A general theoretical result that describes when the estimates exist or do not, however, is not available.
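The difference between the two situations can be made concrete by counting how many distinct weighted scores the possible response patterns produce. This is a minimal sketch; the function name is hypothetical, and the integer weights (1, 2, 3) are an illustrative choice within the range mentioned above, not values prescribed by OPLM.

```python
import math
from itertools import product


def count_distinct_weighted_scores(discriminations):
    """Return (number of response patterns, number of distinct weighted scores)."""
    patterns = list(product((0, 1), repeat=len(discriminations)))
    scores = {sum(a * x for a, x in zip(discriminations, p)) for p in patterns}
    return len(patterns), len(scores)


# Weights 1, pi and e: every pattern has its own weighted score, so no reduction.
print(count_distinct_weighted_scores((1, math.pi, math.e)))  # (8, 8)

# Integer weights: patterns (0, 0, 1) and (1, 1, 0) share the weighted score 3.
print(count_distinct_weighted_scores((1, 2, 3)))  # (8, 7)
```

With integer weights, several patterns can share a weighted score, so conditioning on it still leaves something to estimate. With weights such as 1, π and e, each pattern is the only one with its score and the conditional likelihood carries no information about the item parameters.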