
7 Concluding remarks

The overview of IRT models presented in the preceding sections may leave an impression of abundance and therefore of confusion: what should one choose in a concrete situation, and how can one decide whether the choice made is the best possible one? It may seem that choosing the most complicated model is always best, since it avoids as many pitfalls and shortcomings as possible, and that the analysis will then show 'automatically' whether a simpler model is applicable.

Such a conception, however, is mistaken for two reasons. First, working with complicated models usually causes problems in parameter estimation, which may be of a technical nature (the software 'does not find' the estimates) but also of a statistical nature: the estimates will tend to have large standard errors and in many cases be highly correlated. In practice this means that a small change in the data may lead to quite large shifts in the estimated values, which in turn may require a substantial change in the interpretation of the obtained results.
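To see concretely what 'large and highly correlated standard errors' means, consider the following minimal Python sketch. The information matrix is invented for illustration (its near-dependent rows mimic an overparameterized model); the computation itself, inverting the observed information to obtain the asymptotic covariance of the estimates, is the standard one.

```python
import numpy as np

# Hypothetical observed information matrix (the negative Hessian of the
# log-likelihood at the estimates) for three item parameters.  The strong
# dependence between the first two rows mimics an overparameterized model.
info = np.array([
    [10.0,  9.5,  2.0],
    [ 9.5, 10.0,  2.1],
    [ 2.0,  2.1,  8.0],
])

cov = np.linalg.inv(info)        # asymptotic covariance of the estimates
se = np.sqrt(np.diag(cov))       # standard errors
corr = cov / np.outer(se, se)    # correlations between estimates

print("standard errors:", np.round(se, 2))
print("correlation of estimates 1 and 2:", np.round(corr[0, 1], 2))
# The near-dependence in `info` inflates the first two standard errors
# (about 1.0 each, against 0.36 for the third) and drives their correlation
# towards -1: the two estimates trade off almost freely, so a small change
# in the data can shift both of them considerably.
```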

The second reason has to do with a more general attitude towards scientific problems: the belief that complicated mathematical models will by themselves reveal the true structure of the world. However, the real key to understanding the determinants of educational performance is the setting up of clever and well-considered research programmes, with directed and specific hypotheses that can be tested empirically. The use of IRT models in testing these hypotheses is certainly recommended, but the decision as to which kind of data to collect, and under which conditions, is a prerequisite for good research.

In the national assessment programme for basic education in the Netherlands, started in 1987 for arithmetic, it was decided from the outset that the analyses would be carried out at the level of minimal curricular units, that is, units described in the curriculum that were thought to be homogeneous enough in content, didactic approach and thinking processes that the responses to items belonging to the same unit could be described by a simple unidimensional IRT model. Such an approach, which required a careful design and a rather elaborate construction process for the item material, has proven to be useful. A trend analysis of the results of four waves of the assessment showed a rather dramatic decrease in performance on the operations of multiplication and division. It is very unlikely that such a trend would have been found if the test had consisted of a 'well-balanced' mixture of material, covering the whole curriculum but not fine-grained enough to support conclusions in any specific domain.

This implies that researchers in the area of educational effectiveness should make use of IRT models to develop psychometrically appropriate scales, but in order to do so they should take seriously the theoretical background upon which a test or an instrument measuring a specific factor has been developed. For example, the partial credit model may be found useful in analysing data emerging from a high-inference observation instrument, and thereby a valid measure of quality of teaching may emerge (Kyriakides et al. 2009). Another issue that researchers in the area of EER should take into account is that IRT can be applied in incomplete designs, which are very likely to be used for measuring student achievement in longitudinal studies. In this context, issues concerning the use of different designs in applying IRT and parameter estimation are discussed in the next chapter.
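To give the reader a concrete handle on the partial credit model referred to above, the following short Python sketch computes its category probabilities for a single item. The step parameters and the ability value are invented for illustration; the formula is the standard one from Masters (1982).

```python
import numpy as np

def pcm_probabilities(theta, deltas):
    """Category probabilities of the partial credit model (Masters 1982).

    theta  : person parameter
    deltas : step parameters delta_1 ... delta_m of one item
    Returns P(X = 0), ..., P(X = m) at the given theta.
    """
    # Cumulative sums of (theta - delta_j); score 0 has an empty sum (= 0).
    steps = np.concatenate(([0.0], np.cumsum(theta - np.asarray(deltas))))
    expo = np.exp(steps - steps.max())   # subtract max for numerical stability
    return expo / expo.sum()

# Hypothetical item with three ordered categories (two steps).
print(pcm_probabilities(theta=0.5, deltas=[-1.0, 1.2]))
```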

References

Agresti, A. (1990) Categorical data analysis, New York, NY: Wiley.
Andersen, E.B. (1977) 'Sufficient statistics and latent trait models', Psychometrika, 42(1): 69–81.
Bock, R.D. (1972) 'Estimating item parameters and latent ability when responses are scored in two or more nominal categories', Psychometrika, 37: 29–51.
Bock, R.D. and Lieberman, M. (1970) 'Fitting a response model for n dichotomously scored items', Psychometrika, 35: 179–97.
Christoffersson, A. (1975) 'Factor analysis of dichotomized variables', Psychometrika, 40: 5–22.
Fischer, G.H. (1974) Einführung in die Theorie Psychologischer Tests [Introduction to the theory of psychological tests], Bern: Huber.
Fischer, G.H. (1995) 'Derivations of the Rasch model', in G.H. Fischer and I.W. Molenaar (eds) Rasch models: Foundations, recent developments and applications, New York: Springer, pp. 39–52.
Glas, C.A.W. and Verhelst, N.D. (1995) 'Tests of fit for polytomous Rasch models', in G.H. Fischer and I.W. Molenaar (eds) Rasch models: Foundations, recent developments and applications, New York: Springer, pp. 325–52.
Guttman, L.A. (1950) 'The basis of scalogram analysis', in S.A. Stouffer, L.A. Guttman, E.A. Sachman, P.F. Lazarsfeld, S.A. Star and J.A. Clausen (eds) Measurement and prediction: Studies in social psychology in World War II, Vol. 4, Princeton, NJ: Princeton University Press.
Hendrickson, A.B. and Mislevy, R.J. (2005) 'Item response theory (IRT): Cognitive models', in B.S. Everitt and D.C. Howell (eds) Encyclopedia of statistics in behavioral science, Volume 2, Chichester: Wiley, pp. 978–82.
Kelderman, H. and Rijkes, C.P.M. (1994) 'Computing maximum likelihood estimates of loglinear IRT models from marginal sums', Psychometrika, 57: 437–50.
Kyriakides, L., Creemers, B.P.M. and Antoniou, P. (2009) 'Teacher behaviour and student outcomes: Suggestions for research on teacher training and professional development', Teaching and Teacher Education, 25(1): 12–23.
Lord, F.M. and Novick, M.R. (1968) Statistical theories of mental test scores, Reading, MA: Addison-Wesley.
Maris, E. (1995) 'Psychometric latent response models', Psychometrika, 60: 523–48.
Masters, G.N. (1982) 'A Rasch model for partial credit scoring', Psychometrika, 47: 149–74.
Muraki, E. (1992) 'A generalized partial credit model: Application of an EM algorithm', Applied Psychological Measurement, 16: 159–76.
Muthén, B.O. (1978) 'Contributions to factor analysis of dichotomous variables', Psychometrika, 43: 551–60.
Rasch, G. (1960) Probabilistic models for some intelligence and attainment tests, Copenhagen: Danish Institute for Educational Research.
Samejima, F. (1969) 'Estimation of latent ability using a pattern of graded scores', Psychometrika, Monograph Supplement, No. 17.
Samejima, F. (1972) 'A general model for free response data', Psychometrika, Monograph Supplement, No. 18.
Samejima, F. (1973) 'Homogeneous case of the continuous response model', Psychometrika, 38: 203–19.
Takane, Y. and De Leeuw, J. (1987) 'On the relationship between item response theory and factor analysis of discretized variables', Psychometrika, 52: 393–408.
van Leeuwe, J.F.J. and Roskam, E.E. (1991) 'The conjunctive item response model: A probabilistic extension of the Coombs and Kao model', Methodika, 5: 14–32.

Chapter 9

IRT models

Parameter estimation, statistical testing and application in EER

Norman Verhelst

CITO, The Netherlands

As was indicated in Chapter 8, the most important advantage of using IRT is the possibility of applying it in incomplete designs. This does not mean, however, that there are no restrictions on the test designs that can be used in connection with IRT. For this reason, the first section of this chapter is concerned with important features of designs that can be used with IRT models. The second section addresses the problem of parameter estimation. In statistical modelling, parameter estimation is in many cases technically quite involved, because it generally amounts to solving a complicated set of equations. In this section, technicalities are skipped almost entirely, because of space limits and, more importantly, because estimation procedures are usually made available in computer programs that do not require the user to understand all technical considerations. In Section 3, statistical tests are discussed, and special attention is given to the problem of power. An IRT model, considered as a complex hypothesis, may be defective in many ways, and some tests are not sensitive to specific defects. It is argued that the most important aspect of testing is the creativity to find ways in which defects may be reflected in some aspect of the data. Careful statistical testing is the key procedure for making a considered decision on accepting or rejecting an IRT model, and it can also be helpful in choosing the most appropriate IRT model and generating relevant person estimates. Finally, in the last section of this chapter, we discuss how to use the results of an IRT analysis in estimating student achievement and in searching for the impact of effectiveness factors operating at different levels.
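As a preview of the design issues treated in the first section, the sketch below encodes a small incomplete design in Python. The booklet composition and the simulated responses are purely hypothetical; the point is only that responses to items a student never saw are missing by design, a situation IRT estimation can accommodate but a plain sum score cannot.

```python
import numpy as np

# Hypothetical incomplete design: 6 items, 3 booklets.
# A 1 means the item appears in the booklet; items 3 and 4 act as
# anchors linking the booklets onto a common scale.
design = np.array([
    [1, 1, 1, 1, 0, 0],   # booklet A: items 1-4
    [0, 0, 1, 1, 1, 1],   # booklet B: items 3-6
    [1, 0, 1, 1, 0, 1],   # booklet C: items 1, 3, 4, 6
])

rng = np.random.default_rng(0)
n_students = 9
booklet = rng.integers(0, 3, size=n_students)   # booklet assigned to each student

# Simulated 0/1 responses; NaN marks 'not administered' (missing by design).
responses = np.where(design[booklet] == 1,
                     rng.integers(0, 2, size=(n_students, 6)).astype(float),
                     np.nan)
print(responses)
```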