3.4 Informal procedures

Every statistical model, no matter how complicated it may be, is a simplification of reality, and therefore it cannot be the ‘true’ model. This implies that if one uses tests with enough power, for example by using huge samples, these tests will eventually all lead to significant results, and, reasoning in a purely formal way, one cannot but reject the model. The search for the true model is futile, and a much more productive approach is to search for a useful model – that is, a model that represents (and can reproduce) important characteristics of the real world, where ‘important’ is always to be understood as important to one’s purposes. The model of the sea level as a flat surface is useful for geography, but it will not be of any use for a shipwrecked person fighting to survive in a storm.

Therefore, a far more constructive attitude towards statistical models than the purely formal, binary (accept or reject) attitude of statistical testing is to try to judge whether the model is a reasonably good approximation to reality. There is a lot one can do to judge this reasonableness, and the actions one can take could be summarized under the name ‘give your model a chance’. We discuss some examples below.

Suppose that for some research purpose one has administered a test of 40 arithmetic items to a sample of young students. The general assumption is that the scores obtained on this test will reflect the mathematics ability of the students, and a more fine-grained assumption is, for example, that the Rasch model (or any other model, for that matter) might be well suited to describe the empirical data. There are a number of things that one could (and should) do before starting the IRT analysis. Three suggestions follow.

Inspect the histogram of the score distribution. An unexpectedly high frequency of zero (or very low) scores may point to students who did not make a serious attempt at the test.
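As a minimal sketch of this check, assume the responses are stored as one list of 0/1 item scores per student. This layout, the function names `score_histogram` and `print_histogram`, and the toy data are illustrative assumptions, not part of the study described here. The score frequencies could then be tabulated like this:

```python
from collections import Counter


def score_histogram(responses):
    """Return the frequency of each total score, from 0 up to the number of items.

    `responses` is a list with one list of 0/1 item scores per student
    (a hypothetical data layout chosen for this sketch).
    """
    n_items = len(responses[0])
    counts = Counter(sum(row) for row in responses)
    # A dense list keeps scores with zero frequency visible as well.
    return [counts.get(score, 0) for score in range(n_items + 1)]


def print_histogram(freqs):
    """Print a crude text histogram, one line per total score."""
    for score, freq in enumerate(freqs):
        print(f"{score:3d} | {'#' * freq} ({freq})")


# Toy data: 8 students, 5 items; three students have no correct answer at all.
toy_responses = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
]

freqs = score_histogram(toy_responses)
print_histogram(freqs)
```

In the toy data the spike at score zero stands apart from the rest of the distribution, which is the kind of pattern that would deserve a closer look before any IRT analysis.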

Be sure to have a reasonable prior estimate of the difficulty of the items. If an item judged to be relatively easy by the test constructor turns out to be very difficult empirically, this may point to very practical problems such as an error in the key for multiple choice items or the effect of time pressure.
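One way to make this comparison routine is sketched below. It assumes that the test constructor's judgment is expressed as an expected proportion correct per item. The names `proportion_correct` and `flag_surprising_items`, the tolerance of 0.30, and the toy data are all hypothetical choices for illustration.

```python
def proportion_correct(responses):
    """Observed proportion of correct answers per item (0/1 scored items)."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_students for i in range(n_items)]


def flag_surprising_items(prior_p, observed_p, tolerance=0.30):
    """Return the indices of items whose observed proportion correct differs
    from the prior estimate by more than `tolerance` (an arbitrary threshold)."""
    return [
        i
        for i, (prior, observed) in enumerate(zip(prior_p, observed_p))
        if abs(prior - observed) > tolerance
    ]


# Toy data: 5 students, 3 items. Item 1 was judged easy by the constructor.
toy_responses = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
prior_p = [0.8, 0.9, 0.5]  # hypothetical prior estimates of proportion correct

observed_p = proportion_correct(toy_responses)
flagged = flag_surprising_items(prior_p, observed_p)
print("observed proportions correct:", observed_p)
print("items to inspect (key error? time pressure?):", flagged)
```

A flagged item is not evidence of a faulty item in itself; it is only a prompt to check the key, the item position in the booklet, and similar practical matters.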

If data are collected through two-stage sampling (first schools, then students within schools), something might have gone wrong in a particular school (testing time too short, misunderstanding of the instructions, and so on). An efficient way to find out if such systematic errors have occurred is to run an analysis with an overparameterized model that makes very weak assumptions about the data. A good candidate is homogeneity analysis (Gifi 1990; Michailidis and De Leeuw 1998). In this analysis, the data are considered as nominal variables. The outcome of the analysis represents students as well as item categories as points in a Euclidean space of low dimensionality. The point representing a student is the midpoint (centre of gravity) of all the category points that represent the student's response pattern. Schools can be represented as the midpoint of all the student points of the students belonging to the same school. If the analysis is done in two or three dimensions, a graphical representation can be constructed where each school is represented as a single point, and outlying schools are easily detected.
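The geometry described above can be sketched in a few lines, assuming NumPy is available. The solution is computed here as a correspondence analysis of the indicator matrix, which is one standard way to obtain a homogeneity analysis solution; normalization conventions differ between programs, so this is not a reproduction of any particular package. The function names, the simulated data, and the code 9 for 'not reached' are hypothetical.

```python
import numpy as np


def homogeneity_analysis(data, n_dims=2):
    """Sketch of homogeneity analysis for nominal data.

    data: (n_students, n_items) array of category codes.
    Returns student points, category points and the indicator matrix G.
    """
    data = np.asarray(data)
    n_students, n_items = data.shape
    # Indicator matrix: one 0/1 column per observed (item, category) pair.
    blocks = []
    for j in range(n_items):
        categories = np.unique(data[:, j])
        blocks.append((data[:, [j]] == categories[None, :]).astype(float))
    G = np.hstack(blocks)

    P = G / G.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row (student) masses
    c = P.sum(axis=0)                    # column (category) masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    # Category points in standard coordinates, student points in principal
    # coordinates: each student is then the centroid of the chosen categories.
    category_points = (Vt.T / np.sqrt(c)[:, None])[:, :n_dims]
    student_points = (U * sv / np.sqrt(r)[:, None])[:, :n_dims]
    return student_points, category_points, G


def school_centroids(student_points, school_ids):
    """Represent each school by the midpoint of its students' points."""
    school_ids = np.asarray(school_ids)
    return {
        s.item(): student_points[school_ids == s].mean(axis=0)
        for s in np.unique(school_ids)
    }


# Simulated data: 3 schools, 20 students each, 6 binary items.
rng = np.random.default_rng(0)
n_per_school, n_items = 20, 6
schools = np.repeat(["A", "B", "C"], n_per_school)
data = (rng.random((3 * n_per_school, n_items)) < 0.7).astype(int)
# Hypothetical problem in school C: the last three items were not reached.
data[schools == "C", 3:] = 9

students, categories, G = homogeneity_analysis(data)
centroids = school_centroids(students, schools)
for school, point in centroids.items():
    distance = float(np.linalg.norm(point))
    print(school, np.round(point, 2), "distance from origin:", round(distance, 2))
```

Plotting the school points, or simply comparing their distances from the origin, is then a quick way to see whether one school behaves differently from the others. The sketch treats the 'not reached' code as just another nominal category, in line with the weak assumptions of the method.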

An important assumption of the IRT models discussed in this chapter is unidimensionality. One can apply a formal test of this assumption, as was mentioned in Section 3.3, but a simple Exploratory Factor Analysis (EFA) may be of equal use. In Figure 9.7, the factor pattern resulting from a factor analysis with two factors is displayed graphically. The data are the responses of 1,332 Hungarian students on a reading and listening test for English (the author is indebted to Euro Examinations in Budapest and, especially, to Zoltán Lukacsi for the permission to use the language test data for illustrative purposes). The reading part and the listening part both consist of 25 binary items. In the graph, the items are not identified, only the skill they belong to is indicated: R for reading and L for listening. It can be clearly seen that the vertical axis (the second factor) distinguishes between these two skills, and therefore it might be wiser to consider the two skills as representing two different abilities, rather than to treat them as representing the same ability.

Figure 9.7 Factor pattern of reading and listening items

Some comments are in order when using factor analysis on binary data:

• It is highly advisable to use tetrachoric correlations instead of Pearson product-moment correlations, as the latter tend to produce more factors that are barely interpretable. For partial credit items, polychoric correlations are the preferred ones.

• The matrix of tetrachoric (or polychoric) correlations computed from finite data sets is often not positive semidefinite (psd) and therefore cannot be used as input for factor analytic procedures that presuppose a psd correlation matrix, such as maximum likelihood factor analysis. To cope with such a situation one can follow two different strategies: either the computed correlation matrix is replaced by a similar one that is psd (Knol and Ten Berge 1989), or a factor analytic procedure is chosen that does not require a psd matrix as input. Good and easily available techniques include principal factor analysis (Harman 1960) and the minimizing residuals (MINRES) method (Harman and Jones 1966). If only exploratory analyses are done, using techniques that do not require a psd input matrix is the easiest way. For confirmatory analyses, the requirement of a psd matrix is unavoidable.
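Whether a computed correlation matrix is psd can be checked from its eigenvalues. The sketch below, which assumes NumPy, also shows a simple repair by eigenvalue clipping followed by rescaling to a unit diagonal. This is a deliberately simpler technique than the least-squares approximation of Knol and Ten Berge (1989) and is shown only to illustrate the first strategy; the function names, the floor value, and the example matrix are hypothetical.

```python
import numpy as np


def is_psd(R, tol=1e-10):
    """True if the symmetric matrix R has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh(R).min() >= -tol)


def clip_to_psd(R, floor=1e-6):
    """Replace eigenvalues below `floor` by `floor`, then rescale the result
    so that the diagonal is 1 again (simple spectral clipping, not the
    Knol and Ten Berge least-squares method)."""
    eigvals, eigvecs = np.linalg.eigh(R)
    clipped = np.maximum(eigvals, floor)
    A = (eigvecs * clipped) @ eigvecs.T
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)


# An artificial 'correlation' matrix that is not psd: its eigenvalues are
# 1.9, 1.9 and -0.8.
R = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])

print("psd before repair:", is_psd(R))
R_fixed = clip_to_psd(R)
print("psd after repair: ", is_psd(R_fixed))
print(np.round(R_fixed, 3))
```

The repaired matrix differs from the original, so it is worth inspecting how much the individual correlations have changed before using it as input for a confirmatory analysis.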