3.1 Non-parametric tests of the Rasch model

In exponential family models, the likelihood depends on the observed data only through the sufficient statistics (see section 2.1). This has an important implication: if the model is valid, all data sets with the same sufficient statistics are equiprobable. Take a coin-tossing experiment as an example. Suppose a coin is tossed n times and lands heads (= success) m times. The model for the outcomes is relatively simple: it states that the probability of landing heads is $\pi$ for all trials and that all outcomes are mutually independent. The likelihood of the outcomes under this model is $\pi^{m}(1-\pi)^{n-m}$, that is, the number of successes is the sufficient statistic for the parameter $\pi$. To estimate the parameter $\pi$, only the proportion of successes is used, but one can look at the internal structure of the data to judge the trustworthiness of the model. Suppose, for example, that n = 500 and m = 250. The ML estimate of $\pi$ is 250/500 = 0.5, but on closer inspection of the outcome sequence, it appears that the first 250 trials were successes and the last 250 failures. Although such a sequence is as probable as any other sequence with 250 successes, one would very likely reject the model because the sequence has too few runs. (A run is a maximal sequence of equal consecutive outcomes; in the example there are two runs.) One might question, therefore, the assumption of independence of the trial outcomes. To judge the number of runs rationally, one needs to know the distribution of the number of runs under the null hypothesis, conditional on the value of the sufficient statistic (that is, given that 250 of the 500 trials were a success). For this example, this distribution can be derived mathematically (see the discussion of the runs test in Siegel and Castellan 1988), but it can also be approximated to an arbitrary degree of accuracy by sampling a large number of sequences of 500 trials with exactly 250 successes and determining the number of runs for each sequence. The percentile rank of the empirical outcome can then be determined in this simulated distribution. If it is smaller than 2.5 or larger than 97.5, the null hypothesis (the model) is rejected; that is, the test rejects at a significance level of 5 per cent.
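The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`count_runs`, `conditional_runs_test`), the number of simulated sequences and the mid-rank convention for ties in the percentile rank are choices made here, not part of the original text. Shuffling the observed sequence yields a random sequence with exactly the same number of successes, which is what conditioning on the sufficient statistic requires.

```python
import random


def count_runs(seq):
    """Number of runs: maximal blocks of equal consecutive outcomes."""
    if not seq:
        return 0
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def conditional_runs_test(observed, n_samples=2000, seed=0):
    """Approximate the percentile rank of the observed number of runs,
    conditional on the observed number of successes.

    n_samples and seed are arbitrary illustrative defaults.
    """
    rng = random.Random(seed)
    obs_runs = count_runs(observed)
    pool = list(observed)
    simulated = []
    for _ in range(n_samples):
        rng.shuffle(pool)  # same number of successes, random order
        simulated.append(count_runs(pool))
    below = sum(r < obs_runs for r in simulated)
    ties = sum(r == obs_runs for r in simulated)
    # Mid-rank convention for ties (an assumption of this sketch).
    percentile = 100.0 * (below + 0.5 * ties) / n_samples
    return obs_runs, percentile


# The example from the text: 250 successes followed by 250 failures.
observed = [1] * 250 + [0] * 250
obs_runs, percentile = conditional_runs_test(observed)
reject = percentile < 2.5 or percentile > 97.5
print(obs_runs, percentile, reject)
```

For the sequence in the example, the observed number of runs is 2, far below what random orderings of 250 successes and 250 failures typically produce, so the percentile rank falls in the lower rejection region.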

The versatility of this approach is clear from the fact that it can be applied to statistics other than the number of runs. In fact, it can be applied to any statistic, and it is up to the imagination of the researcher to find a statistic that is indicative of some specific defect in the hypothesis. Suppose, for example, that one suspects that the value of $\pi$ has decreased systematically during the experiment. If this were true, one would expect fewer successes in the second half of the experiment than in the first half, and so a suitable statistic to test this hypothesis would be the difference in the number of successes between the first and the second half of the experiment.
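To illustrate how the same scheme accommodates an arbitrary statistic, the sketch below passes the statistic as a function. The names `half_difference` and `conditional_p_value`, the one-sided p-value (appropriate here because the suspicion is a decrease) and the example data are assumptions made for illustration only.

```python
import random


def half_difference(seq):
    """Successes in the first half minus successes in the second half."""
    half = len(seq) // 2
    return sum(seq[:half]) - sum(seq[half:])


def conditional_p_value(observed, statistic, n_samples=2000, seed=1):
    """One-sided Monte Carlo p-value of `statistic`, conditional on the
    number of successes: the share of shuffled sequences whose statistic
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    obs = statistic(observed)
    pool = list(observed)
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(pool)
        if statistic(pool) >= obs:
            hits += 1
    # Adding 1 to numerator and denominator counts the observed sequence
    # itself and avoids a p-value of exactly zero.
    return obs, (hits + 1) / (n_samples + 1)


# Hypothetical data: 150 successes in the first 250 trials, 100 in the last 250.
observed = [1] * 150 + [0] * 100 + [1] * 100 + [0] * 150
obs_diff, p_value = conditional_p_value(observed, half_difference)
print(obs_diff, p_value)
```

Any other function of the sequence can be passed in place of `half_difference`; the sampling step does not change.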

Exactly the same reasoning as in the coin tossing example may be applied to the Rasch model: the sufficient statistics for the item parameters and the latent values of the tested students are the marginal totals of the data matrix. This means that, if the Rasch model is valid, all n × k binary tables with the same marginal totals as the observed one are equiprobable, and for any statistic one can approximate the sampling distribution by drawing at random a large number of these tables and by computing the statistic on each of these. The value of the statistic in the empirical table can then be compared to the simulated distribution, that is, its p-value can be computed.
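The sufficiency of the marginal totals can be checked directly from the likelihood. The derivation below is a sketch under assumed notation, since the notation of section 2.1 is not reproduced here: $x_{vi}$ is the binary response of student $v$ to item $i$, $\theta_v$ the latent value of student $v$, $\beta_i$ the difficulty parameter of item $i$, and $r_v = \sum_i x_{vi}$ and $s_i = \sum_v x_{vi}$ the row and column totals of the n × k table.

$$
P(X = x) \;=\; \prod_{v=1}^{n}\prod_{i=1}^{k}
\frac{\exp\left[x_{vi}(\theta_v - \beta_i)\right]}{1 + \exp(\theta_v - \beta_i)}
\;=\;
\frac{\exp\left(\sum_{v=1}^{n} r_v\theta_v \;-\; \sum_{i=1}^{k} s_i\beta_i\right)}
{\prod_{v=1}^{n}\prod_{i=1}^{k}\left[1 + \exp(\theta_v - \beta_i)\right]}
$$

The right-hand side depends on the table $x$ only through the totals $r_v$ and $s_i$, so two tables with the same marginal totals have the same probability, whatever the values of the parameters.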

The important difference between the coin-tossing example and the Rasch model is that in the former it is easy to draw a random sequence of 500 outcomes with 250 successes, while drawing at random a binary table with given marginal totals is extremely difficult; in fact, no procedure to accomplish this has been found so far. Methods exist, however, for sampling in a way that gives a simulated sampling distribution that approximates the true distribution. Two classes of methods have been studied in the literature, one based on importance sampling and one based on MCMC techniques. A detailed account with references to earlier work can be found in Verhelst (2008). Applications for any statistic can be run in R (Verhelst, Hatzinger and Mair 2007). The user has to program a function in R that computes the statistic(s) of interest. Unfortunately, the sampling procedure only applies to the Rasch model in a complete design. Generalizations to incomplete designs and to exponential family models for polytomous data, such as the PCM, are still needed.
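To give an idea of what an MCMC approach can look like, the sketch below uses a simple swap chain: pick two rows and two columns at random and, if the four selected cells form a checkerboard pattern (1 0 / 0 1 or 0 1 / 1 0), flip all four cells. Such a swap leaves every row and column total unchanged. This is a generic textbook-style illustration written in Python, not the procedure described in Verhelst (2008) and not the R implementation cited above; the function names, the burn-in and thinning values, the example table and the example statistic are all assumptions, and no claim is made that these settings give adequate mixing for real data.

```python
import random


def margins(table):
    """Row totals and column totals of a binary table (list of lists)."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]


def swap_step(table, rng):
    """One step of the swap chain; the table is modified in place."""
    r1, r2 = rng.sample(range(len(table)), 2)
    c1, c2 = rng.sample(range(len(table[0])), 2)
    a, b = table[r1][c1], table[r1][c2]
    c, d = table[r2][c1], table[r2][c2]
    if a == d and b == c and a != b:  # checkerboard: a swap is possible
        for r in (r1, r2):
            for col in (c1, c2):
                table[r][col] = 1 - table[r][col]


def mcmc_p_value(observed, statistic, n_tables=500, thin=50,
                 burn_in=2000, seed=0):
    """One-sided Monte Carlo p-value of `statistic` over tables that share
    the margins of `observed` (illustrative settings only)."""
    rng = random.Random(seed)
    table = [row[:] for row in observed]
    obs = statistic(observed)
    for _ in range(burn_in):
        swap_step(table, rng)
    hits = 0
    for _ in range(n_tables):
        for _ in range(thin):
            swap_step(table, rng)
        if statistic(table) >= obs:
            hits += 1
    return obs, (hits + 1) / (n_tables + 1), table


def joint_successes(table):
    """Hypothetical statistic: students who solved both item 0 and item 1."""
    return sum(1 for row in table if row[0] == 1 and row[1] == 1)


# Hypothetical 6 x 4 data matrix (students in rows, items in columns).
observed = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
]
obs_stat, p_value, last_table = mcmc_p_value(observed, joint_successes)
print(obs_stat, p_value, margins(last_table) == margins(observed))
```

Because successive tables in the chain are dependent, the thinning interval and burn-in matter in practice; the values used here are placeholders.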