
2.2 The Rasch model

Figure 8.2 shows a graphical display of the Rasch model (Rasch 1960). The horizontal axis represents the ability dimension, named θ, and exactly as in Guttman's model each student is represented by a point on this dimension. For each point on the line, the model expresses the probability of a correct response to each of the items in the test; so for each item there is a curve associating the ability point with the probability of a correct answer. This curve is known as the item characteristic curve, or item response curve. (In the older literature, the term 'trace line' is used as well.) Each curve is the graphical display of a function, known as the item characteristic function or item response function (IRF). The point on the θ-scale that yields a probability of 0.5 for item i is labelled by a special symbol, β_i. In a general description of the model, the exact position of this point or, equivalently, its value is not known, but has to be estimated from the data. The quantities β_i, one for each item, are referred to as the item parameters.

There are a number of features in Figure 8.2, some of which also appear in other IRT models, while others are distinctive of the Rasch model. They are discussed in turn:

• The IRFs are monotonically increasing, or in behavioural terms, the higher the ability the higher the probability of a correct response. This seems quite natural in achievement or ability testing, and all IRT models used in this area of research have this feature. In the area of attitude and preference research, single-peaked IRFs are sometimes used, that is, functions that are increasing up to some point on the scale and then decrease. Single-peaked functions also appear in the area of achievement testing, and some of these will be discussed in the next section; these functions, however, are not IRFs.

Figure 8.2 Item response functions in the Rasch model (horizontal axis: θ)

• In the model definition, the scale values are unbounded: they range from minus infinity to plus infinity. This feature is common to all IRT models.

• The probability of a correct answer on any item is different from zero and one, but as θ increases without bound, the probability of a correct answer approaches one, and as it decreases without bound, the probability approaches zero. One says that zero and one are the lower and upper asymptotes, respectively. This feature, especially the behaviour at the lower asymptote, has been a source of much criticism of the Rasch model. It will be discussed in more detail later.

• Item response curves in the Rasch model do not intersect: for all points on the ability scale it holds that item i has the highest probability of a correct response and item m the lowest one. Moreover, all curves in Figure 8.2 have exactly the same form; they differ only in location. Any curve in the figure can be shifted horizontally until it coincides completely with either of the other two.

These features jointly, however, do not define the Rasch model, as there can be many different mathematical functions that have the same features. In the Rasch model, the IRFs are defined as:

$$f_i(\theta) = \frac{\exp(\theta - \beta_i)}{1 + \exp(\theta - \beta_i)}, \qquad (1)$$

which is an increasing function of θ that depends on a single parameter β_i. (The expression exp(x) is just a convenient way to write down the exponential function e^x, where e is the base of the natural logarithms and equals approximately 2.718. Notice that e^0 = 1.) The function f_i(θ) is a conditional probability. If we denote the outcome of an answer to item i as X_i, then the meaning of the IRF becomes clear:

$$f_i(\theta) = P(X_i = 1\,|\,\theta). \qquad (2)$$

Since the outcome of an item answer is binary, yielding the value one if the answer is correct and zero otherwise, we immediately deduce from (1) and (2) that

$$P(X_i = 0\,|\,\theta) = 1 - f_i(\theta) = \frac{1}{1 + \exp(\theta - \beta_i)}. \qquad (3)$$
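As a concrete illustration of equations (1)-(3), here is a minimal Python sketch; the item parameters, ability values, and helper names are made-up choices for illustration only, not part of the model's formal development.

```python
import math

def p_correct(theta: float, beta: float) -> float:
    """Equation (1): P(X_i = 1 | theta) for an item with parameter beta."""
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))

def p_incorrect(theta: float, beta: float) -> float:
    """Equation (3): P(X_i = 0 | theta) = 1 / (1 + exp(theta - beta))."""
    return 1.0 / (1.0 + math.exp(theta - beta))

betas = [-1.0, 0.0, 1.5]           # made-up item parameters for three items

for theta in (-2.0, 0.0, 2.0):     # made-up ability values
    for beta in betas:
        # The two outcome probabilities are complementary, as (1)-(3) require.
        assert abs(p_correct(theta, beta) + p_incorrect(theta, beta) - 1.0) < 1e-12
    # A lower beta (easier item) gives a uniformly higher success probability,
    # so the IRFs are ordered at every theta and never intersect.
    probs = [p_correct(theta, b) for b in betas]
    assert probs == sorted(probs, reverse=True)

# At theta = beta_i the IRF passes through 0.5: the defining property of beta_i.
assert abs(p_correct(1.5, 1.5) - 0.5) < 1e-12
```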

There are some interesting ways to look at the IRF defined by (1).

• The curves in Figure 8.2 are very similar to the curves of the cumulative normal distribution, but they do not represent the normal distribution. They represent the cumulative logistic distribution, which resembles the normal distribution: it is symmetric, but has thicker tails. The standard logistic distribution has a mean of zero and a variance equal to π²/3. A normal distribution with a mean of zero and a standard deviation of 1.7 is very similar to the standard logistic distribution. The function in equation (1) is known by the name 'logistic function'. Its argument is the difference θ − β_i.

• Since the outcome variables X_i are binary, the expected value equals the probability that the outcome equals one:

$$E(X_i\,|\,\theta) = 0 \times P(X_i = 0\,|\,\theta) + 1 \times P(X_i = 1\,|\,\theta) = f_i(\theta).$$

This means that the IRFs are regression functions of the outcome variables X_i on the latent variable θ. The regression is not linear, as can be seen clearly from Figure 8.2.

• An interesting function is the logit function or log-odds function. It is given by

$$\ln\left[\frac{P(X_i = 1\,|\,\theta)}{P(X_i = 0\,|\,\theta)}\right] = \theta - \beta_i,$$

which is linear in θ. Generalizations of the logit function are useful to understand the structure of other, more complicated models.

• The right-hand sides of equations (1) and (3) are fractions with the same denominator, meaning that this denominator does not depend on the specific value of the outcome variable X_i. The denominator is the sum of both numerators; clearly, its function is to make sure that the probabilities of all possible outcomes sum to one. This denominator is also called the normalizing constant. Therefore we can write (1) and (3) equivalently as

$$P(X_i = 1\,|\,\theta) \propto \exp(\theta - \beta_i) \qquad (1a)$$

and

$$P(X_i = 0\,|\,\theta) \propto 1, \qquad (3a)$$

where the symbol ∝ means 'is proportional to'. The normalizing constant is one divided by the sum of the right-hand sides of (1a) and (3a). All three observations above are verified numerically in the sketch below.
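The following short, self-contained Python sketch checks the three bullet points above; all numerical values are made up, and the norm_cdf helper is an illustrative definition of the normal distribution function, not something referenced in the text.

```python
import math

theta, beta = 0.8, -0.4   # made-up ability and item parameter

p1 = math.exp(theta - beta) / (1.0 + math.exp(theta - beta))  # equation (1)
p0 = 1.0 / (1.0 + math.exp(theta - beta))                     # equation (3)

# The logit is linear in theta: ln(p1 / p0) = theta - beta.
assert abs(math.log(p1 / p0) - (theta - beta)) < 1e-12

# (1a)/(3a): dividing the unnormalized weights exp(theta - beta) and 1 by
# their sum (the reciprocal of the normalizing constant) recovers p1 and p0.
w1, w0 = math.exp(theta - beta), 1.0
assert abs(w1 / (w1 + w0) - p1) < 1e-12
assert abs(w0 / (w1 + w0) - p0) < 1e-12

# The standard logistic CDF stays within roughly 0.01 of the CDF of a normal
# distribution with mean 0 and standard deviation 1.7, as claimed above.
def norm_cdf(z: float, sd: float) -> float:
    return 0.5 * (1.0 + math.erf(z / (sd * math.sqrt(2.0))))

grid = [i / 10.0 for i in range(-60, 61)]
print(max(abs(1.0 / (1.0 + math.exp(-z)) - norm_cdf(z, 1.7)) for z in grid))
```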

Conditional independence

The Rasch model is not completely defined by its IRFs. These functions describe the marginal distribution of the outcome variables X_i, conditional on θ, but from these marginal distributions the joint distribution cannot be derived uniquely. To put it more simply: from (1) one cannot specify the probability P(X_i = 1 and X_j = 1 | θ). Therefore, something more has to be added to the model to make it fully defined. This addition has the form of an assumption that is ubiquitous in statistical modelling: the assumption of conditional independence or local stochastic independence, the term 'local' pointing to the fact that the latent variable is fixed. Let a test consist of k items, and let X = (X_1, . . ., X_k) be the vector of outcome variables, also called the response pattern. Let x = (x_1, . . ., x_k) be a realization of X, that is, x is some observable response pattern. The assumption of conditional independence states that

$$P(X = x\,|\,\theta) = \prod_{i=1}^{k} P(X_i = x_i\,|\,\theta), \qquad (4)$$

for all possible response patterns x. This assumption is analogous to the axiom of independent measurement errors in Classical Test Theory. Notice that this assumption does not say that item answers are independent and hence correlate zero; it says that item answers are independent in all populations where the latent variable θ is constant, and hence that correlations between item responses are zero in such populations. However, this also means that if, in some population, item responses do correlate, this correlation is explained (completely) by the variation in the latent variable θ. In this sense, the Rasch model is very similar to the one-factor model. The relation between factor analysis and IRT models will be discussed further in Section 6.
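A sketch of equation (4) in Python (the parameter values are made up and the helper functions are illustrative): the probability of a whole response pattern is simply the product of the item-level probabilities at a fixed θ.

```python
import math
from itertools import product

def p_item(x: int, theta: float, beta: float) -> float:
    """P(X_i = x | theta) for x in {0, 1}; combines equations (1) and (3)."""
    return math.exp(x * (theta - beta)) / (1.0 + math.exp(theta - beta))

def p_pattern(xs, theta, betas):
    """Equation (4): under conditional independence the joint probability
    of a response pattern is the product of the item probabilities."""
    p = 1.0
    for x, beta in zip(xs, betas):
        p *= p_item(x, theta, beta)
    return p

betas = [-1.0, 0.0, 1.5]    # made-up item parameters (k = 3)
theta = 0.5                 # made-up ability
print(p_pattern([1, 1, 0], theta, betas))

# Sanity check: the probabilities of all 2^k response patterns sum to one.
total = sum(p_pattern(xs, theta, betas) for xs in product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12
```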

Exponential families

Apart from conditional independence, there is another principle of independence that applies, namely, experimental independence. This principle says that test performances, given the latent abilities of a group of students, are independent of each other. As an example, consider a sample of n students, and denote the latent ability of a single student v by θ_v. The item responses of the n students to a test of k items are collected in an n × k matrix X with the rows representing students and the columns corresponding to the items. X is a multivariate random variable, which on administration of the test will take particular values or realizations. These realizations are indicated by x. The v-th row of X and x will be denoted as X_v and x_v, respectively, and individual elements as X_vi and x_vi. The principle of experimental independence states that

$$P(X = x\,|\,\theta_1, \ldots, \theta_n) = \prod_{v=1}^{n} P(X_v = x_v\,|\,\theta_v). \qquad (5)$$

Substituting the right-hand side of equation (4) into equation (5) gives as a result:

$$P(X = x\,|\,\theta_1, \ldots, \theta_n) = \prod_{v=1}^{n} \prod_{i=1}^{k} P(X_{vi} = x_{vi}\,|\,\theta_v), \qquad (6)$$

and using (1) and (3) one finds that

$$P(X = x\,|\,\theta_1, \ldots, \theta_n) = \frac{\exp\left[\sum_{v=1}^{n} \theta_v \sum_{i=1}^{k} x_{vi} - \sum_{i=1}^{k} \beta_i \sum_{v=1}^{n} x_{vi}\right]}{\prod_{v=1}^{n} \prod_{i=1}^{k} \left[1 + \exp(\theta_v - \beta_i)\right]}. \qquad (7)$$
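The algebraic step from (6) to (7) can be checked numerically. The following sketch uses made-up values for two students and two items and verifies that the elementwise product of (6) equals the closed form of (7).

```python
import math

thetas = [0.3, -0.8]     # made-up abilities (n = 2)
betas = [-0.5, 1.0]      # made-up item parameters (k = 2)
x = [[1, 0],             # a made-up observed data matrix
     [1, 1]]

# Equation (6): the product of item-level probabilities over students and items.
p6 = 1.0
for v in range(2):
    for i in range(2):
        p6 *= (math.exp(x[v][i] * (thetas[v] - betas[i]))
               / (1.0 + math.exp(thetas[v] - betas[i])))

# Equation (7): numerator built from the weighted marginal sums, denominator
# from the factors 1 + exp(theta_v - beta_i).
s = [sum(row) for row in x]                  # row sums
t = [x[0][i] + x[1][i] for i in range(2)]    # column sums
num = math.exp(sum(thetas[v] * s[v] for v in range(2))
               - sum(betas[i] * t[i] for i in range(2)))
den = math.prod(1.0 + math.exp(th - b) for th in thetas for b in betas)

assert abs(p6 - num / den) < 1e-12
print(p6)
```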

The probability of the observed data, considered as a function of the unknown quantities θ_v and β_i, is called the likelihood function. Defining

$$s_v = \sum_{i=1}^{k} x_{vi} \quad \text{and} \quad t_i = \sum_{v=1}^{n} x_{vi},$$

and taking the logarithm of (7) gives

$$\ln P(X = x\,|\,\theta_1, \ldots, \theta_n) = \sum_{v=1}^{n} s_v \theta_v + \sum_{i=1}^{k} t_i(-\beta_i) - \sum_{v=1}^{n} \sum_{i=1}^{k} \ln\left[1 + \exp(\theta_v - \beta_i)\right]. \qquad (8)$$

The right-hand side of equation (8) consists of two important parts: the two sums that contain functions of the data (s_v and t_i), and the double sum, which is independent of the data. Each term in the first two sums consists of a product, one factor being a function of the unknown quantities in the model (θ_v and −β_i) and the other factor a function of the data (s_v and t_i). Models for which the log-likelihood function can be written in this form are referred to as 'exponential family models'. Such models have attractive features that are used in the parameter estimation procedures to be discussed in the next chapter.

The quantity s_v is the row total of row v of the observed data matrix x, and t_i is the column total of the i-th column. From (8) we see that the likelihood of the observed data depends on the data only through these marginal sums or, equivalently, that under the Rasch model all observed matrices with the same marginal sums are equiprobable. This also means that anything we can learn about the latent ability of student v is contained in the row sum s_v, which is called the 'sufficient statistic' for the unknown quantity θ_v. Similarly, the column sums t_i are the sufficient statistics for the item parameters.
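The sufficiency property has a consequence that is easy to demonstrate: two different data matrices sharing the same row and column totals receive exactly the same likelihood. A sketch with made-up parameter values (the two matrices differ by a 2 × 2 swap that leaves all margins intact):

```python
import math

thetas = [0.5, -0.2, 1.1]   # made-up abilities (n = 3)
betas = [-0.7, 0.0, 0.9]    # made-up item parameters (k = 3)

def log_lik(x):
    """Logarithm of equation (7), i.e. equation (8), summed item by item."""
    return sum(x[v][i] * (thetas[v] - betas[i])
               - math.log(1.0 + math.exp(thetas[v] - betas[i]))
               for v in range(3) for i in range(3))

# Two different matrices with the same row sums (2, 1, 2) and column sums
# (2, 2, 1): xb is xa with a 2 x 2 submatrix swapped.
xa = [[1, 1, 0],
      [0, 1, 0],
      [1, 0, 1]]
xb = [[1, 0, 1],
      [0, 1, 0],
      [1, 1, 0]]

def col_sum(m, i):
    return sum(m[v][i] for v in range(3))

assert [sum(r) for r in xa] == [sum(r) for r in xb]          # same s_v
assert all(col_sum(xa, i) == col_sum(xb, i) for i in range(3))  # same t_i

# Identical sufficient statistics imply, by (8), identical likelihoods.
assert abs(log_lik(xa) - log_lik(xb)) < 1e-12
print(log_lik(xa))
```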

This section is concluded with a general consideration of the notions of independence that have been discussed so far. It may seem that the Rasch model (and, in fact, any of the other models that will be discussed subsequently) is unrealistic, as any researcher in the area of EER knows that test data may show substantial dependence, due, for example, to school or classroom effects. There is, however, no contradiction in this, because equation (7) describes a conditional probability, where the condition is the collection of latent values represented in the sample. Roughly formulated, the axiom of conditional independence means that each new item is a new opportunity to show one's ability, and that the probability of success does not depend on failures or successes on other items. The principle of experimental independence means simply that students have to work alone and independently of their classroom peers. The lack of independence often encountered in EER is due to the effect of using a sampling scheme different from simple random sampling, such as cluster sampling. In terms of equation (7), this means that the students and, consequently, their latent abilities are not independent of each other, but this dependence is a dependence in the condition of the conditional probability, not in the outcome variables. Or, to put it slightly differently, equation (7) is assumed to hold no matter how the sample of the n students has been drawn.