Incomplete designs

The basic idea of using IRT models where not all students take the same test is that two students are only comparable if the tests they took have something in common or, exchanging the roles of students and items, that two items are only comparable if there are at least some students who have taken both items. Graphically, this amounts to a very simple design requirement, which is exemplified in Figure 9.1 for two groups of students. The shaded cells represent the sets of items administered to each group.


Figure 9.1 Simple two-group designs (left panel: non-linked design; right panel: linked design)

In the left-hand panel, there are two tests, coinciding with two disjoint sets of items. In this design it seems impossible to compare the two groups of students, since they have no items in common, and equally impossible to compare items from set 1 with items from set 2, because they have no students in common. Such a design is said to be not linked. In the right-hand panel, the items are partitioned into three sets; the first test contains sets 1 and 2 and the second test sets 2 and 3, so that set 2 is common to both tests and comparisons are possible. Notice that in this design, items of set 1 can be compared to items of set 3, although only indirectly, because items in these two sets are each comparable to the items of the common set 2.

This kind of indirect comparability is used to define linked designs for an arbitrary number of sets of items and an arbitrary number of groups of students. Formally, any two items, say i and m, are linked if there exists a chain of items (i, j, g, . . ., h, m) such that each adjacent pair (i, j), (j, g), . . ., (h, m) of items in the chain has been administered to at least one common student. The design is said to be linked if all pairs of items are linked in this way.
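To make the chain definition concrete, the following minimal sketch (not taken from the chapter) treats items as nodes of a graph and connects two items whenever at least one group of respondents has answered both; the design is linked exactly when this graph is connected. The group and item labels are hypothetical, and groups stand in for students who all take the same test form.

```python
from itertools import combinations


def is_linked(design):
    """Return True if every pair of items is connected through a chain of
    items in which each adjacent pair shares at least one group of respondents."""
    # Build an undirected graph: items are nodes, and two items are adjacent
    # when some group has answered both of them.
    items = set().union(*design.values())
    adjacency = {item: set() for item in items}
    for answered in design.values():
        for i, j in combinations(answered, 2):
            adjacency[i].add(j)
            adjacency[j].add(i)
    # Depth-first search from an arbitrary item; the design is linked
    # exactly when the search reaches every item.
    start = next(iter(items))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == items


# The two designs of Figure 9.1, with hypothetical item labels: s1_1 and
# s1_2 form set 1, and so on.
non_linked = {"group 1": {"s1_1", "s1_2"}, "group 2": {"s2_1", "s2_2"}}
linked = {"group 1": {"s1_1", "s1_2", "s2_1"}, "group 2": {"s2_1", "s3_1"}}
print(is_linked(non_linked))  # False: sets 1 and 2 share no respondents
print(is_linked(linked))      # True: set 2 links sets 1 and 3 indirectly
```

The same check can be reused for the larger designs discussed below by relabelling the keys and values accordingly.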

There exist a number of frequently used designs, which will be discussed briefly. Figure 9.2 contains an anchor test design and Figure 9.3 displays two block-interlaced anchoring designs. The sets of items referred to in Figure 9.1 are usually called ‘blocks’. In the anchor test design one block of items is common to all groups of students; this block is the anchor test. In the interlaced design, a kind of chain of overlapping blocks is constructed. Notice that the resulting pattern resembles a staircase, but an important feature is given by the shaded blocks in the bottom left of the figure, which close the chain by linking the last groups back to the first blocks.

In the anchor test design with m groups of students, there are m + 1 blocks: one block is common to all groups and every other block is unique to one group. So there are m mutually disjoint groups of students. In the interlaced case, all blocks are administered to an equal number of groups and, if there are m blocks, the total number of groups required is m. In these designs, it is not required that blocks contain the same number of items.
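As an illustration (not taken from the chapter), both design types can be written down as mappings from a group index to the set of block indices administered to that group. The function names, the zero-based indexing and the reading of Figure 9.3 as groups taking two or three consecutive blocks with wrap-around are assumptions made for this sketch.

```python
def anchor_test_design(m):
    """Anchor test design for m groups: block 0 is the anchor taken by all
    groups, and blocks 1..m are each unique to one group (m + 1 blocks)."""
    return {group: {0, group + 1} for group in range(m)}


def interlaced_design(m, blocks_per_group=2):
    """Block-interlaced design with m blocks and m groups: group g takes
    blocks_per_group consecutive blocks starting at block g, wrapping around."""
    return {
        group: {(group + k) % m for k in range(blocks_per_group)}
        for group in range(m)
    }


print(anchor_test_design(4))    # e.g. group 0 takes blocks {0, 1}
print(interlaced_design(5))     # a reading of the left-hand panel of Figure 9.3
print(interlaced_design(5, 3))  # a reading of the right-hand panel of Figure 9.3
```

Both dictionaries have the same shape as the argument of the is_linked sketch above, so the linkage of these designs can be checked in the same way.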

Using a linked design is by far the most important consideration in choosing a test design, as parameter estimation is hard or even impossible with non-linked designs. However, there are more aspects of test designs worth considering:


Figure 9.2 Anchor test design

Figure 9.3 Two block-interlaced anchoring designs (blocks B1–B5)

• How are individual students assigned to one of the five groups (see Figure 9.3)? By far the safest procedure is to use random assignment, although this is not always possible because of practical constraints – for example, if a whole class has to take the same test because the items are read aloud by the teacher.

• An implicit assumption in the application of IRT models is that the latent ability of individual students remains constant during the test administration, but in practice effects such as fatigue and boredom may lead to violations of this assumption. Therefore, it is recommended to control for sequential effects by making sure that the same block of items does not always appear at the beginning or at the end of the tests in which it appears.

• Linking is not merely an all-or-none feature. Links can be strong or weak. In the left-hand panel of Figure 9.3, each of the five blocks is linked directly to two other blocks and not linked (directly) to the remaining two blocks, while in the right-hand panel all pairs of blocks are linked directly, although not an equal number of times. For example, the pair (B1, B2) occurs twice (in Groups 1 and 5), while the pair (B1, B3) occurs only once (in Group 1).

• Incomplete designs in which all blocks appear an equal number of times in each sequential position and in which all pairs of blocks appear an equal number of times are called balanced incomplete block (BIB) designs; a sketch of a balance check is given after this list. For details on such designs, see Cochran and Cox (1957).
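The following minimal sketch (not from the chapter) checks the two balance properties just mentioned for a set of booklets, each given as an ordered tuple of block labels. The booklets shown, a classical arrangement of seven blocks in triples combined with cyclic rotations of each triple, are a hypothetical example.

```python
from collections import Counter
from itertools import combinations


def balance_counts(booklets):
    """Count how often each block occupies each sequential position and how
    often each unordered pair of blocks appears together in a booklet."""
    position_counts, pair_counts = Counter(), Counter()
    for booklet in booklets:
        for position, block in enumerate(booklet):
            position_counts[(block, position)] += 1
        for pair in combinations(sorted(booklet), 2):
            pair_counts[pair] += 1
    return position_counts, pair_counts


def is_balanced(booklets):
    """Balanced in the BIB sense used here: all observed position counts are
    equal and all observed pair counts are equal. Counts of zero for
    combinations that never occur are not detected by this simple check."""
    position_counts, pair_counts = balance_counts(booklets)
    return (len(set(position_counts.values())) == 1
            and len(set(pair_counts.values())) == 1)


def rotations(booklet):
    """All cyclic rotations of a booklet, so that each of its blocks occupies
    each sequential position exactly once across the rotated copies."""
    k = len(booklet)
    return [tuple(booklet[(i + r) % k] for i in range(k)) for r in range(k)]


# Seven base booklets of three blocks each in which every pair of the blocks
# B1..B7 occurs exactly once; rotating each booklet also balances the positions.
base = [("B1", "B2", "B3"), ("B1", "B4", "B5"), ("B1", "B6", "B7"),
        ("B2", "B4", "B6"), ("B2", "B5", "B7"), ("B3", "B4", "B7"),
        ("B3", "B5", "B6")]
booklets = [rotated for booklet in base for rotated in rotations(booklet)]
print(is_balanced(base))      # False: pairs are balanced, positions are not
print(is_balanced(booklets))  # True: every pair and every position three times
```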

To take full advantage of the features of BIB designs, students must be allocated randomly to the test forms, and such a random allocation is usually assumed in the other designs as well. In experimental studies where random allocation is feasible, BIB designs are optimal. In developmental studies, however, they are typically not suitable, and neither are the other designs discussed so far.

In the student monitoring system developed by the National Institute for Educational Measurement (CITO) in the Netherlands, performances in a certain domain (such as reading comprehension or mathematics) are scaled so as to be comparable for the whole period of basic education (running from six to twelve years of age). In a calibration study using item material that encompasses the curriculum of six grades of formal instruction, none of the designs discussed so far would be realistic, because the material developed for the higher grades is inaccessible to students in the lower grades, and the material developed for the lower grades will in many cases be trivial for students in the higher grades. To put it differently, in developmental studies the content of the test forms has to correspond quite accurately to the implemented curriculum. So the general form of a design that can be applied in such cases is something similar to the design presented in Figure 9.4. This approach has also been used in studies measuring the effect of schooling by collecting data from different age groups of students (Kyriakides and Luyten 2009).

Figure 9.4 Test design for developmental studies

Although the design displayed in Figure 9.4 looks similar to the block-interlaced anchoring design of Figure 9.3 (left-hand panel), there are three important differences:

• The groups of students in the present design are not statistically equivalent; on the contrary, they are selected to be homogeneous with respect to the concept to be measured. The label ‘grade’ is just a reminder that such a selection is in operation.

• The construction of the blocks of items is restricted: for example, it is assumed that blocks 1 and 2 are suitable for grade 1 students. In general, the blocks will be ordered roughly in terms of the difficulty of the items they contain.

• As a natural consequence of the two preceding features, there will be no shaded cell in the bottom-left corner of the design table (see the sketch following this list). In a statistical sense this makes the design less stable than the interlaced design, but this is a restriction one cannot escape when applying IRT in developmental studies.
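A minimal sketch of this staircase pattern, assuming that in Figure 9.4 grade g takes blocks g and g + 1; this reading is consistent with the next paragraph, where one block is shared by two consecutive grades, but the exact allocation in the figure remains an assumption.

```python
def developmental_design(n_grades):
    """Staircase without wrap-around: grade g (1-based) takes blocks g and
    g + 1, so n_grades grades use n_grades + 1 blocks and consecutive grades
    share exactly one block."""
    return {f"grade {g}": {f"B{g}", f"B{g + 1}"} for g in range(1, n_grades + 1)}


design = developmental_design(4)   # grades 1..4, blocks B1..B5 as in Figure 9.4
print(design)
# Consecutive grades overlap in one block, which chains B1 through B5, but no
# block is shared by grade 1 and grade 4: there is no bottom-left cell.
for g in range(1, 4):
    print(f"grades {g} and {g + 1} share", design[f"grade {g}"] & design[f"grade {g + 1}"])
```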

A further complication arises when the study is longitudinal, so that the same cohort of students is followed for a number of years. Applying the design of Figure 9.4 as it stands may then cause unwanted effects. If the sample of students consists of the same people in all four grades, one block of items will be administered twice to the same students in two consecutive years, and the difference between performances may be attributed either to growth in ability or to memory effects; these two causes are confounded. To avoid such a situation, a more complicated design is needed. The blocks B1 to B5 referred to in Figure 9.4 are then conceived of as composed of smaller blocks, and the design has to ensure that no student gets the same block in two consecutive years. A small example is given in Figure 9.5. Here it is assumed that students belonging to the group ‘grade 1(a)’ belong to the group ‘grade 2(a)’ the next year, and similarly for the (b) students. The design in Figure 9.5 is linked, and no student sees the same items twice.

Figure 9.5 A design suited to longitudinal studies (blocks B1, B2(1), B2(2), B3; groups Grade 1(a), Grade 1(b), Grade 2(a), Grade 2(b))
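The following minimal sketch shows one allocation that is consistent with the description of Figure 9.5; the exact shaded cells of the figure are not reproduced here, so the particular assignment of B2(1) and B2(2) to the (a) and (b) cohorts is an assumption made for the example.

```python
# Blocks administered to each cohort in each year; cohort 'a' is the group
# labelled grade 1(a) in year 1 and grade 2(a) in year 2, and likewise for 'b'.
year_1 = {"a": {"B1", "B2(1)"}, "b": {"B1", "B2(2)"}}
year_2 = {"a": {"B2(2)", "B3"}, "b": {"B2(1)", "B3"}}

# Check 1: no cohort is given the same block in two consecutive years.
for cohort in ("a", "b"):
    repeated = year_1[cohort] & year_2[cohort]
    print(f"cohort {cohort}: repeated blocks =", repeated or "none")

# Check 2: the pooled group-year design is linked. B1 and B2(1) share grade
# 1(a), B2(1) and B3 share grade 2(b), B3 and B2(2) share grade 2(a), and
# B2(2) and B1 share grade 1(b); the pooled dictionary below has the same
# shape as the argument of the is_linked sketch given earlier.
pooled = {("grade 1", cohort): blocks for cohort, blocks in year_1.items()}
pooled.update({("grade 2", cohort): blocks for cohort, blocks in year_2.items()})
print(pooled)
```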

The design in Figure 9.5 is used in longitudinal studies measuring the short- and long-term effect of schools (Kyriakides and Creemers 2008).

Incomplete designs and missing observations

The theory of parameter estimation, to be discussed in the next section, is in general easily adapted to incomplete designs, making the comparison of test performances possible even in longitudinal studies, where at each measurement occasion the test form administered to a student contains no items that student has answered before. This high degree of flexibility might suggest that incomplete designs are also the ultimate elegant solution for treating missing observations in a data matrix: one simply treats an incomplete data matrix as the realization of an incomplete design. The implication of such an approach – which is flawed in general – is that every item skipped by a student is treated as if it had not been administered. A clever student, aware of this approach, can develop the strategy of skipping all items where he is not very sure about the correct answer. This will increase his test score on the answered items and will in general lead to a biased estimate of his ability.
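A minimal simulation sketch (not from the chapter) can illustrate this bias. It assumes a Rasch model with known item difficulties, a true ability of zero, and a crude model of the skipping strategy in which the student senses, and then skips, about half of the items he would have answered incorrectly; the difficulties, the skipping probability of one half and the number of replications are all hypothetical choices for the illustration.

```python
import math
import random

random.seed(1)
difficulties = [d / 4 for d in range(-8, 9)]   # 17 item difficulties, -2.0 to 2.0


def p_correct(theta, b):
    """Rasch probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def ml_theta(responses, items, lo=-5.0, hi=5.0):
    """Maximum-likelihood ability under the Rasch model with known item
    difficulties, found by bisection on sum_i P_i(theta) = observed score."""
    score = sum(responses)
    if score == 0:
        return lo            # no finite estimate for a zero score
    if score == len(items):
        return hi            # no finite estimate for a perfect score
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(p_correct(mid, b) for b in items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


honest, strategic = [], []
for _ in range(2000):
    # Simulate a full response pattern for a student of true ability 0.
    answers = [(b, int(random.random() < p_correct(0.0, b))) for b in difficulties]
    honest.append(ml_theta([x for _, x in answers], [b for b, _ in answers]))
    # The strategic student skips roughly half of the items he would have
    # answered incorrectly; skipped items are then treated as not administered.
    kept = [(b, x) for b, x in answers if x == 1 or random.random() < 0.5]
    strategic.append(ml_theta([x for _, x in kept], [b for b, _ in kept]))

print(sum(honest) / len(honest))        # close to the true ability of 0
print(sum(strategic) / len(strategic))  # clearly above 0: an upward bias
```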

In general, there is no unique methodology for treating missing observations, and all approaches rest on assumptions that should be carefully checked. More information on treating missing observations can be found in the seminal paper by Rubin (1976) and in Little and Rubin (1987).