M. Schrepp / Mathematical Social Sciences 38 (1999) 361–375
correct implications. The more simulated data patterns are available, the better ITA is able to reconstruct the correct implications.
The simulation results differ considerably across the four surmise relations, so the structure of the surmise relation also influences the error rate.
As the results show, the error rate is low if the error probabilities $\alpha$, $\beta$ are low and the number of non-connected pairs in the surmise relation is low, i.e. if the surmise relation is more or less linear. The error rate increases significantly with the number of non-connected pairs in the surmise relation.
For example, for $\sqsubseteq_4$ and $\alpha = \beta = 0.07$ the error rate is around 8. So on average eight of the 56 possible pairs $(i, j)$ are misclassified. The error rate does not decrease with $m$; thus the error is systematic.
4. Improvement of ITA
As a result of the simulation study we modify ITA in two respects. First, we construct only transitive relations. Second, we change the method of determining the most adequate tolerance level.
4.1. Inductive construction of surmise relations
The idea is to define the relations $\sqsubseteq_L$ (we use a different symbol to distinguish these relations from the relations $\le_L$ constructed in ITA) by an inductive construction process. We start this inductive construction process with the transitive relation $\sqsubseteq_0 := \le_0$.
Assume in the induction step that we have already constructed a transitive relation $\sqsubseteq_L$. Define the set $S_{L+1}$ by:
\[ S_{L+1} := \{ (i, j) \mid b_{ij} \le L + 1 \wedge i \not\sqsubseteq_L j \}. \]
The set $S_{L+1}$ consists of all item pairs $(i, j) \notin \sqsubseteq_L$ which have at most $L + 1$ counterexamples in the dataset. The elements of $S_{L+1}$ are the candidates which may be added to $\sqsubseteq_L$ in this step of the construction process.
To ensure that the relation $\sqsubseteq_{L+1}$ is transitive we must avoid adding pairs $(i, j)$ from $S_{L+1}$ to $\sqsubseteq_{L+1}$ which cause an intransitivity to pairs contained in $\sqsubseteq_L$ or other pairs added in this step. A pair $(i, j) \in S_{L+1}$ is contained in an intransitive triple concerning $\sqsubseteq_L$ if there exists $(j, k)$ with $b_{jk} \le L + 1 \wedge b_{ik} > L + 1$ or $(h, i)$ with $b_{hi} \le L + 1 \wedge b_{hj} > L + 1$. Define $T_{\sqsubseteq_L}$ as the set of all $(i, j) \in S_{L+1}$ which are contained in at least one intransitive triple concerning $\sqsubseteq_L$.
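To ensure the transitivity of $\sqsubseteq_{L+1}$ we add only those pairs from $S_{L+1}$ to $\sqsubseteq_L$ which are not contained in an intransitive triple concerning $\sqsubseteq_L$. Therefore,
\[ \sqsubseteq_{L+1} \; := \; \sqsubseteq_L \cup \{ (i, j) \in S_{L+1} \mid (i, j) \notin T_{\sqsubseteq_L} \}. \]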
It is possible that the set $\{ (i, j) \in S_{L+1} \mid (i, j) \notin T_{\sqsubseteq_L} \}$ is empty. In this case $\sqsubseteq_{L+1}$ and $\sqsubseteq_L$ are identical.
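The inductive step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: response patterns are assumed to be 0/1 tuples, relations are stored as sets of index pairs, and the function names (`counterexamples`, `inductive_step`, `inductive_relations`, `is_transitive`) are hypothetical. The test for intransitive triples follows the definition given in the text, using the counterexample counts $b_{ij}$.

```python
from itertools import product


def counterexamples(patterns):
    """b[i][j] = number of patterns r with r[i] == 0 and r[j] == 1."""
    n = len(patterns[0])
    b = [[0] * n for _ in range(n)]
    for r in patterns:
        for i, j in product(range(n), repeat=2):
            if r[i] == 0 and r[j] == 1:
                b[i][j] += 1
    return b


def inductive_step(rel, b, level):
    """Extend the relation for tolerance level - 1 to the tolerance `level`."""
    n = len(b)
    # Candidate set S: pairs not yet in rel with at most `level` counterexamples.
    candidates = {(i, j) for i, j in product(range(n), repeat=2)
                  if b[i][j] <= level and (i, j) not in rel}

    def in_intransitive_triple(i, j):
        for k in range(n):
            # (j, k) is tolerated at this level, but (i, k) is not.
            if b[j][k] <= level and b[i][k] > level:
                return True
            # (k, i) is tolerated at this level, but (k, j) is not.
            if b[k][i] <= level and b[k][j] > level:
                return True
        return False

    return rel | {(i, j) for (i, j) in candidates
                  if not in_intransitive_triple(i, j)}


def inductive_relations(patterns):
    """Relations for tolerance levels 0, 1, ..., max(b), in that order."""
    b = counterexamples(patterns)
    n = len(b)
    # Start relation: all pairs without any counterexample.
    rel = {(i, j) for i, j in product(range(n), repeat=2) if b[i][j] == 0}
    relations = [rel]
    for level in range(1, max(max(row) for row in b) + 1):
        rel = inductive_step(rel, b, level)
        relations.append(rel)
    return relations


def is_transitive(rel):
    """Check transitivity of a relation given as a set of pairs."""
    return all((i, k) in rel
               for (i, j) in rel for (j2, k) in rel if j == j2)


if __name__ == "__main__":
    # Hypothetical toy dataset with three items.
    data = [(0, 1, 1), (0, 0, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1)]
    for level, rel in enumerate(inductive_relations(data)):
        print(level, sorted(rel), is_transitive(rel))
```

In this toy dataset every candidate pair at level 1 lies in an intransitive triple, so the relation for level 1 coincides with the one for level 0, which is the empty case mentioned above.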
The relation $\sqsubseteq_{L+1}$ is transitive, since $\sqsubseteq_L$ was assumed to be transitive and the construction makes sure that no intransitive triples can be added in a step of the construction process. If $\le_{L+1}$ is transitive, then $\sqsubseteq_{L+1}$ is equal to $\le_{L+1}$. The set of surmise relations constructed by this inductive process therefore contains all transitive relations $\le_L$. Note that $i \sqsubseteq_L j$ implies $i \le_L j$ for all $i, j \in I$.
If the item set $I$ is big or if the error probabilities $\alpha$ and $\beta$ are high, then many of the relations $\le_L$ will be intransitive. In this case the number of different relations $\sqsubseteq_L$ will be much higher than the number of transitive relations $\le_L$. Therefore, the inductive construction process allows us to search for the best fitting relation in a bigger set of transitive relations. This point is also discussed in our practical application in the next section.
4.2. A new method to determine the optimal tolerance level
Assume that $\sqsubseteq_L$ is the 'correct' surmise relation. How many counterexamples for $j \to i$ should occur under this assumption? We have to distinguish two cases. First, assume $i \not\sqsubseteq_L j$. Then $b^{*}_{ij}$ should be the expected value of the number of data patterns $r$ with $r_i = 0$ and $r_j = 1$. This number is given by
\[ b^{*}_{ij} = (1 - p_i) \, p_j \, m, \]
where $m$ is the size $|R|$ of the dataset.
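As a purely illustrative calculation for this first case, write $b^{*}_{ij}$ for the expected number of counterexamples and assume that $p_i$ denotes the proportion of data patterns with $r_i = 1$. The numbers below are hypothetical and not taken from the study:

```latex
% Hypothetical values: m = 100 patterns, p_i = 0.6, p_j = 0.5
\[
  b^{*}_{ij} = (1 - p_i)\, p_j\, m = (1 - 0.6) \cdot 0.5 \cdot 100 = 20 .
\]
```

Under these assumed values, about 20 patterns with $r_i = 0$ and $r_j = 1$ would be expected if the two items were unrelated.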
Second, assume $i \sqsubseteq_L j$. Then only violations of $j \to i$ through random errors should occur. The expected number of violations $b^{*}_{ij}$ is in this case given by
\[ b^{*}_{ij} = \gamma_L \, p_j \, m, \]
where $\gamma_L$ is a constant which describes the probability of random errors. Therefore, $b^{*}_{ij}$ is the number of data patterns in which item $j$ is assigned a 1 (since only in this case a violation can occur) multiplied by a constant which reflects the influence of random errors.
The basic idea is to use the comparison of the observed values $b_{ij}$ and the expected values $b^{*}_{ij}$ (under the assumption that $\sqsubseteq_L$ is correct) to determine the most adequate tolerance level. If we assume that $\sqsubseteq_L$ is correct, then we are able to estimate the error constant $\gamma_L$ by:
\[ \gamma_L := \frac{\sum \{\, b_{ij} / (p_j m) \mid i \sqsubseteq_L j \wedge i \neq j \,\}}{|\sqsubseteq_L| - n}, \]
where $n$ is the size of the item set $I$ and $m$ is the number of response patterns in the dataset $R$. Here $b_{ij} / (p_j m)$ is the number of observed counterexamples to $i \sqsubseteq_L j$ relative to the number of cases in which such a counterexample is possible. The value $|\sqsubseteq_L| - n$ is the number of non-reflexive implications in $\sqsubseteq_L$.
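The estimate of the error constant can be sketched as a short Python function. This is an illustration under assumptions rather than a reference implementation: the name `estimate_gamma` is hypothetical, `rel` is a reflexive relation stored as a set of index pairs, `b` is the matrix of observed counterexample counts, and `p[j]` is assumed to be the proportion of patterns in which item j is assigned a 1 (and to be greater than zero).

```python
def estimate_gamma(rel, b, p, m):
    """Mean of b[i][j] / (p[j] * m) over the non-reflexive pairs of rel.

    For a reflexive relation on n items the number of non-reflexive pairs
    equals |rel| - n, the denominator used in the text.
    """
    pairs = [(i, j) for (i, j) in rel if i != j]
    if not pairs:
        # Assumption: without non-reflexive implications there is nothing
        # to estimate, so 0.0 is returned by convention.
        return 0.0
    return sum(b[i][j] / (p[j] * m) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical two-item example with m = 10 response patterns.
    rel = {(0, 0), (1, 1), (0, 1)}
    b = [[0, 1], [4, 0]]
    p = [0.8, 0.5]
    print(estimate_gamma(rel, b, p, m=10))
```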
Table 2
Mean values for the number of item pairs misclassified by the improved ITA

                     $\sqsubseteq_1$       $\sqsubseteq_2$       $\sqsubseteq_3$       $\sqsubseteq_4$
$\alpha$  $\beta$    m = 50   100   200      50   100   200      50   100   200      50   100   200
0.03      0.03         3.8    3     2.8     3.6   2.5   1.6     3.4   2.1   0.7     2.1   1.2   0.6
0.03      0.05         3.5    3.1   2.7     3.9   2.6   1.5     3.8   2.3   1.2     2.8   1.6   1.1
0.03      0.07         3.7    3.2   2.6     4.1   2.6   1.6     4.3   3.1   1.9     3     2.2   1.3
0.05      0.03         3.7    3.2   2.8     4     2.6   1.9     4     2.2   1.4     2.5   1.2   0.7
0.05      0.05         3.7    3.2   2.7     4.4   2.6   1.7     4.1   2.7   1.6     3     1.8   0.8
0.05      0.07         3.7    3.1   2.6     4.6   2.8   1.8     4.9   3.1   2.5     3.2   2.3   1.1
0.07      0.03         4.5    3.1   2.6     4.2   2.9   2       3.8   2.8   1.5     2.5   1.2   0.5
0.07      0.05         4.2    3     2.4     4.8   3.2   1.9     4.3   2.9   1.8     2.8   1.6   0.7
0.07      0.07         3.9    3     2.5     5.2   3     2.1     4.8   3.6   2.5     3.2   2.4   1.2

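The fit between $\sqsubseteq_L$ and the dataset $R$ can now be evaluated by:
\[ \mathrm{diff}(\sqsubseteq_L, R) := \sum_{i \neq j} \frac{(b_{ij} - b^{*}_{ij})^2}{n^2 - n}. \]
The most adequate surmise relation $\sqsubseteq_L$ concerning $R$ is the one with the minimal $\mathrm{diff}(\sqsubseteq_L, R)$ value.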
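The selection by minimal diff value can be sketched as follows. This is a hedged illustration, not the author's code: the function names are hypothetical, relations are reflexive sets of index pairs, `b` holds the observed counterexample counts, and `p[j]` is assumed to be the proportion of patterns in which item j is assigned a 1 (greater than zero). Expected counts use the two cases distinguished above.

```python
def expected_counts(rel, b, p, m):
    """Expected counterexample counts if the relation `rel` were correct."""
    n = len(p)
    pairs = [(i, j) for (i, j) in rel if i != j]
    # Error constant: mean relative violation rate over the implications in rel
    # (0.0 by convention if rel has no non-reflexive pairs).
    gamma = (sum(b[i][j] / (p[j] * m) for i, j in pairs) / len(pairs)
             if pairs else 0.0)
    expected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if (i, j) in rel:
                # Implication assumed: only random errors cause violations.
                expected[i][j] = gamma * p[j] * m
            else:
                # No implication assumed: items treated as independent.
                expected[i][j] = (1 - p[i]) * p[j] * m
    return expected


def diff(rel, b, p, m):
    """Mean squared deviation between observed and expected counts."""
    n = len(p)
    expected = expected_counts(rel, b, p, m)
    total = sum((b[i][j] - expected[i][j]) ** 2
                for i in range(n) for j in range(n) if i != j)
    return total / (n * n - n)


def best_relation(relations, b, p, m):
    """Relation with the minimal diff value among the candidates."""
    return min(relations, key=lambda rel: diff(rel, b, p, m))


if __name__ == "__main__":
    # Hypothetical two-item example with m = 10 response patterns.
    b = [[0, 0], [3, 0]]
    p = [0.8, 0.5]
    candidates = [{(0, 0), (1, 1)},
                  {(0, 0), (1, 1), (0, 1)},
                  {(0, 0), (1, 1), (0, 1), (1, 0)}]
    for rel in candidates:
        print(sorted(rel), diff(rel, b, p, m=10))
```

In practice the candidate list would be the sequence of relations produced by the inductive construction, one per tolerance level.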
In the following, we refer to the analysis method based on the inductive construction of the surmise relations $\sqsubseteq_L$ and on the minimal diff value as the improved ITA.
4.3. Simulation study
The same procedure as described in the simulation study concerning ITA is used. The only difference is that we now use the improved ITA to analyse the simulated dataset $R(\alpha, \beta, m)$. Let $\sqsubseteq_{\mathrm{ITA}}$ be the best solution concerning the improved ITA.
Table 2 shows the simulation results for the different value combinations of $\alpha$, $\beta$ and $m$. For each combination of values 50 simulated datasets are generated and the mean of $\Delta(\sqsubseteq, \sqsubseteq_{\mathrm{ITA}})$ over these 50 simulated datasets is shown.
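The reported quantity can be sketched in Python, assuming that $\Delta$ counts the item pairs contained in exactly one of the two relations, which is how the caption of Table 2 describes it. The formal definition belongs to the first simulation study, so this reading and the function names are assumptions.

```python
def misclassified_pairs(true_rel, found_rel):
    """Number of item pairs on which the two relations disagree."""
    return len(set(true_rel) ^ set(found_rel))


def mean_misclassified(true_rel, found_rels):
    """Mean number of misclassified pairs over several reconstructed relations."""
    return sum(misclassified_pairs(true_rel, f) for f in found_rels) / len(found_rels)


if __name__ == "__main__":
    # Hypothetical relations on three items, stored as sets of index pairs.
    true_rel = {(0, 1), (1, 2)}
    found = [{(0, 1), (1, 2)}, {(0, 1), (2, 1)}]
    print(mean_misclassified(true_rel, found))
```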
As in the first simulation, the value of $\Delta(\sqsubseteq, \sqsubseteq_{\mathrm{ITA}})$ depends on the values of $\alpha$, $\beta$ and $m$. The higher the error probabilities are, the higher is the error rate. The more simulated data patterns are available, the lower is the error rate for fixed values of $\alpha$, $\beta$. As the results show, the structure of the underlying surmise relation also influences the performance of the improved ITA. But in contrast to ITA, the performance of the improved ITA is better for non-linear surmise relations than for linear ones. The effect of the structure is here not as dramatic as for ITA. Another remarkable point is that the influence of $m$ on the error rate is smaller than for ITA. Even with a small number of simulated data patterns most pairs $(i, j)$ are correctly classified concerning their dependency.
5. An example for data analysis with ITA and the improved ITA