8.5 EMPIRICAL ADEQUACY ASSESSMENT
Whereas in the foregoing discussions we have attempted to characterize the adequacy of a test set T with respect to test selection requirements by means of analytical arguments, in this section we consider empirical arguments. Specifically, we ponder the question: How can we assess the ability of a test set T to expose faults in candidate programs? A simple-minded way to do this is to run candidate programs on a test set T and see what proportion of faults we are able to expose; the trouble with this approach is that we do not usually know what faults a program has. Hence, if execution of program p on test set T yields no failures, or few failures, we have no way to tell whether this is because the program has no (or few) faults or because test set T is inadequate. To obviate this difficulty, we generate mutants of program p, which are programs obtained by making small changes to p, and we run all these mutants on test set T; we can then assess the adequacy of test set T by its ability to distinguish all the mutants from the original p, and to distinguish them from each other. A note of caution is in order, though: it is quite possible for mutants to be indistinguishable, in the sense that the original program p and its mutant compute the same function; in such cases, the inability of set T to distinguish the two programs does not reflect negatively on T. This means that, in theory, we should run this experiment only on mutants that we know to be distinct (i.e., to compute a different function) from the original; but because it is very difficult in practice to tell whether a mutant does or does not compute the same function as the original, we may sometimes (for complex programs) run the experiment on the assumption that all mutants are distinct from the original, and from each other.
As an illustrative example, we consider the following sorting program, which we had studied in Chapter 6; we call it p.
void somesort (itemtype a[MaxSize], indextype N)        // line 1
 {indextype i; i=0;                                     // 2
  while (i<=N-2)                                        // 3
   {indextype j; indextype mindx; itemtype minval;      // 4
    j=i; mindx=j; minval=a[j];                          // 5
    while (j<=N-1)                                      // 6
     {if (a[j]<minval) {mindx=j; minval=a[j];}          // 7
      j++;}                                             // 8
    itemtype temp;                                      // 9
    temp=a[i]; a[i]=a[mindx]; a[mindx]=temp;            // 10
    i++;}                                               // 11
 }                                                      // 12
Imagine that we have derived the following test data to test this program:
T     Index N   Array a[..]         Comment/rationale
t1    1         [5]                 Trivial size
t2    2         [5,5]               Borderline size, identical elements
t3    2         [5,9]               Borderline size, sorted
t4    2         [9,5]               Borderline size, inverted
t5    6         [5,5,5,5,5,5]       Random size, identical elements
t6    6         [5,7,9,11,13,15]    Random size, sorted
t7    6         [15,13,11,9,7,5]    Random size, inverted
t8    6         [9,11,5,15,13,7]    Random size, random order
The question we ask is: How adequate is this test data? If we run our sorting routine on this data and all executions are successful, how confident can we be that our program is correct? The approach advocated by mutation testing is to generate mutants of program p by making small alterations to its source code and checking to what extent the test data is sensitive to these alterations. Let us, for the sake of argument, consider the following mutants of program p:
void m1 (itemtype a[MaxSize], indextype N)              // line 1
 {indextype i; i=0;                                     // 2
  while (i<=N-1)                                        // 3: changed N-2 into N-1
   {indextype j; indextype mindx; itemtype minval;      // 4
    j=i; mindx=j; minval=a[j];                          // 5
    while (j<=N-1)                                      // 6
     {if (a[j]<minval) {mindx=j; minval=a[j];}          // 7
      j++;}                                             // 8
    itemtype temp;                                      // 9
    temp=a[i]; a[i]=a[mindx]; a[mindx]=temp;            // 10
    i++;}                                               // 11
 }                                                      // 12
void m2 (itemtype a[MaxSize], indextype N)              // line 1
 {indextype i; i=0;                                     // 2
  while (i<=N-2)                                        // 3
   {indextype j; indextype mindx; itemtype minval;      // 4
    j=i; mindx=j; minval=a[j];                          // 5
    while (j<N-1)                                       // 6: changed <= into <
     {if (a[j]<minval) {mindx=j; minval=a[j];}          // 7
      j++;}                                             // 8
    itemtype temp;                                      // 9
    temp=a[i]; a[i]=a[mindx]; a[mindx]=temp;            // 10
    i++;}                                               // 11
 }                                                      // 12
void m3 (itemtype a[MaxSize], indextype N)              // line 1
 {indextype i; i=0;                                     // 2
  while (i<=N-2)                                        // 3
   {indextype j; indextype mindx; itemtype minval;      // 4
    j=i; mindx=j; minval=a[j];                          // 5
    while (j<=N-1)                                      // 6
     {if (a[j]<=minval) {mindx=j; minval=a[j];}         // 7: changed < into <=
      j++;}                                             // 8
    itemtype temp;                                      // 9
    temp=a[i]; a[i]=a[mindx]; a[mindx]=temp;            // 10
    i++;}                                               // 11
 }                                                      // 12
void m4 (itemtype a[MaxSize], indextype N)              // line 1
 {indextype i; i=1;                                     // 2: changed 0 into 1
  while (i<=N-2)                                        // 3
   {indextype j; indextype mindx; itemtype minval;      // 4
    j=i; mindx=j; minval=a[j];                          // 5
    while (j<=N-1)                                      // 6
     {if (a[j]<minval) {mindx=j; minval=a[j];}          // 7
      j++;}                                             // 8
    itemtype temp;                                      // 9
    temp=a[i]; a[i]=a[mindx]; a[mindx]=temp;            // 10
    i++;}                                               // 11
 }                                                      // 12
void m5 (itemtype a[MaxSize], indextype N)              // line 1
 {indextype i; i=0;                                     // 2
  while (i<=N-2)                                        // 3
   {indextype j; indextype mindx; itemtype minval;      // 4
    j=i; mindx=j; minval=a[j];                          // 5
    while (j<=N-1)                                      // 6
     {if (a[j]<minval) {mindx=j; minval=a[j];}          // 7
      j++;}                                             // 8
    itemtype temp;                                      // 9
    a[i]=a[mindx]; temp=a[i]; a[mindx]=temp;            // 10: inverted the first two statements
    i++;}                                               // 11
 }                                                      // 12
Given these mutants, we now run the following test driver, which considers the mutants in turn and checks whether test set T distinguishes them from the original program p.
void main ()
 {for (int i=1; i<=5; i++)              // does T distinguish mutant (i) from p?
   {for (int j=1; j<=8; j++)            // is p(tj) different from mi(tj)?
     {load tj onto N, a;
      run p, store result in a';
      load tj onto N, a;
      run mutant i, compare outcome to a';}
    if one of the tj returned a different outcome from p,
      announce: "mutant i distinguished"
    else announce: "mutant i not distinguished";}
 };                                     // assess T according to how many mutants were distinguished
The actual source code for this driver is shown in the appendix. Execution of this program yields the following output, in which we show for each test datum tj and for each mutant mi whether execution of the mutant on the datum yields the same outcome as execution of the original program p on the same datum (True) or a different outcome (False).

T     m1      m2      m3      m4      m5
t1    True    True    True    True    True
t2    True    True    True    True    True
t3    True    True    True    True    True
t4    True    False   True    False   False
t5    True    True    True    True    True
t6    True    True    True    True    True
t7    True    False   True    False   False
t8    True    False   True    False   False
Before we make a judgment on the adequacy of our test data set, we must first check whether the mutants that have not been distinguished from the original program are identical to it or not (i.e., compute the same function). For example, it is clear from inspection of the source code that mutant m1 is identical to program p: indeed, since program p sorts the array by selection sort, then once it has selected the smallest N-1 elements of the array, the remaining element is necessarily the largest; hence, the array is already sorted. What mutant m1 does is to select the Nth element of the array and permute it with itself, a futile operation, which program p skips. Mutant m3 also appears to compute the same function as the original program p, though it selects a different value for variable mindx when the array contains duplicates; this difference has no impact on the overall function of the program.
The question of whether mutant m2 computes the same function as the original program is left as an exercise.
In general, once we have ruled out mutants that are deemed to be equivalent to the original program, we must consider the mutants that the test data did not distinguish from the original program even though they are distinct, and ask: What additional test data should we generate to distinguish all these mutants? Conversely, we can view the proportion of distinct mutants that the test data has not distinguished as a measure of the inadequacy of the test data, a measure that we should minimize by adding extra test data or refining existing data.
Note that test data t1, t2, t3, t5, and t6 do not appear to help much in testing the sorting program, as they are unable to distinguish any mutant from the original program. In addition to its use in assessing test sets, mutation is also used to automatically correct minor faults in programs when their specification is available and readily testable: one generates many mutants and tests them against the specification using an adequate test set, until one encounters a mutant that satisfies the specification. There is no assurance that such a mutant can be found, nor that only one mutant satisfies the specification, nor that a mutant that satisfies the specification is more correct than the original program; nevertheless, this technique may find some uses in practice.