
18.1.4 Criteria for Completion of Testing

A classic question arises every time software testing is discussed: "When are we done testing—how do we know that we've tested enough?" Sadly, there is no definitive answer to this question, but there are a few pragmatic responses and early attempts at empirical guidance.

One response to the question is: "You're never done testing, the burden simply shifts from you (the software engineer) to your customer." Every time the customer/user executes a computer program, the program is being tested. This sobering fact underlines the importance of other software quality assurance activities. Another response (somewhat cynical but nonetheless accurate) is: "You're done testing when you run out of time or you run out of money."

Although few practitioners would argue with these responses, a software engineer needs more rigorous criteria for determining when sufficient testing has been conducted. Musa and Ackerman [MUS89] suggest a response that is based on statistical criteria: "No, we cannot be absolutely certain that the software will never fail, but relative to a theoretically sound and experimentally validated statistical model, we have done sufficient testing to say with 95 percent confidence that the probability of 1000 CPU hours of failure-free operation in a probabilistically defined environment is at least 0.995."
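As a rough illustration of what the last clause of that criterion implies (assuming, purely for the sake of the arithmetic, that failure-free operation is exponentially distributed at a constant failure intensity λ — a simplification that is not part of the quoted statement):

    P(no failure in 1000 CPU hours) = e^(−1000λ) ≥ 0.995  ⇒  λ ≤ −ln(0.995)/1000 ≈ 5.0 × 10⁻⁶ failures per CPU hour

In other words, such a criterion corresponds to a very small residual failure intensity, which is why statistical models of the kind described below are needed to estimate it rather than guess at it.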

FIGURE 18.3  Failure intensity as a function of execution time. The plot shows failures per test hour against execution time t, with the data collected during testing overlaid on the predicted failure intensity, λ(t).

Using statistical modeling and software reliability theory, models of software failures (uncovered during testing) as a function of execution time can be developed [MUS89]. A version of the failure model, called a logarithmic Poisson execution-time model, takes the form

    f(t) = (1/p) ln (λ₀ p t + 1)                                        (18-1)

where

    f(t) = the cumulative number of failures that are expected to occur once the software has been tested for a certain amount of execution time, t,
    λ₀   = the initial software failure intensity (failures per time unit) at the beginning of testing,
    p    = the exponential reduction in failure intensity as errors are uncovered and repairs are made.
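As a minimal sketch of how Equation (18-1) might be evaluated in practice (the function name and the parameter values below are illustrative assumptions, not taken from [MUS89]):

```python
import math

def cumulative_failures(t, lam0, p):
    """Logarithmic Poisson execution-time model, Equation (18-1).

    t    -- execution (test) time accumulated so far, e.g. CPU hours
    lam0 -- initial failure intensity at the start of testing (failures per time unit)
    p    -- exponential reduction in failure intensity as errors are uncovered and repaired
    """
    return (1.0 / p) * math.log(lam0 * p * t + 1.0)

# Illustrative, made-up parameters: 0.5 failures/hour initially, p = 0.02
print(cumulative_failures(100.0, lam0=0.5, p=0.02))  # ~34.7 expected failures after 100 test hours
```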

The instantaneous failure intensity, λ(t), can be derived by taking the derivative of f(t):

    λ(t) = λ₀ / (λ₀ p t + 1)                                            (18-2)

Using the relationship noted in Equation (18-2), testers can predict the drop-off of errors as testing progresses. The actual error intensity can be plotted against the predicted curve (Figure 18.3). If the actual data gathered during testing and the logarithmic Poisson execution-time model are reasonably close to one another over a number of data points, the model can be used to predict the total testing time required to achieve an acceptably low failure intensity.
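One possible realization of that fit-and-predict step is sketched below, assuming NumPy/SciPy; the observed data points, starting guesses, and target intensity are all invented for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def failure_intensity(t, lam0, p):
    """Predicted failure intensity, Equation (18-2)."""
    return lam0 / (lam0 * p * t + 1.0)

# Hypothetical test data: cumulative execution time (hours) and observed failures per test hour.
t_obs = np.array([10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
lam_obs = np.array([0.45, 0.40, 0.33, 0.24, 0.15, 0.09])

# Fit lam0 and p to the observed intensities by least squares.
(lam0_hat, p_hat), _ = curve_fit(failure_intensity, t_obs, lam_obs, p0=[0.5, 0.01])

# Invert Equation (18-2) to predict the execution time needed for a target intensity:
#   lam_target = lam0 / (lam0 * p * t + 1)  =>  t = (lam0 / lam_target - 1) / (lam0 * p)
lam_target = 0.01  # acceptable failures per test hour (a project-specific choice)
t_needed = (lam0_hat / lam_target - 1.0) / (lam0_hat * p_hat)
print(f"lam0 ≈ {lam0_hat:.3f}, p ≈ {p_hat:.4f}, predicted test time ≈ {t_needed:.0f} hours")
```

The predicted time can then be compared with the test time already expended to judge how much additional testing the target failure intensity implies.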

By collecting metrics during software testing and making use of existing software reliability models, it is possible to develop meaningful guidelines for answering the question: "When are we done testing?" There is little debate that further work remains to be done before quantitative rules for testing can be established, but the empirical approaches that currently exist are considerably better than raw intuition.
