6.4.1 Lines of Defense
Programs fail to be correct or reliable because they have faults. Hence any effort to improve reliability and/or to enhance the probability of correctness ought to focus on faults. Ways to deal with faults are traditionally divided into three broad categories:
1. Fault Avoidance. Methods that fall under the header of fault avoidance focus on developing software products that are free of faults by construction. These methods use the type of techniques that we discuss in Chapter 5 to verify that programs are correct as they are constructed. More sophisticated methods turn the verification techniques around to generate methods for developing programs from specifications, by stepwise manipulation of the specification, in a way that ensures the correctness of the final program by construction, rather than by inspection; some models of program construction cast the task of program derivation as a calculation involving the target specification and a design that is taking shape as design decisions are taken. The main difficulty with fault avoidance methods is that they do not scale up easily or reliably to large-scale development.
2. Fault Removal. If we cannot avoid faults at program construction, perhaps we can try to remove them once the program is developed; this is the philosophy of fault removal methods and the focus of software testing. Fault removal methods face two obstacles in practice:
• First, we can never be sure that we have removed all the faults in a program; the methods discussed in Chapter 5 are intended to ensure the absence of faults, to the extent that they scale up to programs of realistic size.
• Second, we can never be sure that while removing one fault we are not inadvertently introducing others. The framework of monotonic fault removal introduced in this chapter is intended to ensure that programs become more-correct with each fault removal, to the extent that it can scale up to programs of significant size and complexity.
One way to increase the effectiveness of fault removal is to target the most egregious faults first, that is, those that have the greatest (negative) effect on reliability, to maximize the return on investment of the fault removal effort. It is also generally agreed that a software product may be reliable despite having faults, provided the residual faults have a low impact on reliability.
3. Fault Tolerance. If we can neither avoid faults as we develop software products, nor remove them from the product after development, we ought to tolerate them and learn to live with them. Fault tolerance consists in admitting the presence of faults in operating software products but taking steps to ensure that faults do not cause failures. This is possible if we monitor program states for any sign that a fault has caused an error, and intervene upon detecting an error to ensure that we avoid failure. Fault tolerance includes run-time steps, namely error detection, damage assessment, and error recovery; it also includes off-line steps, in which error reports are analyzed to diagnose the fault that may have caused the error.
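The run-time steps of fault tolerance can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions, not a method prescribed by this chapter: the routine `faulty_sqrt`, its fault, the detection test, and the recovery routine are all hypothetical, and inputs are assumed to be non-negative. It shows error detection (checking the computed result against the specification), damage assessment (here trivial, since only one local value is affected), error recovery (recomputing by an independent method), and the logging of an error report for off-line diagnosis.

```python
import math

def faulty_sqrt(x):
    """Hypothetical routine with a fault: inputs below 1 are returned unchanged."""
    if x < 1:
        return x  # the fault: the square root is not computed on this path
    return math.sqrt(x)

def error_detected(x, y, tol=1e-9):
    """Error detection: does the result violate the specification y * y == x?"""
    return abs(y * y - x) > tol * abs(x)

def recover(x):
    """Error recovery: recompute the result independently (Newton's iteration)."""
    y = x if x > 1 else 1.0
    for _ in range(60):
        y = 0.5 * (y + x / y)
    return y

def tolerant_sqrt(x, error_log):
    """Run the faulty routine, but intervene before an error becomes a failure."""
    y = faulty_sqrt(x)
    if error_detected(x, y):
        # Damage assessment: only the local result y is affected in this sketch,
        # so recovery replaces y and nothing else.
        error_log.append({"input": x, "bad_output": y})  # report for off-line diagnosis
        y = recover(x)
    return y

log = []
root = tolerant_sqrt(0.25, log)  # the fault is sensitized, the error is masked
```

In this sketch the caller receives a correct result even though the fault remains in `faulty_sqrt`; the entries accumulated in `log` are the error reports that an off-line analysis would use to diagnose the fault.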
Each of these three families of methods has its strengths and weaknesses. The Law of Diminishing Returns advocates using them in concert, deploying each one where it is most effective. The focus of this book is on fault removal, but we may overstep the boundaries of fault removal to the extent that program testing includes any technique that involves observing and analyzing the behavior of candidate programs in execution.
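The idea of targeting the most egregious faults first, raised in the discussion of fault removal above, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the fault identifiers, the impact estimates, and in particular the simplification that each fault contributes independently and additively to the failure rate, which real faults need not do.

```python
def rank_faults(fault_impacts):
    """Return fault identifiers sorted by decreasing estimated impact on reliability."""
    return sorted(fault_impacts, key=fault_impacts.get, reverse=True)

def residual_failure_rate(fault_impacts, removed):
    """Sum the estimated contributions of the faults that are left in place."""
    return sum(rate for fault, rate in fault_impacts.items() if fault not in removed)

# Hypothetical estimates: failures per 1000 executions attributed to each fault.
impacts = {"F1": 2, "F2": 40, "F3": 1, "F4": 7}

order = rank_faults(impacts)  # removal order, highest impact first
after_two = residual_failure_rate(impacts, set(order[:2]))  # rate left after two removals
```

Under these assumed figures, removing only the two highest-impact faults eliminates most of the estimated failure rate, which is the sense in which a program may be reliable despite having residual faults.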