13.3.2 Defining/Quantifying Reliability
The reliability of a software product reflects, broadly speaking, the product’s likelihood of operating free of failure for long periods of time. Whereas correctness is a logical property (a program is or is not correct), reliability is a probabilistic property (it quantifies the likelihood that the program operates failure-free for some unit of time). The first matter to address in defining reliability is which concept of time we are talking about. To this end, we consider three broad classes of software products, which correspond to three different scales of time:
1. A process control system, such as a system controlling a nuclear reactor, a chemical plant, an electric grid, a telephone switch, a flight control system, an autonomous vehicle, and so on. Such systems iterate constantly through a control loop, whereby they probe sensors, analyze their input along with possible state data, compute control parameters, and feed them to actuators. For such systems, time can be measured by clock time or by the number of iterations that the system goes through; these two measures are linearly related, since the sampling period of such systems is usually fixed (e.g., one sample of sensor data every 0.1 second).
2. A transaction processing system, such as an e-commerce system, an airline reservation system, or an online query system. Such systems operate on a stimulus–response cycle: they await user transactions and, whenever a transaction arises, they process it, respond to it, and get ready for the next transaction. Because such systems are driven by user demand, there is no direct relation between clock time and the number of transaction cycles they go through; for such systems, the passage of time is measured by the number of transactions they process.
3. Simple input/output programs, which carry no internal state and merely compute an output from an input provided by a user whenever they are invoked. For such software products, time is equated with the number of invocations/executions.
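For the first class of systems, the linear relation between clock time and iteration count can be made concrete with a short sketch. The function names below are hypothetical, and the sketch assumes a strictly fixed sampling period, which real control loops only approximate.

```python
def iterations_for_duration(duration_seconds: float,
                            sampling_period_seconds: float) -> int:
    """Number of control-loop iterations completed in the given clock time.

    Assumes a fixed sampling period (an idealization).
    """
    if sampling_period_seconds <= 0:
        raise ValueError("sampling period must be positive")
    # round() absorbs floating-point noise in the division
    return round(duration_seconds / sampling_period_seconds)


def duration_for_iterations(iterations: int,
                            sampling_period_seconds: float) -> float:
    """Clock time, in seconds, spanned by the given number of iterations."""
    return iterations * sampling_period_seconds


# One hour of operation at one sample every 0.1 second
one_hour_iterations = iterations_for_duration(3600.0, 0.1)
```

Under this assumption, a reliability requirement stated in hours of failure-free operation can be restated as a number of failure-free iterations, and vice versa. No such conversion is available for the other two classes, where the number of transactions or invocations per unit of clock time depends on user demand.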
Hence when we talk about time in the remainder of this chapter, we may be referring to different measures for different types of programs. Another matter that we must pin down before discussing reliability is the input to a program. Referring back to the three families of software products distinguished above, we observe that only the third operates exclusively on simple inputs; products in the other two families operate on state information in addition to the current input data. When we talk about the input to a program, we therefore refer to input data provided by the user as well as relevant state data or context/environment data.
As a measure of failure avoidance over time, reliability can be quantified in a number of ways, which we explore briefly:
• Probability that the execution of the product on a random input completes without failure.
• Mean time to the next failure.
• Mean time between failures.
• Mean number of failures for a given period of time.
• Probability of failure-free operation for a given period of time.
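To make these measures concrete, the following sketch estimates several of them from a hypothetical failure record, with time counted in executions. All function names and sample figures are illustrative assumptions rather than data from the text. The last function additionally assumes that executions fail independently with the same probability, which is a simplification and not something the definitions above require.

```python
from statistics import mean
from typing import Sequence


def per_execution_reliability(executions: int, failures: int) -> float:
    """Estimated probability that one execution on a random input succeeds."""
    if executions <= 0:
        raise ValueError("need at least one execution")
    return 1.0 - failures / executions


def mean_time_between_failures(failure_times: Sequence[float]) -> float:
    """Average gap between consecutive failures (any consistent time unit)."""
    if len(failure_times) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [later - earlier
            for earlier, later in zip(failure_times, failure_times[1:])]
    return mean(gaps)


def failures_per_period(failure_count: int, observed_duration: float,
                        period: float) -> float:
    """Mean number of failures in a window of length `period`."""
    return failure_count * period / observed_duration


def failure_free_probability(success_probability: float, units: int) -> float:
    """Probability of surviving `units` consecutive executions.

    Assumes independent executions with identical success probability.
    """
    return success_probability ** units


# Hypothetical record: 2000 executions, failures at these execution counts
failure_times = [120.0, 450.0, 900.0, 1500.0]
r = per_execution_reliability(2000, len(failure_times))
mtbf = mean_time_between_failures(failure_times)
rate = failures_per_period(len(failure_times), 2000.0, 500.0)
survival = failure_free_probability(r, 100)
```

The same functions apply unchanged when time is measured in seconds or transactions, provided the unit is used consistently. Mean time to the next failure is omitted because estimating it requires a model of how the failure rate evolves, which this sketch does not attempt.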
It is important to note that reliability is always defined with respect to an implicit (or sometimes explicit) user profile (or usage pattern). A user profile is defined by a probability distribution over the input space: if the input space is a discrete set, the distribution is defined by a function from the set to the interval [0.0 .. 1.0] whose values sum to 1.0; if the input space is a continuous domain, the distribution is defined by a density function whose integral over the input space is 1.0 (the integral over any subset of the input space is the probability that the input falls in that subset). The user profile is important in the study of reliability because the same system may have different reliabilities for different user profiles.
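The dependence of reliability on the user profile can be illustrated with a small sketch over a discrete input space. The program, its failing inputs, and both profiles below are hypothetical and chosen only for illustration. Reliability is taken here in the first sense listed earlier: the probability that an execution on a random input completes without failure.

```python
from typing import Callable, Dict


def reliability(profile: Dict[int, float],
                succeeds: Callable[[int], bool]) -> float:
    """Probability that a random input drawn from `profile` does not fail."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("profile probabilities must sum to 1.0")
    return sum(p for x, p in profile.items() if succeeds(x))


def succeeds(x: int) -> bool:
    """Hypothetical program behavior: it fails on inputs 7, 8, and 9."""
    return x < 7


# Profile A: every input in 0..9 is equally likely
uniform = {x: 0.1 for x in range(10)}

# Profile B: users mostly submit small inputs
skewed = {x: 0.19 for x in range(5)}
skewed.update({x: 0.01 for x in range(5, 10)})

reliability_uniform = reliability(uniform, succeeds)  # about 0.70
reliability_skewed = reliability(skewed, succeeds)    # about 0.97
```

The program and its faults are identical in both cases; only the likelihood of exercising the failing inputs changes, and the reliability changes with it.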
It is common, in the study of reliability, to classify failures into several categories, depending on the impact of the failure, ranging from minor inconvenience to a catastrophic impact involving loss of life, mission failure, national security threats, and so on. We postpone this aspect of the discussion to Section 13.4, where we explore an economic measure of reliability, which refines the concept of failure classification.