Mean time between failures


2.2.7.1 Mean time between failures

One of the most important numbers your equipment manufacturer quotes in the specification sheet is the Mean Time Between Failures (MTBF). But this value is frequently misunderstood and misused, so I will discuss the concept a little before going on.

This number just represents a statistical likelihood. It means that half of all equipment of this type (treating the mean as the midpoint of the distribution) will no longer be functioning after this length of time. It does not mean that sudden and catastrophic failure will occur at the stroke of midnight. Failure can happen at any time. But an average alone, with nothing said about the shape of the failure curve, is difficult to work with.

Figure 2-7 shows some possible versions of what the curve might look like. These curves plot the cumulative number of device failures as a function of time. There are N total devices, so at time MTBF, there are N/2 devices remaining. The thick solid line represents a very ideal world where almost all of the gear survives right up until moments before the MTBF. Of course, the price for this is that a large number of devices then all fail at the same time.

Figure 2-7. Mean time between failures, as it relates to probability of failure per unit of time

The dashed line, on the other hand, shows a sort of worst-case curve, in which the same number of devices fail every day. This is probably not a realistic approximation either, because there are a lot of devices that either don't work when you open the box or fail soon after. Then age takes its toll later as gear gradually burns out through heavy use. The dotted curve represents a more realistic curve.

But the interesting thing is that, when you look at these curves, it's clear that the dashed line isn't such a bad approximation after all. It's going to be close. And up until the MTBF time, it will tend to overestimate the probability of failure.
It's always a good idea to overestimate when it comes to probability of failure, because the worst you can do is end up with an unusually stable and reliable network. The dashed line is also going to be the easiest to do calculations with. So the dashed line is the one I use for finding the most common failure modes.

The slope of this line gives the failure rate, the number of failures per unit time, and because it is a straight line, the approximation assumes a constant failure rate. A little arithmetic shows that the line rises by N/2 in a distance of MTBF, so the slope is N/(2 x MTBF). Dividing by N gives the fraction of devices that fail per unit time, 1/(2 x MTBF). So, if the MTBF is 10 years, then you will expect to see 5% of your devices fail every year, on average. If the MTBF is 20 years, then the value drops to 2.5%. Most network-equipment manufacturers quote an MTBF in this range.

If you had only one device, then a 5% per year failure rate is probably quite acceptable. You may not care about redundancy. But this book is concerned with large-scale networks, networks with hundreds or thousands of devices. At 5% per year, out of a network of 1000 devices, you will expect to see 50 failures per year. That's almost one per week. The important point to draw from this is that the more devices you have, the greater the chances are that one of them will fail. So, the more single points of failure in the network, the greater the probability of a catastrophic failure.
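
The arithmetic above is simple enough to check by hand, but a short script makes it easy to try other MTBF values and network sizes. The following is a minimal sketch in Python, assuming the straight-line (dashed) approximation in which half of the devices have failed by the MTBF. The function names are illustrative only and do not come from any library or vendor tool.

```python
def annual_failure_fraction(mtbf_years: float) -> float:
    """Fraction of devices expected to fail per year.

    Assumes the straight-line approximation: failures rise from 0 to N/2
    over one MTBF, so the slope is N / (2 * MTBF). Dividing by N gives
    the per-device rate, 1 / (2 * MTBF).
    """
    if mtbf_years <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / (2.0 * mtbf_years)


def expected_failures_per_year(device_count: int, mtbf_years: float) -> float:
    """Average number of device failures per year across the whole network."""
    if device_count < 0:
        raise ValueError("device count cannot be negative")
    return device_count * annual_failure_fraction(mtbf_years)


if __name__ == "__main__":
    # The two MTBF values used in the text.
    for mtbf in (10, 20):
        rate = annual_failure_fraction(mtbf)
        print(f"MTBF of {mtbf} years: {rate:.1%} of devices fail per year")

    # The 1000-device network from the text, with a 10-year MTBF.
    failures = expected_failures_per_year(1000, 10)
    print(f"1000 devices: {failures:.0f} failures per year, "
          f"or about {failures / 52:.2f} per week")
```

Because the model is a straight line, doubling the number of devices doubles the expected number of failures, and doubling the MTBF halves it. A more realistic curve would need a different formula, so treat these figures as deliberately pessimistic estimates for the period before the MTBF.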

2.2.7.2 Multiple simultaneous failures