2.2.7.1 Mean time between failures
One of the most important numbers your equipment manufacturer quotes in the specification sheet is the Mean Time Between Failures (MTBF). But this value is frequently misunderstood and misused, so I will discuss the concept a little before going on.

This number just represents a statistical likelihood. It means that half (because it's a statistical mean) of all equipment of this type will no longer be functioning after this length of time. It does not mean that sudden and catastrophic failure will occur at the stroke of midnight. Failure can happen at any time. But an average alone, with nothing said about the shape of the curve, is difficult to work with.

Figure 2-7 shows some possible versions of what the curve might look like. These curves plot the number of device failures as a function of time. There are N total devices, so at time MTBF, there are N/2 devices remaining. The thick solid line represents a very ideal world where almost all of the gear survives right up until moments before the MTBF. Of course, the price for this is that a large number of devices then all fail at the same time.

Figure 2-7. Mean time between failures, as it relates to probability of failure per unit of time

The dashed line, on the other hand, shows a sort of worst-case curve, in which the same number of devices fail every day. This is probably not a realistic approximation either, because there are a lot of devices that either don't work when you open the box or fail soon after. Then age will take a toll later as gear gradually burns out through heavy use. The dotted curve represents a more realistic curve.

But the interesting thing is that, when you look at these curves, it's clear that the dashed line isn't such a bad approximation after all. It's going to be close. And up until the MTBF time, it will tend to overestimate the probability of failure.
It's always a good idea to overestimate when it comes to probability of failure, because the worst you can do is end up with an unusually stable and reliable network. The dashed line is also the easiest to do calculations with. So the dashed line is the one I use for finding the most common failure modes.

The slope of this line gives the failure rate, the number of failures per unit time, and because it is a straight line, the approximation assumes a constant failure rate. A little arithmetic shows that the line rises by N/2 in a distance of MTBF, so the slope is N / (2 x MTBF). So, if the MTBF is 10 years, then you will expect to see 5% of your devices fail every year, on average. If the MTBF is 20 years, then the value drops to 2.5%. Most network-equipment manufacturers quote an MTBF in this range.

If you had only one device, then a 5% per year failure rate is probably quite acceptable. You may not care about redundancy. But this book is concerned with large-scale networks, networks with hundreds or thousands of devices. At 5% per year, out of a network of 1000 devices, you will expect to see 50 failures per year. That's almost one per week.

The important point to draw from this is that the more devices you have, the greater the chances are that one of them will fail. So, the more single points of failure in the network, the greater the probability of a catastrophic failure.
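The straight-line arithmetic above can be checked with a few lines of Python. This is only a minimal sketch of the constant-failure-rate approximation described in this section; the function names `annual_failure_fraction` and `expected_failures_per_year` are illustrative and do not come from any vendor tool or library.

```python
def annual_failure_fraction(mtbf_years: float) -> float:
    """Fraction of devices expected to fail each year.

    Assumes the straight-line (dashed) approximation: half of the
    devices have failed by the time MTBF is reached, so the slope
    per device is 1 / (2 * MTBF).
    """
    if mtbf_years <= 0:
        raise ValueError("MTBF must be a positive number of years")
    return 1.0 / (2.0 * mtbf_years)


def expected_failures_per_year(device_count: int, mtbf_years: float) -> float:
    """Expected number of device failures per year in a network of device_count devices."""
    if device_count < 0:
        raise ValueError("device_count cannot be negative")
    return device_count * annual_failure_fraction(mtbf_years)


if __name__ == "__main__":
    # MTBF values and network size taken from the examples in the text.
    for mtbf in (10, 20):
        fraction = annual_failure_fraction(mtbf)
        failures = expected_failures_per_year(1000, mtbf)
        print(f"MTBF {mtbf} years: {fraction:.1%} per year, "
              f"about {failures:.0f} failures per year in 1000 devices")
```

Keeping the per-device fraction separate from the network-wide count makes it easy to see how the same MTBF that looks harmless for one device adds up across a large network.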