• malignant viruses
• propagation engine
• scripting hosts
• self-replicating
• shell
• signature
• virus scanner
• worms
Chapter 9: Creating Fault Tolerance
Overview
Security means more than just keeping hackers out of your computers. It really means keeping your data safe from loss of any kind, including accidental loss due to user error, bugs in
software, and hardware failure.
Systems that can tolerate hardware and software failure without losing data are said to be fault tolerant. The term is usually applied to systems that can remain functional when hardware or
software errors occur, but the concept of fault tolerance can include data backup and archiving systems that keep redundant copies of information to ensure that the information isn't lost if
the hardware it is stored upon fails.
Fault tolerance theory is simple: Duplicate every component that could be subject to failure. From this simple theory spring very complex solutions, like backup systems that duplicate all
the data stored in an enterprise, clustered servers that can take over for one another automatically, redundant disk arrays that can tolerate the failure of a disk in the pack without
going offline, and network protocols that can automatically reroute traffic to an entirely different city in the event that an Internet circuit fails.
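The duplication principle can be shown in miniature. The following Python sketch is a hypothetical illustration only, not a product configuration: it mirrors every write to two paths on separate disks so that the data survives the failure of either one, which is essentially what a RAID-1 mirror does in hardware.

```python
class MirroredFile:
    """Write the same data to two paths on different disks (RAID-1 in miniature).

    Hypothetical illustration only; real mirroring is done by a RAID controller
    or by the operating system, not by application code.
    """

    def __init__(self, primary_path: str, mirror_path: str):
        # The two paths would normally sit on physically separate disks.
        self.paths = [primary_path, mirror_path]

    def write(self, data: bytes) -> None:
        # Duplicate the write; losing one copy does not lose the data.
        for path in self.paths:
            with open(path, "wb") as f:
                f.write(data)

    def read(self) -> bytes:
        # Read from whichever copy is still available.
        for path in self.paths:
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError:
                continue  # that disk has failed; fall back to the mirror
        raise OSError("both copies are unavailable")
```

In practice this job belongs to the disk controller or the operating system's software mirroring; the point is only that the same bytes land on two independent devices.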
Causes for Loss
To correctly plan for fault tolerance, you should consider what types of loss are likely to occur. Different types of loss require different fault tolerance measures, and not all types of
loss are likely to occur to all clients.
fault tolerance
The ability of a system to withstand failure and remain operational.
At the end of each of these sections, there will be a tip box that lists the fault tolerance measures that can effectively mitigate these causes for loss. To create an effective fault
tolerance policy, rank the following causes for loss in the order that you think they're likely to occur in your system. Then list the effective remedy measures for those causes for loss in
the same order, and implement those remedies in top-down order until you exhaust your budget.
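As a concrete sketch of that top-down process, the Python snippet below ranks causes for loss by estimated likelihood and funds their remedies in that order until the budget runs out. All of the causes, likelihoods, remedies, and costs shown are invented for the example; substitute your own estimates.

```python
# Hypothetical figures: rank causes for loss by estimated likelihood,
# then fund their remedies top-down until the budget is exhausted.
causes = [
    # (cause, estimated likelihood, remedy, remedy cost)
    ("Human error",      0.40, "Archiving policy + permissions", 5_000),
    ("Hardware failure", 0.30, "Mirrored disks",                 8_000),
    ("Power failure",    0.15, "UPS units",                      3_000),
    ("Theft",            0.10, "Physical locks and alarms",      4_000),
    ("Natural disaster", 0.05, "Offsite backup rotation",        6_000),
]

budget = 18_000
funded, remaining = [], budget

# Work down the list in order of likelihood, most likely first.
for cause, likelihood, remedy, cost in sorted(causes, key=lambda c: c[1], reverse=True):
    if cost <= remaining:
        funded.append((cause, remedy, cost))
        remaining -= cost

for cause, remedy, cost in funded:
    print(f"{cause}: {remedy} (${cost:,})")
print(f"Unspent budget: ${remaining:,}")
```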
Note The solutions mentioned in this section are covered in the second half of this chapter.
Human Error
User error is the most common reason for loss. Everyone has accidentally lost information by deleting a file or overwriting it with something else. Users frequently play with configuration
settings without really understanding what those settings do, which can cause problems as well. Believe it or not, most computer downtime in businesses is caused by the activities of
the computer maintenance staff. Deploying patches without testing them first can cause servers to fail, and performing maintenance during working hours can cause bugs to manifest and
servers to crash. Leading-edge solutions are far more likely to have undiscovered problems, and routinely selecting them over more mature solutions means that your systems will be less
stable.
Tip A good archiving policy provides the means to recover from human error easily. Use permissions to prevent users’ mistakes from causing widespread damage.
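As one hedged example of the permissions advice in the tip above, the Python sketch below marks a directory of shared master documents read-only for everyone but its owner, so one user's slip cannot overwrite or delete the master copies. The path and the exact permission bits are assumptions for the illustration; on Windows servers you would set the equivalent NTFS permissions instead.

```python
import os
import stat
from pathlib import Path

# Hypothetical shared directory whose master copies users should not modify.
SHARED = Path("/srv/shared/templates")

def make_read_only_for_others(root: Path) -> None:
    """Give the owner full control and everyone else read-only access."""
    for path in root.rglob("*"):
        if path.is_file():
            # Owner: read/write; group and others: read only.
            os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH)
        elif path.is_dir():
            # Directories also need execute (traverse) permission to be browsed.
            os.chmod(path, stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP
                           | stat.S_IROTH | stat.S_IXOTH)

# Example usage (hypothetical path):
# make_read_only_for_others(SHARED)
```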
Routine Failure Events
Routine failure events are the second most likely causes for loss. Routine failures fall into a few categories that are each handled differently.
Hardware Failure
Hardware failure is the second most common reason for loss and is highly likely to occur in servers and client computers. Hardware failure is considerably less likely to occur in devices
that do not contain moving parts.
The primary rule of disk management is: Stay in the mass market—don’t get esoteric. Unusual solutions are harder to maintain, are more likely to have buggy drivers, and are
usually more complex than they are worth.
Every hard disk will eventually fail. This bears repeating: Every hard disk will eventually fail. They run constantly in servers at high speed, and they generate the very heat that destroys
their spindle lubricant. These two conditions combine to ensure that hard disks wear out through normal use within about 10 years.
Note Early in the computer industry, the Mean Time Between Failures (MTBF) of a hard disk drive was an important selling point.
Mean Time Between Failures (MTBF)
The average life expectancy of electronic equipment. Most hard disks have an MTBF of about five years.
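MTBF figures are easier to interpret as an expected number of failures per year. The short calculation below is a simplified sketch: it assumes a constant failure rate, which real drives do not follow exactly, but it makes vendor figures easier to compare.

```python
# Convert a quoted MTBF into an annualized failure expectation.
HOURS_PER_YEAR = 24 * 365

def expected_failures_per_year(mtbf_hours: float, drive_count: int) -> float:
    """Expected number of drive failures per year across a fleet of drives."""
    annual_failure_rate = HOURS_PER_YEAR / mtbf_hours
    return annual_failure_rate * drive_count

# A drive with a five-year MTBF (about 43,800 power-on hours),
# in a server room with 50 such drives:
mtbf = 5 * HOURS_PER_YEAR
print(expected_failures_per_year(mtbf, drive_count=50))  # about 10 failures/year
```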
The real problem with disk failure is that hard disks are the only components in computers that can't simply be swapped out, because they are individually customized with your data. To
tolerate a disk failure without losing your data, you must have a copy of it elsewhere. That elsewhere can be another hard disk in the same computer or in another computer, on tape, or on
removable media.
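Here is a minimal sketch of that "copy of it elsewhere" idea, assuming a second disk mounted at a hypothetical path: copy the file and verify the copy against a checksum before trusting it.

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file so the copy can be verified against the original."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_elsewhere(source: Path, destination: Path) -> None:
    """Copy a file to another disk (or a tape mount, removable drive, etc.)."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)          # preserves timestamps
    if sha256(source) != sha256(destination):  # verify before trusting the copy
        raise RuntimeError(f"copy of {source} failed verification")

# Example usage (hypothetical paths; the second disk would be in another
# computer or drive bay):
# copy_elsewhere(Path("/data/accounts.mdb"), Path("/mnt/backupdisk/accounts.mdb"))
```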
removable media