change it back? In this case, it is often better to make a note of the additional changes you have made and then proceed with your troubleshooting.
A key element to successful debugging is to control the focus of your investigation so that you are really dealing with the problem. You can usually focus better if you can break the problem into pieces.
Swapping components, as mentioned previously, is an example of this approach. This technique is known by several names—problem decomposition, divide and conquer, binary search, and so on. This
approach is applicable to all kinds of troubleshooting. For example, when your car won't start, first decide whether you have an electrical or fuel supply problem. Then proceed accordingly.
Chapter 12 outlines a series of specific steps you might want to consider.
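The divide-and-conquer idea can be sketched in a few lines of Python. This is only an illustration, not a procedure taken from this book: the function name first_failing, the list of components, and the probe callback are all hypothetical, and the sketch assumes that everything before the broken component passes the probe and everything from that point on fails it.

```python
from typing import Callable, Optional, Sequence


def first_failing(path: Sequence[str],
                  reachable: Callable[[str], bool]) -> Optional[str]:
    """Binary-search an ordered path for the first component that fails.

    Assumes a single break point: every component before it is
    reachable and every component from it onward is not.
    """
    lo, hi = 0, len(path)
    while lo < hi:
        mid = (lo + hi) // 2
        if reachable(path[mid]):
            lo = mid + 1      # the break, if any, lies beyond mid
        else:
            hi = mid          # the break is at mid or before it
    return path[lo] if lo < len(path) else None


# Hypothetical path from a workstation to a remote server.
path = ["workstation NIC", "wall jack", "switch",
        "router", "ISP gateway", "remote server"]


def probe(component: str) -> bool:
    # Stand-in for a real test such as a ping; here the router is "broken".
    return path.index(component) < path.index("router")


print(first_failing(path, probe))
```

With six components, the search needs at most three probes rather than six. In practice the single-break assumption often fails, which is one reason the system failures described next are so much harder to isolate.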
System Failures
The troubleshooting I have described so far can be seen roughly as dealing with normal failures, although there may be nothing terribly normal about them. A second general class
of problems is known as system failures. System failures are problems that stem from the interaction of the parts of a complex system in unexpected ways. They are most often seen
when two or more subsystems fail at about the same time and in ways that interact. However, system failures can result through interaction of subsystems without any
ostensible failure in any of the subsystems.
A classic example of a system failure can be seen in the movie The China Syndrome. In one scene the reactor scrams, the pumps shut down, and the water-level indicator on a strip-chart recorder sticks.
The water level in the reactor becomes dangerously low due to the pump shutdown, but the problem is not recognized because the indicator gives misleading
information. These two near-simultaneous failures conceal the true state of the reactor.
System failures are most pernicious in systems with tight coupling between subsystems and subsystems that are linked in nonlinear or nonobvious ways. Debugging a system failure
can be extremely difficult. Many of the more standard approaches simply don't work. The strategy of decomposing the system into subsystems becomes difficult, because the
symptoms misdirect your efforts. Moreover, in extreme cases, each subsystem may be operating correctly—the problem stems entirely from the unexpected interactions.
If you suspect you have a system failure, the best approach, when feasible, is to substitute entire subsystems. Your goal should not be to look for a restored functioning system, but to
look for changes in the symptoms. Such changes indicate that you may have found one of the subsystems involved. Conversely, if you are working with a problem and the symptoms
change when a subsystem is replaced, this is a strong indication of a system failure.
Unfortunately, if the problem stems from unexpected interaction of nonfailing systems, even this approach will not work. These are extremely difficult problems to diagnose. Each
problem must be treated as a unique, special problem. But again, an important first step is collecting information.
1.2 Need for Troubleshooting Tools
The best time to prepare for problems is before you have them. It may sound trite, but if you don't understand the normal behavior of your network, you will not be able to identify anomalous behavior.
For the proper management of your system, you must have a clear understanding of the current behavior and performance of your system. If you don't know the kinds of traffic, the bottlenecks, or
the growth patterns for your network, then you will not be able to develop sensible plans. If you don't know the normal behavior, you will not be able to recognize a problem's symptoms when you see
them. Unless you have made a conscious, aggressive effort to understand your system, you probably don't understand it. All networks contain surprises, even for the experienced administrator. You only
have to look a little harder.
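To make the idea of a baseline concrete, here is a minimal Python sketch. It assumes you have already collected some measurement of normal behavior, such as link utilization sampled at the same hour on several days. The sample values, the function names, and the threshold of three standard deviations are all illustrative assumptions, not recommendations from this book.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value: float,
                 baseline: Tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # Perfectly flat history: any departure at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical utilization percentages recorded on eight ordinary days.
history = [22, 25, 24, 23, 26, 24, 25, 23]
baseline = build_baseline(history)

print(is_anomalous(25, baseline))   # a typical reading
print(is_anomalous(60, baseline))   # far outside the recorded range
```

The arithmetic is trivial. The point is that neither function can do anything useful unless the history was recorded before the problem appeared.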
It might seem strange to some that a network administrator would need some of the tools described in this book, and that he wouldn't already know the details that some of these tools provide. But there are
a number of reasons why an administrator may be quite ignorant of his network.
With the rapid growth of the Internet, turnkey systems seem to have grown in popularity. A fundamental assumption of these systems is that they are managed by an inexperienced administrator
or an administrator who doesn't want to be bothered by the details of the system. Documentation is almost always minimal. For example, early versions of Sun Microsystems' Netra Internet servers, by
default, did not install the Unix manpages and came with only a few small manuals. Print services were disabled by default.
This is not a condemnation of turnkey systems. They can be a real blessing to someone who needs to go online quickly, someone who never wants to be bothered by such details, or someone who can
outsource the management of her system. But if at some later time she wants to know what her turnkey system is doing, it may be up to her to discover that for herself. This is particularly likely if
she ever wants to go beyond the basic services provided by the system or if she starts having problems.
Other nonturnkey systems may be customized, often heavily. Of course, all these changes should be carefully documented. However, an administrator may inherit a poorly documented system. And, of
course, sometimes we do this to ourselves. If you find yourself in this situation, you will need to discover or rediscover your system for yourself.
In many organizations, responsibilities may be highly partitioned. One group may be responsible for infrastructure such as wiring, another for network hardware, and yet another for software. In some
environments, particularly universities, networks may be a distributed responsibility. You may have very little control, if any, over what is connected to the network. This isn't necessarily bad; it's the
way universities work. But rogue systems on your network can have annoying consequences. In this situation, probably the best approach is to talk to the system administrator or user responsible for the
system. Often he will be only too happy to discuss his configuration. The implications of what he is doing may have completely escaped him. Developing a good relationship with power users may give
you an extra set of eyes on your network. And it is easier to rely on the system administrator to tell you what he is doing than to repeatedly probe the network to discover changes. But if this fails, as it
sometimes does, you may have to resort to collecting the data yourself.
Sometimes there may be some unexpected, unauthorized, or even covert changes to your network. Well-meaning individuals can create problems when they try to help you out by installing equipment
themselves. For example, someone might try installing a new computer on the network by copying the network configuration from another machine, including its IP address. At other times, some volunteer
administrator simply has her own plans for your network.
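A copied IP address often shows up as one address answering from more than one hardware address. The Python sketch below assumes you have already gathered (IP address, MAC address) pairs by some means, for example from ARP caches or captured traffic. The function name and the sample data are hypothetical; the addresses come from ranges reserved for documentation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_duplicate_ips(
        observations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return each IP address seen with more than one MAC address."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs)
            for ip, macs in macs_by_ip.items()
            if len(macs) > 1}


# Hypothetical observations; 192.0.2.10 appears with two different MACs.
observations = [
    ("192.0.2.10", "00:00:5E:00:53:01"),
    ("192.0.2.11", "00:00:5e:00:53:02"),
    ("192.0.2.10", "00:00:5e:00:53:03"),
    ("192.0.2.10", "00:00:5e:00:53:01"),
]

print(find_duplicate_ips(observations))
```

A match is a hint rather than proof. A replaced network card or a failover configuration can legitimately produce the same pattern, so treat the result as a reason to investigate further.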
Finally, almost to a person, network administrators must teach themselves as they go. Consequently, for most administrators, these tools have an educational value as well as an administrative value. They
provide a way for administrators to learn more about their networks. For example, protocol analyzers like ethereal provide an excellent way to learn the inner workings of a protocol like TCP/IP. Often,
more than one of these reasons may apply. Whatever the reason, it is not unusual to find yourself reading your configuration files and probing your systems.
1.3 Troubleshooting and Management