TCP Congestion Control Implementation Issues

The considerations for setting the window size are varied. On one hand is the desire to achieve the maximal possible bandwidth. As the bandwidth is limited by BW ≤ window/RTT, situations in which the RTT is large imply the use of a large window and long timeouts. For example, this is the case when satellite links are employed. Such links have a propagation delay of more than 0.2 seconds, so the round-trip time is nearly half a second. For instance, with a 64 KB window and a 0.5 second RTT, the achievable bandwidth is capped at about 128 KB/s, regardless of the raw capacity of the link. The delay on terrestrial links, by contrast, is measured in milliseconds.

Exercise 171 Does this mean that high-bandwidth satellite links are useless?

On the other hand, a high RTT can be the result of congestion. In this case it is better to reduce the window size, in order to reduce the overall load on the network. Such a scheme is used in the flow control of TCP, and is described below.

An efficient implementation employs linked lists and IO vectors

The management of buffer space for packets is difficult for two reasons:

• Packets may come in different sizes, although there is an upper limit imposed by the network.

• Headers are added and removed by different protocol layers, so the size of the message and where it starts may change as it is being processed. It is desirable to handle this without copying the data to a new memory location each time.

The solution is to store the data in a linked list of small units, sometimes called mbufs, rather than in one contiguous sequence. Adding a header is then done by writing it in a separate mbuf, and prepending it to the list. The problem with a linked-list structure is that it is now impossible to define the message by its starting address and size, because it is not contiguous in memory. Instead, it is defined by an IO vector: a list of contiguous chunks, each of which is defined by its starting point and size.
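The mbuf and IO-vector scheme described above can be sketched as follows. This is a simplified model in Python, not the actual kernel data structure; the names Mbuf, prepend_header, and io_vector are illustrative.

```python
# Simplified sketch of mbuf-style buffering. Real mbufs are fixed-size
# kernel structures with internal offsets; here each chunk is just bytes.

class Mbuf:
    """One small unit of a message: a contiguous byte sequence."""
    def __init__(self, data: bytes):
        self.data = data

def prepend_header(chain: list, header: bytes) -> list:
    """Add a protocol header without copying the existing data:
    write it in a new mbuf and prepend it to the list."""
    return [Mbuf(header)] + chain

def io_vector(chain: list) -> list:
    """Describe the non-contiguous message as a list of
    (starting chunk, size) pairs -- the IO vector."""
    return [(m.data, len(m.data)) for m in chain]

# A payload split across two mbufs:
msg = [Mbuf(b"hello "), Mbuf(b"world")]
# Successive protocol layers each prepend their header in a new mbuf:
msg = prepend_header(msg, b"TCP|")
msg = prepend_header(msg, b"IP|")

print(io_vector(msg))
print(b"".join(m.data for m in msg))  # the logical message, reassembled
```

Note that both prepend operations cost one small allocation each; the payload bytes are never copied, which is the point of the scheme.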

13.2.3 TCP Congestion Control

A special case of flow control is the congestion control algorithm used in TCP. As this is used throughout the Internet, it deserves a section in itself. Note that such congestion control is a very special form of resource management by an operating system. It has the following unique properties:

Cooperation — the control is not exercised by a single system, but rather flows from the combined actions of all the systems involved in using the Internet. It only works because they cooperate and follow the same rules.

External resource — the resource being managed, the bandwidth of the communication links, is external to the controlling systems. Moreover, it is shared with the other cooperating systems.

Indirection — the controlling systems do not have direct access to the controlled resource. They can't measure its state, and they can't affect its state. They need to observe the behavior of the communication infrastructure from the outside, and influence it from the outside. This is sometimes called "control from the edge".

Congestion hurts

The reason that congestion needs to be controlled, or rather avoided, is that it has observable consequences. If congestion occurs, packets are dropped by the network (actually, by the routers that implement the network). These packets have to be re-sent, incurring additional overhead and delays. As a result, communication as a whole slows down.

Even worse, this degradation in performance is not graceful. It is not the case that a little congestion causes only a little degradation. Instead, a positive feedback effect makes things progressively worse. The reason is that when a packet is dropped, the sending node immediately sends it again, thus increasing the load on a network that is overloaded to begin with. As a result even more packets are dropped. Once this process starts, it drives the network into a state where most packets are dropped all the time and nearly no traffic gets through.
The first time this happened in the Internet, in October 1986, the effective bandwidth between two sites in Berkeley dropped from 32 Kbps to 40 bps — a drop of nearly 3 orders of magnitude [3].

Acks can be used to assess network conditions

A major problem with controlling congestion is one of information. Each transmitted packet may traverse many different links before reaching its destination. Some of these links may be more loaded than others. But the edge system, the one performing the original transmission, does not know which links will be used and what their load conditions will be.

The only thing that a transmitter knows reliably is when packets arrive at the receiver — because the receiver sends an ack. Of course, it takes the ack some time to arrive, but once it arrives the sender knows the packet went through safely. This implies that the network is functioning well.

On the other hand, the sender doesn't get any explicit notification of failures. The fact that a packet is dropped is not announced with a nack; only packets that arrive but have errors in them are nack'd. Therefore the sender must infer that a packet was dropped by the absence of the corresponding ack. This is done with a timeout mechanism: if the ack did not arrive within the prescribed time, we assume that the packet was dropped.

The problem with timeouts is how to set the time threshold. If we wait and an ack fails to arrive, this could indicate that the packet was dropped. But it could just be the result of a slow link that caused the packet or the ack to be delayed. The question is then how to distinguish between these two cases. The answer is that the threshold should be tied to the round-trip time (RTT): the higher the RTT, the higher the threshold should be.

Details: estimating the RTT

Estimating the RTT is based on a weighted average of measurements, similar to estimating CPU bursts for scheduling.
Each "measurement" is just the time from sending a packet to receiving the corresponding ack. Given our current estimate of the RTT r, and a measurement m, the updated estimate will be

    r_new = α·m + (1 − α)·r

This is equivalent to using exponentially decreasing weights for more distant measurements. If α = 1/2, the weights are 1/2, 1/4, 1/8, . . . If α is small, more old measurements are taken into account. If it approaches 1, more emphasis is placed on the most recent measurement.

The reason for using an average rather than just one measurement is that Internet RTTs have a natural variability. It is therefore instructive to re-write the above expression as [3]

    r_new = r + α·(m − r)

With this form, it is natural to regard r as a predictor of the next measurement, and m − r as an error in the prediction. But the error has two possible origins:

1. The natural variability of RTTs ("noise"), which is assumed to be random, so the noise in successive measurements will cancel out, and the estimate will converge to the correct value.

2. A bad prediction r, maybe due to insufficient data.

The factor α multiplies the total error. This means that we need to compromise. We want a large α to get the most out of the new measurement, and take a big step towards the true average value. But this risks amplifying the random error too much. It is therefore recommended to use relatively small values, such as 0.1 ≤ α ≤ 0.2. This will make the fluctuations of r much smaller than the fluctuations in m, at the price of taking longer to converge.

In the context of setting the threshold for timeouts, it is important to note that we don't really want an estimate of the average RTT: we want the maximum. Therefore we also need to estimate the variability of m, and use a threshold that takes this variability into account.

Start slowly and build up

The basic idea of TCP congestion control is to throttle the transmission rate: do not send more packets than what the network can handle.
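The RTT estimation scheme described earlier can be sketched as follows. This is a minimal Python version; the deviation term and the specific constants (α = 0.125, a threshold of 4 deviations above the estimate) are illustrative assumptions in the spirit of the variability discussion, not values given in the text.

```python
# Exponentially weighted RTT estimate:  r_new = r + alpha * (m - r),
# plus a similar running average of the deviation |m - r|, used to set
# a timeout threshold above the mean (constants are illustrative).

def make_rtt_estimator(alpha=0.125, first=1.0):
    r = first      # smoothed RTT estimate (seconds)
    dev = 0.0      # smoothed deviation estimate
    def update(m):
        nonlocal r, dev
        err = m - r
        r = r + alpha * err                 # same as alpha*m + (1-alpha)*r
        dev = dev + alpha * (abs(err) - dev)
        return r, r + 4 * dev               # (estimate, timeout threshold)
    return update

update = make_rtt_estimator(first=1.0)
for m in [1.0, 1.2, 0.9, 1.1, 2.0]:   # simulated measurements (seconds)
    est, timeout = update(m)
print(round(est, 3), round(timeout, 3))
```

Note how the final outlier measurement (2.0) moves the estimate only slightly, but inflates the deviation and hence the timeout threshold, which is exactly the behavior we want.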
To find out how much the network can handle, each sender starts out by sending only one packet and waiting for an ack. If the ack arrives safely, two packets are sent, then four, etc.

In practical terms, controlling the number of packets sent is done using a window algorithm, just like in flow control. Congestion control is therefore just an issue of selecting the appropriate window size. In actual transmission, the system uses the minimum of the flow-control window size and the congestion-control window size.

The "slow start" algorithm used in TCP is extremely simple. The initial window size is 1. As each ack arrives, the window size is increased by 1. That's it.

Exercise 172 Adding 1 to the window size seems to lead to linear growth of the transmission rate. But this algorithm actually leads to exponential growth. How come?

Hold back if congestion occurs

The problem with slow start is that it isn't so slow, and is bound to quickly reach a window size that is larger than what the network can handle. As a result, packets will be dropped and acks will be missing. When this happens, the sender enters "congestion avoidance" mode. In this mode, it tries to converge on the optimal window size. This is done as follows:

1. Set a threshold window size to half the current window size. Since the window doubled on each round trip, this is the previous window size, which was used and worked OK.

2. Set the window size to 1 and restart the slow-start algorithm. This allows the network time to recover from the congestion.

3. When the window size reaches the threshold set in step 1, stop increasing it by 1 on each ack. Instead, increase it by 1/w on each ack, where w is the window size. This will cause the window size to grow much more slowly — in fact, now it will be linear, growing by 1 packet on each RTT.

The reason to continue growing is twofold: first, the threshold may simply be too small. Second, conditions may change, e.g. a competing communication may terminate, leaving more bandwidth available.
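The combined behavior — exponential slow start, setting the threshold to half the window on loss, then linear growth — can be sketched as a small simulation. This is an idealized model: it assumes all acks of a window arrive together once per RTT, that a loss occurs exactly when the window exceeds a fixed network capacity, and it counts packets rather than bytes. Real TCP handles many more cases.

```python
# Simulate the evolution of the congestion window, one step per RTT.
# capacity is the (hypothetical) largest window the network can carry.

def simulate(capacity=40, rtts=25):
    w = 1.0
    threshold = float("inf")
    history = []
    for _ in range(rtts):
        history.append(int(w))
        if w > capacity:                   # loss detected via missing acks
            threshold = w / 2              # half of what we just tried
            w = 1.0                        # restart slow start
        elif w < threshold:
            w = min(w * 2, threshold)      # slow start: +1 per ack, doubles per RTT
        else:
            w = w + 1                      # avoidance: +1/w per ack, +1 per RTT
    return history

print(simulate())
```

The output shows the characteristic sawtooth: the window shoots up exponentially (1, 2, 4, . . ., 64), collapses to 1 on loss, climbs quickly back to the threshold of 32, and then creeps up linearly (33, 34, . . .) until the next loss.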
To read more: The classic on TCP congestion avoidance and control is the paper by Jacobson [3]. An example of recent developments in this area is the paper by Paganini et al. [4].

13.2.4 Routing