
Fig. 8.1 Crosspoint-buffered switch structure based on round-robin arbitration.

8.1 OVERVIEW OF CROSSPOINT-BUFFERED SWITCHES

A basic crosspoint-buffered switch architecture is shown in Figure 8.1. Each crosspoint has a buffer to store cells that come from the associated input port and are destined for the associated output port. Contention control is needed to resolve contention among crosspoint buffers that belong to the same output port.

One candidate for the contention control is round-robin (RR) arbitration, because it provides fairness and is very simple to implement. The RR arbiter searches, from some starting point, for a crosspoint buffer that has made a request to transfer a cell to the output line. The starting point is just below the crosspoint buffer from which a cell was sent to the output line at the previous cell time. If the RR arbiter finds a request, the cell at the head of that crosspoint buffer is selected for release. At the next cell time, the starting point is reset to just below the selected crosspoint buffer.

Thus, in the worst case, the control signal for ring arbitration must pass through all the crosspoint buffers belonging to the same output line within one cell time. For that reason, in a buffered crossbar that employs RR arbitration for contention control, the maximum output-line speed is limited by the number of input ports, or switch size, and by the transmission delay of the control signals in each crosspoint. The maximum output-line speed $C_{\max}$ (bits/s) is given by the following equation:

$$C_{\max} = \frac{L}{N T_s}, \qquad (8.1)$$

where the number of input ports (in other words, the switch size) is $N$, the transmission delay of the control signals in a crosspoint is $T_s$ (s), and the length of a cell is $L$ (bits). $T_s$ depends on the performance of devices and the distance between crosspoints.

When we construct a large-scale switch, its crossbar function cannot be implemented on one chip, owing to constraints on memory and gate counts and on the number of I/O pins. Therefore, we need to connect several chips to construct a large-scale switch.

As $N$ increases, $C_{\max}$ decreases. For example, at $T_s = 3.0$ ns and $N = 16$, $C_{\max}$ is 8.8 Gbit/s when we set $L$ to $53 \times 8$ bits. Thus, because the crosspoint-buffered switch employs RR arbitration, the arbitration time limits the output-line speed according to the number of input ports, so that the RR arbitration can be completed within one cell time. As a result, unless $T_s$ is made small by using ultrahigh-speed devices, the RR-based switch cannot achieve large throughput.
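The RR search and Eq. (8.1) can be illustrated with a minimal Python sketch. The function names (`rr_select`, `next_start`, `max_output_line_speed`) and the list-of-booleans representation of requests are assumptions made for illustration; they are not taken from any hardware design described here.

```python
def rr_select(requests, start):
    """Return the index of the first requesting crosspoint buffer found when
    searching cyclically from `start`; return None if there is no request."""
    n = len(requests)
    for offset in range(n):
        idx = (start + offset) % n
        if requests[idx]:
            return idx
    return None


def next_start(selected, n):
    """The next search starts just below the buffer that was served."""
    return (selected + 1) % n


def max_output_line_speed(cell_bits, n_ports, t_s):
    """Eq. (8.1): C_max = L / (N * T_s), in bit/s."""
    return cell_bits / (n_ports * t_s)


# Four crosspoint buffers on one output line; buffers 1 and 3 hold cells.
requests = [False, True, False, True]
start = 2
served = rr_select(requests, start)          # buffer 3 is found first
start = next_start(served, len(requests))    # search resumes at buffer 0

# Worked example from the text: L = 53 * 8 bits, N = 16, T_s = 3.0 ns.
c_max = max_output_line_speed(53 * 8, 16, 3.0e-9)
print(f"served={served}, next start={start}, C_max={c_max / 1e9:.1f} Gbit/s")
```

In the worst case the loop in `rr_select` visits all N buffers, which is the software analogue of the control signal traversing every crosspoint on the output line within one cell time.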

8.2 SCALABLE DISTRIBUTED-ARBITRATION SWITCH

This section describes a scalable distributed-arbitration (SDA) switch, which solves the problem of the RR-based switch described in Section 8.1. The SDA switch was developed by Nippon Telegraph and Telephone Corporation (NTT) [2].

8.2.1 SDA Structure

Figure 8.2 shows the structure of the SDA switch. The SDA switch has a crosspoint buffer, a transit buffer, an arbitration-control part (CNTL), and a selector at every crosspoint.

A crosspoint buffer sends a request (REQ) to CNTL if there is at least one cell stored in the crosspoint buffer. A transit buffer stores several cells that are sent from either the upper crosspoint buffer or the upper transit buffer. The transit buffer size is one or a few cells, so that both overflow and underflow can be avoided. The required transit buffer size is determined by the round-trip delay of control signals between two adjacent crosspoints. The transit buffer sends REQ to CNTL, as does the crosspoint buffer, if there is at least one cell stored in the transit buffer. If the transit buffer is full, it sends a not-acknowledgment (NACK) to the upper CNTL.

If there are any REQs and CNTL does not receive NACK from the next lower transit buffer, CNTL selects a cell within one cell time. CNTL determines which cell should be sent according to the following cell selection rule. The selected cell is sent through a selector to the next lower transit buffer or the output line.

The rule selects a cell as follows. If only one of the crosspoint buffer and the transit buffer requests cell release, the cell in the requesting buffer is selected. If both the crosspoint buffer and the transit buffer request cell release, the cell with the larger delay time is selected. The delay time is defined as the time since the cell entered the crosspoint buffer.

Fig. 8.2 Scalable distributed-arbitration switch structure. (© 1997 IEEE.)
One way of comparing the delay times of competing cells is to use a synchronous counter, which needs $S$ bits, together with the same number of overhead bits in each cell. The synchronous counter is incremented by one in each cell time, and all the synchronous counters' values are synchronized. When a cell enters a crosspoint buffer, the value of the synchronous counter is written into the overhead of the cell. When both a crosspoint buffer and a transit buffer issue requests for cell release, the counter values carried by the two cells are compared. If the difference in values is more than $2^{S-1}$, the cell with the smaller value is selected. Conversely, if the difference is equal to or less than $2^{S-1}$, the cell with the larger value is selected. This delay-time comparison works under the condition that the maximum delay time is less than $2^{S-1}$. As will be explained in the next section, $S = 8$ is sufficiently large in the SDA switch.

When the delay time of the cell in the crosspoint buffer equals that of the cell in the transit buffer, CNTL determines which cell should be sent using the second cell selection rule. Let us consider the $k$th crosspoint and transit buffers, counting from the top. The second rule is that the $k$th crosspoint buffer is selected with probability $1/k$, while the $k$th transit buffer is selected with probability $(k-1)/k$. For example, the third crosspoint buffer and the third transit buffer are selected with probabilities $1/3$ and $2/3$, respectively.

Thus the SDA switch achieves distributed arbitration at each crosspoint. The longest control-signal transmission distance for arbitration within one cell time is simply the distance between two adjacent crosspoints, whereas in the conventional switch the control signal for ring arbitration must pass through all the crosspoint buffers belonging to the same output line. For that reason, the arbitration time of the SDA switch does not depend on the number of input ports.
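The two selection rules can be sketched in Python as follows. This is only an illustration under stated assumptions: rather than comparing the two stamped counter values against the $2^{S-1}$ threshold, the sketch computes each cell's delay directly as the modular difference between the current counter value and the stamped value. That formulation is swapped in for readability, and it is valid only while every delay stays below $2^S$. The names `delay` and `select_cell` are hypothetical.

```python
import random

S = 8               # counter width in bits, the value quoted in the text
MOD = 1 << S        # the counter wraps around at 2**S


def delay(now, stamp):
    """Cell times elapsed since the cell was stamped, modulo 2**S."""
    return (now - stamp) % MOD


def select_cell(k, now, xp_stamp, transit_stamp, rng=random):
    """Decide which buffer of the k-th crosspoint (k >= 1) releases a cell.

    A stamp of None means that the buffer has no request.
    Returns 'crosspoint', 'transit', or None.
    """
    if xp_stamp is None and transit_stamp is None:
        return None
    if transit_stamp is None:
        return "crosspoint"
    if xp_stamp is None:
        return "transit"
    d_xp = delay(now, xp_stamp)
    d_tr = delay(now, transit_stamp)
    if d_xp != d_tr:
        # First rule: the cell with the larger delay time is selected.
        return "crosspoint" if d_xp > d_tr else "transit"
    # Second rule: on a tie, the crosspoint buffer wins with probability 1/k.
    return "crosspoint" if rng.random() < 1.0 / k else "transit"


# The counter has wrapped: the crosspoint cell was stamped at 250, now is 3.
print(select_cell(k=3, now=3, xp_stamp=250, transit_stamp=1))
```

In the example the crosspoint cell has waited 9 cell times and the transit cell 2, so the crosspoint buffer is chosen even though its stamp is numerically larger. A plausible reading of the $1/k$ weighting is that the $k$th transit buffer carries traffic from the $k-1$ input ports above it, so a tie is resolved in proportion to the number of inputs each buffer represents.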

8.2.2 Performance of SDA Switch

SDA switch performance was evaluated in terms of delay time and crosspoint buffer size by computer simulation. It is assumed that, in an $N \times N$ crosspoint-buffered switch, the input traffic is random, the input load is 0.95, and cells are distributed uniformly to all crosspoint buffers belonging to the same input line.

The SDA switch ensures delay-time fairness. Figure 8.3 shows the probability of the delay time being larger than $d$ at $N = 8$. The probability is shown for each crosspoint buffer entered by cells. The delay time is defined as the time from the cell's entering the crosspoint buffer until it reaches the output line. In the SDA switch, when $d$ is more than about 10 cell times, all delay times have basically the same probability and delay-time fairness is achieved. (Since it takes at least $N = 8$ cell times for the cell in the top crosspoint buffer to enter the output line, fairness is not maintained at smaller values.)

Fig. 8.3 Delay performance of SDA switch. (© 1997 IEEE.)

In addition, when $d$ is larger than a certain time, the probability of the SDA switch delay time being larger than $d$ is smaller than that of the RR switch, as shown in Figure 8.3. This is because, in the SDA switch, the cell with the largest delay time is selected.

This effect becomes clearer as $N$ increases. Figure 8.4 shows that the maximum delay time ($10^{-4}$ quantile) of the SDA switch does not change very much when $N$ increases, while that of the RR switch increases rapidly. Furthermore, the maximum SDA delay is smaller than $2^7$ (= 128) cell times even at large $N$. This means that a synchronous counter size of just $S = 8$ is sufficient, as mentioned before.

Fig. 8.4 Maximum delay time. (© 1997 IEEE.)

The required crosspoint buffer size of the SDA switch is smaller than that of the RR switch, as shown in Figure 8.5. The required buffer sizes were estimated so as to guarantee a cell loss ratio of $10^{-9}$.
In the SDA switch, since the required buffer sizes differ among the crosspoint buffers, Figure 8.5 shows the smallest (top crosspoint buffer) and the largest (bottom crosspoint buffer) sizes. The sizes of the intermediate crosspoint buffers lie between these two values. Because the SDA switch has a shorter delay time, as explained before, the queue length of the crosspoint buffer is also reduced. This is why the crosspoint buffer size of the SDA switch is less than that of the RR switch.

Fig. 8.5 Required buffer size. (© 1997 IEEE.)

The switch throughput of the SDA switch increases as the switch size $N$ increases, as shown in Figure 8.6. Since the arbitration time does not limit the output-line speed, the SDA switch can be expanded to achieve high switch throughput even if $N$ is large. The switch throughput is calculated as $C_{\max} N$, where $C_{\max}$ is the maximum output-line speed.

On the other hand, the switch throughput of the RR-based switch does not increase when $N$ becomes large. Instead, it depends on the transmission delay of the control signal in a crosspoint. The RR arbitration time limits the output-line speed, and so the RR-based switch is not expandable.

Fig. 8.6 Switch throughput vs. switch size. (© 1997 IEEE.)
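The reason the RR-based throughput is flat in $N$ follows directly from Eq. (8.1). The short derivation below reuses the illustrative values from Section 8.1 ($L = 53 \times 8$ bits, $T_s = 3.0$ ns); those numbers are carried over only as an example and are not read from Figure 8.6.

```latex
\[
  \text{RR switch throughput}
  \;=\; N \, C_{\max}
  \;=\; N \cdot \frac{L}{N T_s}
  \;=\; \frac{L}{T_s}
\]
\[
  \frac{L}{T_s} \;=\; \frac{53 \times 8 \ \text{bits}}{3.0 \times 10^{-9}\ \text{s}}
  \;\approx\; 141\ \text{Gbit/s}
  \qquad \text{for every } N .
\]
```

The factor $N$ cancels, so adding ports to an RR-based switch lowers the per-line speed by exactly the amount needed to keep the total constant.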

8.3 MULTIPLE-QOS SDA SWITCH

Section 8.2 describes an SDA switch that can support a single QoS class. This section describes a multiple-QoS SDA (MSDA) switch, which supports multiple QoS classes by extending the concept of the SDA switch. The MSDA switch was presented in [4, 5]. We call the single-QoS SDA switch described in Section 8.2 SSDA in order to differentiate it from MSDA.

To support multiple QoS classes, priority queuing control at each crosspoint buffer is needed. One priority queuing approach is strict priority control. Consider two priority buffers. Under the strict priority system, cells waiting in the low-priority (delay-tolerant) buffer are served only if there are no cells awaiting transmission in the high-priority (delay-sensitive) buffer. Therefore, in the strict priority discipline, the low-priority traffic effectively uses the residual bandwidth.

However, a problem occurs when we use an SSDA mechanism in a strict priority system that supports multiple QoS classes. The delay time of cells in the low-priority buffer can be very large, and its maximum cannot be bounded by design. Therefore, we cannot use the delay-time-based cell selection mechanism of the SSDA switch for the low-priority class, because the number of bits available for the delay measure in the cell header is limited.

The MSDA switch was developed to support high- and low-priority classes. In order to solve the problem of a cell selection mechanism for the low-priority class, NTT introduced a distributed RR-based cell selection mechanism at each crosspoint for the low-priority class, which avoids using a synchronous counter such as is used for the high-priority class [4, 5]. The low-priority transit buffer at each crosspoint has virtual queues corresponding to the upper input ports. Cells for the low-priority class are selected by distributed ring arbitration among the low-priority crosspoint buffer and the virtual queues at the low-priority transit buffer.
For the high-priority class, the same delay-time-based cell selection mechanism is used as in the SSDA switch. As a result, the proposed MSDA switch ensures fairness in terms of delay time for the high-priority class, while it ensures fairness in terms of throughput for the low-priority class.

8.3.1 MSDA Structure

This subsection describes the structure of the MSDA switch. Although we describe two priority classes here for simplicity, the number of priority classes can easily be extended to more than two.

The low-priority class tolerates delay, while the high-priority class requires a small delay time. In addition, the low-priority class is supposed to be a best-effort service class such as the unspecified bit rate (UBR) class. It requires fairness in terms of throughput rather than in terms of delay time, in order to effectively use the residual bandwidth that is not used by the high-priority traffic. Therefore, the MSDA switch needs a cell selection mechanism that preserves fairness in terms of delay time for the high-priority class and in terms of throughput for the low-priority class. In order to avoid the delay-time-based cell selection mechanism for the low-priority class, a distributed RR-based cell selection mechanism is used at each crosspoint for that class.

Fig. 8.7 Multi-QoS SDA (MSDA) switch structure. (© 1999 IEEE.)

Figure 8.7 shows the structure of the MSDA switch at the $k$th crosspoint. The MSDA switch has a crosspoint buffer and a transit buffer, each consisting of a high-priority buffer and a low-priority buffer, an arbitration-control part (CNTL), and a selector at every crosspoint.

A cell that passes an address filter (AF) enters either the high- or the low-priority crosspoint buffer according to its priority class. At that time, at the high-priority crosspoint buffer, the value of a synchronous counter is written into the cell overhead, as in the SSDA switch. At the low-priority buffer, on the other hand, an input port identifier (ID) is written. For example, at the $k$th crosspoint, the value of the input port ID is $k$. This is used to distinguish which input port a cell comes from. The high- and low-priority crosspoint buffers send REQ to CNTL if there is at least one cell stored in each buffer.
A cell that is transmitted from the upper crosspoint enters either the high-priority transit buffer or the low-priority transit buffer according to its priority class. The low-priority transit buffer has $k - 1$ virtual queues, which are numbered $1, 2, \ldots, k - 1$. A low-priority cell that has input port ID $i$ ($1 \le i \le k - 1$) enters virtual queue $i$. The high-priority transit buffer and the low-priority transit virtual queues send REQ to CNTL if there is at least one cell stored in each buffer or virtual queue. If the high- or low-priority transit buffers are about to become full, they send the not-acknowledgments NACK_H and NACK_L, respectively, to the upper CNTL.

The cell selection algorithm in the MSDA switch is as follows. If CNTL receives NACK_H from the lower high-priority transit buffer, neither a high-priority cell nor a low-priority cell is transmitted. This is because, when the lower high-priority transit buffer is about to become full, there is no chance for the low-priority cell in the lower transit buffer to be transmitted. Low-priority cells cannot be transmitted when there is at least one high-priority REQ from the crosspoint buffer or the transit buffer. When both the high-priority crosspoint buffer and the high-priority transit buffer send REQs to CNTL, the high-priority cell selection rule is the cell selection rule used in the SSDA switch.

Low-priority cells can be transmitted only when there are no REQs from either the high-priority crosspoint buffer or the high-priority transit buffer. If this condition is satisfied and CNTL receives neither NACK_H nor NACK_L from the lower transit buffer, then the low-priority selection rule is used. The low-priority crosspoint buffer and the virtual queues in the low-priority transit buffer send REQs to CNTL as shown in Figure 8.8.

Fig. 8.8 Low-priority selection rule. (© 1999 IEEE.)
Ring arbitration is executed at each crosspoint in a distributed manner. CNTL selects a cell and transmits it to the lower transit buffer. Thus the MSDA switch achieves distributed arbitration at each crosspoint. It uses the delay-time-based cell selection rule for the high-priority buffer and the distributed RR-based cell selection rule for the low-priority class.
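The decision made by CNTL at one crosspoint can be sketched as below. This is a simplified illustration, not the published design: the function name `msda_select`, the encoding of requests, and the position of the crosspoint buffer as the last entry on the ring are all assumptions. For brevity a high-priority tie is given to the transit buffer, whereas the SSDA rule resolves ties probabilistically.

```python
def msda_select(high_xp, high_transit, low_reqs, ring_ptr,
                nack_h=False, nack_l=False):
    """One CNTL decision at the k-th crosspoint.

    high_xp, high_transit: delay (in cell times) of the head high-priority
        cell in the crosspoint / transit buffer, or None if there is no REQ.
    low_reqs: k booleans; entries 0..k-2 stand for virtual queues 1..k-1 of
        the low-priority transit buffer, entry k-1 for the low-priority
        crosspoint buffer (an assumed ordering).
    ring_ptr: index at which the low-priority ring search starts.
    Returns (choice, new_ring_ptr); choice is None when nothing is sent.
    """
    if nack_h:
        # Lower high-priority transit buffer is nearly full: send nothing.
        return None, ring_ptr

    if high_xp is not None or high_transit is not None:
        # Any high-priority REQ blocks all low-priority cells.
        if high_transit is None:
            return ("high", "crosspoint"), ring_ptr
        if high_xp is None:
            return ("high", "transit"), ring_ptr
        # Delay-based rule; ties go to the transit buffer (simplification).
        src = "crosspoint" if high_xp > high_transit else "transit"
        return ("high", src), ring_ptr

    if nack_l:
        return None, ring_ptr

    # Distributed ring arbitration over the low-priority requesters.
    n = len(low_reqs)
    for offset in range(n):
        idx = (ring_ptr + offset) % n
        if low_reqs[idx]:
            return ("low", idx), (idx + 1) % n
    return None, ring_ptr


# Third crosspoint (k = 3): virtual queue 1 and the crosspoint buffer request.
choice, ptr = msda_select(None, None, [True, False, True], ring_ptr=1)
print(choice, ptr)
```

Because the ring pointer advances past whichever requester was served, each upper input port and the local port take turns, which is how throughput fairness arises without a delay stamp on low-priority cells.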

8.3.2 Performance of MSDA Switch

This subsection describes the performance of the MSDA switch. It is assumed that, in an $N \times N$ crosspoint-buffered switch, input traffic for both the high- and low-priority classes is random, and cells are distributed uniformly to all crosspoint buffers belonging to the same input line.

TABLE 8.1 Throughput in MSDA Switch (Case 1)

              Input Load         Throughput
Input Port    High     Low       High     Low
1             0.060    0.050     0.060    0.050
2             0.060    0.050     0.060    0.050
3             0.060    0.150     0.060    0.050
4             0.180    0.050     0.180    0.050
5             0.060    0.050     0.060    0.050
6             0.060    0.050     0.060    0.050
7             0.060    0.050     0.060    0.050
8             0.060    0.050     0.060    0.050
Total         0.600    0.500     0.600    0.400

TABLE 8.2 Throughput in MSDA Switch (Case 2)

              Input Load         Throughput
Input Port    High     Low       High     Low
1             0.060    0.030     0.060    0.030
2             0.060    0.030     0.060    0.030
3             0.060    0.400     0.060    0.190
4             0.180    0.030     0.180    0.030
5             0.060    0.030     0.060    0.030
6             0.060    0.030     0.060    0.030
7             0.060    0.030     0.060    0.030
8             0.060    0.030     0.060    0.030
Total         0.600    0.600     0.600    0.400

Since the high-priority class is not influenced by the low-priority class (although it does influence it), the results for the high-priority class are the same as those of the SSDA switch. Therefore, only the performance for the low-priority class is presented here.

Tables 8.1 and 8.2 show that the MSDA switch maintains fairness in terms of throughput for the low-priority class. We present results for two traffic conditions, case 1 and case 2. The switch size was set to $N = 8$.

In case 1, the high-priority load of the fourth input port is 0.18 and that of the other input ports is 0.06. The low-priority load of the third input port is 0.15 and that of the other input ports is 0.05, as shown in Table 8.1. The total input load is 1.1 (0.6 + 0.5), which is an overload. The output load, which we call the throughput, for the high-priority class is the same as the high-priority input load for each input port. The residual bandwidth (0.4) is divided equally, so every input port obtains a low-priority throughput of 0.05.
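The low-priority throughput columns of Tables 8.1 and 8.2 match what a max-min fair (water-filling) allocation of the residual bandwidth would give. The sketch below checks this under two assumptions that are not stated explicitly in the text: the output-line capacity is normalized to 1.0, and the high-priority class always takes its full 0.6, leaving 0.4 for the low-priority class. The function name `max_min_share` is hypothetical, and the code models only the resulting allocation, not the ring arbitration that produces it.

```python
def max_min_share(demands, capacity, eps=1e-12):
    """Water-filling: repeatedly split the remaining capacity equally among
    the ports that are still unsatisfied; a port that needs no more than the
    equal share receives exactly its demand and leaves the competition."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > eps:
        share = remaining / len(active)
        satisfied = {i for i in active if demands[i] - alloc[i] <= share + eps}
        if not satisfied:
            # Every remaining port wants more than the equal share.
            for i in active:
                alloc[i] += share
            break
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active -= satisfied
    return alloc


residual = 1.0 - 0.6   # capacity left after high-priority traffic (assumed)

case1 = [0.05, 0.05, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05]
case2 = [0.03, 0.03, 0.40, 0.03, 0.03, 0.03, 0.03, 0.03]

print([round(x, 3) for x in max_min_share(case1, residual)])
print([round(x, 3) for x in max_min_share(case2, residual)])
```

In case 1 the equal share of 0.05 satisfies seven ports exactly and caps the third port at 0.05. In case 2 the seven light ports need only 0.03 each, and the 0.19 they leave unused goes to the third port.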
Thus, the residual bandwidth is shared fairly among all the low-priority input traffic, even though the ports request different amounts of bandwidth.

In case 2, the low-priority load of the third input port is 0.4 and that of the other input ports is 0.03, as shown in Table 8.2. The high-priority input load is the same as in case 1. The total input load is 1.2 (0.6 + 0.6), which is also an overload. The low-priority throughput for the input ports other than the third is 0.03, which is the same as the input load, and the low-priority throughput for the third input port is 0.19, which is larger than 0.03. The low-priority throughput is first divided equally into 0.03, which satisfies the input ports other than the third. Since some bandwidth remains, the residual bandwidth is given to the third input port. Therefore, the low-priority throughput of the third input port is 0.19. This means that the MSDA switch achieves max-min fair share for the low-priority class.

REFERENCES

1. H. J. Chao, B.-S. Choe, J.-S. Park, and N. Uzun, "Design and implementation of abacus switch: a scalable multicast ATM switch," IEEE J. Select. Areas Commun., vol. 15, no. 5, pp. 830–843, 1997.
2. E. Oki and N. Yamanaka, "Scalable crosspoint buffering ATM switch architecture using distributed arbitration scheme," Proc. IEEE ATM '97 Workshop, pp. 28–35, 1997.
3. E. Oki and N. Yamanaka, "A high-speed ATM switch based on scalable distributed arbitration," IEICE Trans. Commun., vol. E80-B, no. 9, pp. 1372–1376, 1997.
4. E. Oki, N. Yamanaka, and M. Nabeshima, "Scalable-distributed-arbitration ATM switch supporting multiple QoS classes," Proc. IEEE ATM '99 Workshop, 1999.
5. E. Oki, N. Yamanaka, and M. Nabeshima, "Performance of scalable-distributed-arbitration ATM switch supporting multiple QoS classes," IEICE Trans. Commun., vol. E83-B, no. 2, pp. 204–213, 2000.
6. H. Tomonaga, N. Matsuoka, Y. Kato, and Y. Watanabe, "High-speed switching module for a large capacity ATM switching system," Proc. IEEE GLOBECOM '92, pp. 123–127, 1992.

H. Jonathan Chao, Cheuk H. Lam, Eiji Oki
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-00454-5 (Hardback); 0-471-22440-5 (Electronic)

CHAPTER 9 THE TANDEM-CROSSPOINT SWITCH

The HOL blocking problem in input-buffered switches can be eliminated by using the parallel-switch technique, where one switch fabric consists of multiple switch planes. The switch fabric operates at the line rate, and thus the arbitration timing is relaxed compared with the internal speedup switch architecture. However, the parallel-switch architecture suffers from a cell-out-of-sequence problem at output ports. A resequencing circuit needs to be implemented at the output ports to ensure that cells are delivered in order. For example, timestamps can be carried in the cell headers and stored at output buffers.

A tandem-crosspoint (TDXP) switch [11, 12] developed by NTT logically has multiple crossbar switch planes. These switch planes are connected in tandem at every crosspoint. The TDXP switch achieves a high throughput without increasing the internal speed of the switch fabric. It also preserves the cell-sequence order.

The remainder of this chapter is organized as follows. Section 9.1 briefly reviews basic input- and output-buffered switch architectures. Section 9.2 presents the TDXP switch architecture. Section 9.3 shows its performance. Throughout this chapter, we assume that the switch size is $N \times N$ ($N$ input ports and $N$ output ports). Input and output have the same line speed.

9.1 OVERVIEW OF INPUT–OUTPUT-BUFFERED SWITCHES

A switch with a crossbar structure can be easily scaled because of its modularity. One can build a larger switch simply by adding more crosspoint switch devices. In addition, the cell transmission delay in the switch is smaller than in Banyan-type switches. This is because it has the smallest number of connecting points between any input–output pair.

Fig. 9.1 Basic input–output-buffered switches.

Variants of crossbar-type switches include the input-buffered switch and the output-buffered switch. The advantage of the former is that the operation