INPUT-BUFFERED SWITCHES
78
depart during the second phase. Similarly, cells B-3 and C-4 must depart w
Ž .x during the fourth and sixth phases, respectively see Fig. 3.22 b . Continuing
w Ž .
Ž .x this elimination process see Fig. 3.22 c and d , there is only one possible
scheduling order. For this input traffic pattern, the switch needs all seven
7
phases in four time slots, which corresponds to a minimum speedup of
4 1
Ž .
or 2 y . The proof of the general case for an N = N switch is a straight-
4
forward extension of the 4 = 4 example.
3.5 LOWEST-OUTPUT-OCCUPANCY-CELL-FIRST
ALGORITHM LOOFA
w x
The LOOFA is a work-conserving scheduling algorithm 16 . It provides 100 throughput and a cell delay bound for feasible traffic, using a speedup
of 2. An input᎐output-queued architecture is considered. Two versions of this scheme have been presented: the greedy and the best-first. This scheme
considers three different parameters associated with a cell, say cell c, to perform a match: the number of cells in its destined output queue, or output
Ž . Ž .
occupancy, OCC c ; the timestamp of a cell, or cell age, TS c ; and the smallest port number, to break ties. Under the speedup of 2, each time slot
has two phases. During each phase, the greedy ®ersion of this algorithm works Ž
. as follows see Fig. 3.23 for an example :
1. Initially, all inputs and outputs are unmatched. Ž
2. Each unmatched input selects an active VOQ i.e., a VOQ that has at .
least one cell queued going to the unmatched output with the lowest
Fig. 3.23 A matching example with the greedy LOOFA.
LOWEST-OUTPUT-OCCUPANCY-CELL-FIRST ALGORITHM LOOFA
79
occupancy, and sends a request to that output. Ties are broken by Ž .
selecting the smallest output port number. See Figure 3.23 a . 3. Each output, on receiving requests from multiple inputs, selects the one
with the smallest OCC and sends the grant to that input. Ties are broken by selecting the smallest port number.
4. Return to step 2 until no more connections can be made. An example of the greedy version is shown in Figure 3.23. The tuple X, Y
Ž . in the VOQ represents the output occupancy OCC c and the timestamp
Ž . TS c of cell c, respectively. In the upper part of the figure, the arrows
indicate the destinations for all different cells at the input ports. The gray arrows in the lower part of the figure indicate the exchange of requests and
Ž . grants. The black arrows indicate the final match. Part a shows that each
input sends a request to the output with the lowest occupancy. Output 2 receives two requests, one from A and the other from B, while output 3
Ž . receives a request from input C. Part b illustrates that, between the two
requests, output 2 chooses input A, the one with lower TS. Output 3 chooses the only request, input C.
The best-first version works as follows: 1. Initially, all inputs and outputs are unmatched.
2. Among all unmatched outputs, the output with the lowest occupancy is selected. Ties are broken by selecting the smallest output port number.
All inputs that have a cell destined for the selected output send a request to it.
3. The output selects the cell request input with the smallest time stamp and sends the grant to the input. Ties are broken by selecting the
smallest input port number. Ž
4. Return to step 2 until no more connections can be made or N .
iterations are completed . Figure 3.24 shows a matching example with the best-first version. The
Ž . selection of the output with the lowest OCC c results in a tie: Outputs 2 and
3 have the lowest OCC. This tie is broken by selecting output 2, since this port number is the smaller. Therefore, inputs A and B send a request to this
Ž . Ž .
output as shown in part b , while part c illustrates that output 2 grants the Ž .
oldest cell, input A. Part d shows the matching result after the first Ž .
iteration. The second iteration begins in part e when output 3 is chosen as the unmatched output port with the lowest OCC with requests from inputs B
Ž . Ž .
Ž . and C. Input B is chosen in part f for its lowest TS c . Part g depicts the
final match. Both algorithms achieve a maximal matching, with the greedy version
achieving it in less iterations. On the other hand, it has been proven that, when combined with the oldest-cell-first input selection scheme, the best-first
INPUT-BUFFERED SWITCHES
80
Fig. 3.24 A matching example with the best-first version of LOOFA.
version provides delay bounds for rate-controlled input traffic under a speedup of 2. Denote by D and D the arbitration delay and the output
a o
Ž .
queuing delay of any cell. It can be shown that that D F 4 Nr S y 1 and
a
D F 2 N cell slots, where S is the speedup factor.
o
REFERENCES
1. M. Ajmone, A. Bianco, and E. Leonardi, ‘‘RPA: a simple efficient and flexible policy for input buffered ATM switches,’’ IEEE Commun. Lett., vol. 1, no. 3,
pp. 83᎐86, May 1997. 2. T. Anderson, S. Owicki, J. Saxe, and C. Thacker, ‘‘High speed scheduling for
local area networks,’’ Proc. 5th Int. Conf. on Architecture Support for Programming Languages and Operating Systems, pp. 98᎐110, Oct. 1992.
3. H. J. Chao and J. S. Park, ‘‘Centralized contention resolution schemes for a
large-capacity optical ATM switch,’’ Proc. IEEE ATM Workshop, Fairfax, VA, May 1998.
REFERENCES
81
4. H. J. Chao, ‘‘Satrun: a terabit packet switch using dual round-robin,’’ IEEE
Commun. Mag., vol. 38, no. 12, pp. 78᎐79, Dec 2000. 5. A. Charny, P. Krishna, N. Patel, and R. Simcoe, ‘‘Algorithm for providing
bandwidth and delay guarantees in input-buffered crossbars with speedup,’’ Proc. IEEE IWQoS
X
98, May 1998, pp. 235᎐244. 6. J. S.-C. Chen and T. E. Stern, ‘‘Throughput analysis, optimal buffer allocation,
and traffic imbalance study of a generic nonblocking packet switch,’’ IEEE J. Select. Areas Commun., vol. 9, no. 3, pp. 439᎐449, Apr. 1991.
7. S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, ‘‘Matching output queueing with a combined inputroutput-queued switch,’’ IEEE J. Select. Areas
Commun., vol. 17, no. 6, pp. 1030᎐1039, Jun. 1999. 8. A. Descloux, ‘‘Contention probabilities in packet switching networks with strung
input processes,’’ Proc. ITC 12, 1988. 9. H. ElGebaly, J. Muzio, and F. ElGuibaly, ‘‘Input smoothing with buffering: a new
technique for queuing in fast packet switching,’’ Proc. IEEE Pacific Rim Conf. on Communications, Computer, and Signal Processing, pp. 59᎐62, 1995.
10. K. Genda, Y. Doi, K. Endo, T. Kawamura, and S. Sasaki, ‘‘A 160 Gbrs ATM switching system using an internal speed-up crossbar switch,’’ Proc. GLOBECOM
X
94, Nov. 1994, pp. 123᎐133, 1998. 11. J. N. Giacopelli, W. D. Sincoskie, and M. Littlewood, ‘‘Sunshine: a high perfor-
mance self routing broadband packet switch architecture,’’ Proc. Int. Switching Symp., Jun. 1990.
12. R. Guerin and K. N. Sivarajan, ‘‘Delay and throughput performance of speeded-up input-queueing packet switches,’’ IBM Research Report RC 20892, Jun. 1997.
13. M. G. Hluchyj and M. J. Karol, ‘‘Queueing in high-performance packet switching,’’ IEEE J. Select. Areas Commun., vol. 6, no. 9, pp. 1587᎐1597, Dec. 1988.
14. A. Huang and S. Knauer, ‘‘Starlite: a wideband digital switch,’’ Proc. IEEE Globecom
X
84, pp.121᎐125, Dec. 1984. 15. I. Iliadis and W. E. Denzel, ‘‘Performance of packet switches with input and
output queueing,’’ Proc. ICC
X
90, Atlanta, GA, pp. 747᎐753, Apr. 1990. 16. P. Krishna, N. Patel, A. Charny, and R. Simcoe, ‘‘On the speedup required for
work-conserving crossbar switches,’’ IEEE J. Select. Areas Commun., vol. 17, no. 6, June 1999.
17. R. O. LaMaire and D. N. Serpanos, ‘‘Two dimensional round-robin schedulers for packet switches with multiple input queues,’’ IEEErACM Trans. on Network-
ing, Vol. 2, No. 5, Oct. 1994. 18. T. T. Lee, ‘‘A modular architecture for very large packet switches,’’ IEEE Trans.
Commun., vol. 38, no. 7, pp. 1097᎐1106, Jul. 1990. 19. S. Q. Li, ‘‘Performance of a nonblocking space-division packet switch with
correlated input traffic,’’ IEEE Trans. Commun., vol. 40, no. 1, pp. 97᎐107, Jan. 1992.
20. S. C. Liew, ‘‘Performance of various input-buffered and output-buffered ATM switch design principles under bursty traffic: simulation study,’’ IEEE Trans.
Commun., vol. 42, no. 2r3r4, pp. 1371᎐1379, Feb.rMar.rApr. 1994. 21. P. Newman, ‘‘A fast packet switch for the integrated services backbone network,’’
IEEE J. Select. Areas Commun., vol. 6, pp. 1468᎐1479, 1988.
INPUT-BUFFERED SWITCHES
82
22. M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri, ‘‘Packet scheduling in input-queued cell-based switches,’’ IEEE J. Select. Areas Commun.,
1998, 1998. 23. N. McKeown, P. Varaiya, and J. Warland, ‘‘Scheduling cells in an input-queued
switch,’’ IEEE Electron. Lett., pp. 2174᎐2175, Dec. 9th 1993. 24. N. McKeown, ‘‘Scheduling algorithms for input-queued cell switches,’’ Ph.D.
thesis, Uni®ersity of California at Berkeley, 1995. 25. N. McKeown, ‘‘The iSLIP scheduling algorithm for input-queued switches,’’
IEEErACM Trans. Networking, vol. 7, no. 2, pp. 188᎐201, Apr. 1999. 26. A. Mekkitikul and N. McKeown,
‘‘A starvation-free algorithm for achieving 100 throughput in an input-queued switch,’’ in Proc. ICCCN
X
96, 1996. 27. Y. Oie, M. Murata, K. Kubota, and H. Miyahara, ‘‘Effect of speedup in nonblock-
ing packet switch,’’ Proc. ICC
X
89, pp. 410᎐414, Jun. 1989. 28. A. Pattavina and G. Bruzzi, ‘‘Analysis of input and output queueing for nonblock-
ing ATM switches,’’ IEEErACM Trans. Networking, vol. 1, no. 3, pp. 314᎐328, Jun. 1993.
29. B. Prabhakar and N. McKeown, ‘‘On the speedup required for combined input and output queued switching,’’ Technical Report, CSL-TR-97-738, Computer
Lab., Stanford University. 30. A. Smiljanic, R. Fan, and G. Ramamurthy,
‘‘RRGSᎏround-robin greedy
´
scheduling for electronicroptical switches,’’ Proc. IEEE Globecom
X
99, pp. 1244᎐1250, 1999.
31. A. Smiljanic, ‘‘Flexible bandwidth allocation in terabit packet switches,’’ Proc.
´
IEEE Conf. on High-Performance Switching and Routing, pp. 223᎐241, 2000. 32. I. Stoica and H. Zhang, ‘‘Exact emulation of an output queueing switch by
a combined input and output queueing switch,’’ Proc. IEEE IWQoS
X
98, pp. 218᎐224, May 1998.
H. Jonathan Chao, Cheuk H. Lam, Eiji Oki Copyright 䊚 2001 John Wiley Sons, Inc.
Ž .
Ž .
ISBNs: 0-471-00454-5 Hardback ; 0-471-22440-5 Electronic
CHAPTER 4
SHARED-MEMORY SWITCHES
In shared-memory switches, all input and output ports have access to a common memory. In every cell time slot, all input ports can store incoming
Ž .
cells and all output ports can retrieve their outgoing cells if any . A
shared-memory switch works essentially as an output-buffered switch, and therefore also achieves the optimal throughput and delay performance.
Furthermore, for a given cell loss rate, a shared-memory switch requires less buffers than other switches.
Because of centralized memory management to achieve buffer sharing, however, the switch size is limited by the memory readrwrite access time,
within which N incoming and N outgoing cells in a time slot need to be accessed. As shown in the formula given below, the memory access cycle
must be shorter than 1r2 N of the cell slot, which is the transmission time of a cell on the link:
cell length memory access cycle F
. 2 ⭈ N ⭈ link speed
Ž For instance, with a cell slot of 2.83 s 53-byte cells at the line rate of
. 149.76 Mbitrs, or 155.52 Mbitrs = 26r27 and with a memory cycle time of
10 ns, the switch size is limited to 141. Several commercial ATM switch systems based on the shared memory
architecture provide a capacity of several tens of gigabits per second. Some people may argue that memory density doubles every 18 months and so the
memory saving by the shared-memory architecture is not that significant. However, since the memory used in the ATM switch requires high speed
83
SHARED-MEMORY SWITCHES
84
Ž .
e.g., 5᎐10-ns cycle time , it is expensive. Thus, reducing the total buffer size can considerably reduce the implementation cost. Some shared-
memory switch chip sets have the capability of integrating with other space Ž
switches to build a large-capacity switch e.g., a few hundred gigabits per .
second . Although the shared-memory switch has the advantage of saving buffer
size, the buffer can be occupied by one or a few output ports that are congested and thus leave no room for other cells destined for other output
ports. Thus, there is normally a cap on the buffer size that can be used by any output port.
The following sections discuss different approaches to organize the shared memory and necessary control logics. The basic idea of the shared-memory
switch is to use logical queues to link the cells destined for the same output port. Section 4.1 describes this basic concept, the structure of the logical
queues, and the pointer processing associated with writing and reading cells to and from the shared memory. Section 4.2 describes a different approach to
implement the shared-memory switch by using a content-addressable memory Ž
. Ž
. CAM instead of a random access memory RAM as in most approaches.
Although CAM is not as cost-effective and fast as RAM, the idea of using CAM to implement the shared-memory switch is interesting because it
eliminates the need of maintaining logical queues. The switch size is limited by the memory chip’s speed constraint, but several approaches have been
proposed to increase it, such as the space᎐time᎐space approach in Section 4.3 and multistage shared-memory switches in Section 4.4. Section 4.5 de-
scribes several shared-memory switch architectures to accommodate multi- casting capability.
4.1 LINKED LIST APPROACH