
the flag should be kept unchanged in order to preserve the original preference. As shown in Figure 11.32, the external grant signal of a leaf AR2 can be added at the final stage to allow other local logical operations to be finished while waiting for the grant signals from upper layers, which minimizes the total arbitration time.

Suppose N inputs are served in increasing order of their input numbers, i.e., $1 \to 2 \to \cdots \to N \to 1$, under a RR scheme. Each AR2 by itself performs RR service for its two inputs. The Ping-Pong arbitration consisting of the tree of AR2s shown in Figure 11.32 can serve the inputs in the order $1 \to 3 \to 2 \to 4 \to 1$ when $N = 4$, for instance, which is still RR, if each input always has a packet to send and there is no conflict between any of the input request signals. Below, its performance is shown by simulations.
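The service order quoted above for N = 4 can be reproduced with a small behavioral model. The sketch below is not the gate-level circuit of the figures. The class names `AR2` and `PingPongArbiter`, the heap-style node numbering, and the all-LOW initial flags are assumptions made for illustration. Each node's flag is toggled only when that node lies on the granted path, which models the rule that a flag is kept unchanged when the group is not granted by the upper layers.

```python
class AR2:
    """Two-input arbiter with a one-bit flag: 0 favors input 0, 1 favors input 1."""

    def __init__(self):
        self.flag = 0  # assumed initial state: favor the lower-numbered input

    def pick(self, r0, r1):
        """Return the locally preferred input (0 or 1), or None if neither requests."""
        if r0 and r1:
            return self.flag
        if r0:
            return 0
        if r1:
            return 1
        return None

    def commit(self, winner):
        """Favor the other input next time; call only if the group was granted."""
        self.flag = 1 - winner


class PingPongArbiter:
    """Binary tree of AR2s over n inputs (n a power of two), stored heap-style."""

    def __init__(self, n):
        if n < 2 or n & (n - 1):
            raise ValueError("n must be a power of two >= 2")
        self.n = n
        self.nodes = [AR2() for _ in range(n)]  # index 0 unused, index 1 is the root

    def arbitrate(self, requests):
        """Return the granted input index (0-based), or None if nobody requests."""
        n = self.n
        if len(requests) != n:
            raise ValueError("expected one request bit per input")
        # Group requests: each node's request is the OR of its two children.
        req = [False] * n + [bool(r) for r in requests]
        for k in range(n - 1, 0, -1):
            req[k] = req[2 * k] or req[2 * k + 1]
        # Walk the grant down from the root; only nodes on this path update flags.
        k = 1
        while k < n:
            winner = self.nodes[k].pick(req[2 * k], req[2 * k + 1])
            if winner is None:
                return None
            self.nodes[k].commit(winner)
            k = 2 * k + winner
        return k - n


arbiter = PingPongArbiter(4)
order = [arbiter.arbitrate([True] * 4) + 1 for _ in range(5)]
print(order)  # [1, 3, 2, 4, 1] with inputs numbered from 1
```

With every input always requesting, the root alternates between the two leaf AR2s while each leaf alternates between its own two inputs, which yields the 1, 3, 2, 4 pattern.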

11.4.5.2 Performance of PPA

The performance of the PPA, FIFO + RR (FIFO for input queuing and RR for arbitration), and output queuing is compared here. A speedup factor of two is used for PPA and FIFO + RR. Simulation results are obtained from a 32 × 32 switch under uniform traffic (the output address of each segment is equally distributed among all outputs) and under bursty traffic (on-off geometric distribution) with an average burst length of 10 segments. The bursty traffic can be used as a packet traffic model, with each burst representing a packet of multiple segments destined for the same output. The output address of each packet (burst) is also equally distributed among all outputs.

Figure 11.33 shows the throughput and total average delay of the switch under various arbitration schemes. It can be seen that the PPA performs comparably to the output queuing and the FIFO + RR. However, the output queuing is not scalable, and the RR arbitration is slower than the PPA. The overall arbitration time of the PPA for an N-input switch is proportional to $\lceil \log_4 (N/2) \rceil$ when every four inputs are grouped at each layer. For instance, the PPA can reduce the arbitration time of a 256 × 256 switch to 11 gate delays, less than 5 ns using the current CMOS technology.
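The bursty traffic model described above can be sketched as a per-input source. This is only an illustration under stated assumptions: the function name `bursty_source`, the slot-by-slot generator interface, and the way the idle period is derived from a target `load` are not taken from the text, which specifies only the geometric on-off behavior, the average burst length of 10 segments, and the uniform choice of output per burst.

```python
import itertools
import random


def bursty_source(load, mean_burst=10, num_outputs=32, rng=None):
    """Yield one item per time slot: a destination output, or None when idle.

    Burst (on) lengths are geometric with mean `mean_burst`; all segments of a
    burst share one uniformly chosen destination. Idle (off) lengths are
    geometric with a mean chosen so the long-run offered load equals `load`
    (an assumption of this sketch).
    """
    if not 0.0 < load <= 1.0:
        raise ValueError("load must be in (0, 1]")
    rng = rng or random.Random()
    p_end = 1.0 / mean_burst                      # P(burst ends after a segment)
    mean_idle = mean_burst * (1.0 - load) / load  # from load = on / (on + off)
    p_stay_idle = mean_idle / (1.0 + mean_idle)   # geometric on {0, 1, 2, ...}
    while True:
        dest = rng.randrange(num_outputs)         # one destination per burst
        while True:
            yield dest
            if rng.random() < p_end:
                break
        while rng.random() < p_stay_idle:
            yield None


# Usage: 200,000 slots of one input at 50% offered load.
slots = list(itertools.islice(bursty_source(0.5, rng=random.Random(1)), 200_000))
measured_load = sum(s is not None for s in slots) / len(slots)
```

Uniform traffic is the special case `mean_burst=1`, where every segment draws its own destination.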

11.4.5.3 Implementation of PPA

Multiple small arbiters can be recursively grouped together to form a large multilayer arbiter, as illustrated in Figure 11.32. Figure 11.34 depicts an n-input arbiter constructed by using p q-input arbiters (AR-q), from which the group request and grant signals are incorporated into a p-input arbiter (AR-p). Constructing a 256-input arbiter starting from two-input arbiters as basic units is shown below.

Fig. 11.34 Hierarchy of recursive arbitration with $n = pq$ inputs.

Figure 11.35 shows a basic two-input arbiter (AR2) and its logical circuits. The AR2 contains an internally fed-back flag signal, denoted by $F_i$,¹ that indicates which input is favored. When all $G_g$ inputs are 1 (indicating that the two input requests $R_0$ and $R_1$ as a whole are granted by all the upper layers), once one input is granted in an arbitration cycle, the other input will be favored in the next cycle, as shown by the truth table in Figure 11.35(a). This mechanism is maintained by producing an output flag signal, denoted by $F_o$, fed back to the input. Between $F_o$ and $F_i$ there is a D flip-flop, which functions as a register forwarding $F_o$ to $F_i$ at the beginning of each cell time slot. When at least one of the $G_g$ inputs is 0, indicating that the group request of $R_0$ and $R_1$ is not granted at some upper layer(s), we have $G_0 = G_1 = 0$ and $F_o = F_i$; that is, the flag is kept unchanged in order to preserve the original preference. As shown in Figure 11.35(b), the local grant signals have to be ANDed with the grant signals from the upper layers to provide full information on whether the corresponding input is granted or not. The $G_g$ inputs are added at the final stage to allow other local logical operations to be finished, in order to minimize the total arbitration time.

¹ When the flag is LOW, $R_0$ is favored; when the flag is HIGH, $R_1$ is favored.

Fig. 11.35 (a) A two-input arbiter (AR2) and its truth table; (b) its logic circuits. (©1999 IEEE.)

A four-input arbiter module (AR4) has four request signals, four output grant signals, one outgoing group request, and one incoming group grant signal. Figure 11.36(a) depicts our design of an AR4 constructed from three AR2s (two leaf AR2s and one intermediate AR2; all have the same circuitry), two two-input OR gates, and one four-input OR gate. Each leaf AR2 handles a pair of inputs and generates the local grant signals while allowing two external grant signals coming from upper layers: one from the intermediate AR2 inside the AR4, and the other from outside the AR4. These two signals directly join at the AND gates at the final stage inside each leaf AR2 to minimize the delay. Denote by $R_{ij}$ and $G_{ij}$ the group request signal and the group grant signal between input i and input j. The intermediate AR2 handles the group requests ($R_{01}$ and $R_{23}$) and generates the grant signals ($G_{01}$ and $G_{23}$) to each leaf AR2, respectively. It contains only one grant signal, which is from the upper layer for controlling the flag signal.

As shown in Figure 11.36(b), an AR16 contains five AR4s in two layers: four in the lower layer handling the local input request signals, and one in the upper layer handling the group request signals.

Fig. 11.36 (a) A 4-input arbiter (AR4) and (b) a 16-input arbiter (AR16) constructed from five AR4s. (©1999 IEEE.)

Figure 11.37 illustrates a 256-input arbiter (AR256) constructed from AR4s, and its arbitration delay components. The path numbered from 1 to 11 shows the delay from when an input sends its request signal until it receives the grant signal. The first four gate delays (1-4) constitute the time for the input's request signal to pass through the four layers of AR4s and reach the root AR2, where one OR-gate delay is needed at each layer to generate the request signal [see Figure 11.36(a)]. The next three gate delays (5-7) constitute the time while the root AR2 performs its arbitration [see Figure 11.35(b)]. The last four gate delays (8-11) constitute the time for the grant signals in the upper layers to pass down to the corresponding input. The total arbitration time of an AR256 is then 11 gate delays. It thus follows that the arbitration time $T_n$ of an n-input arbiter using such an implementation is

$$T_n = 2 \left\lceil \log_4 \frac{n}{2} \right\rceil + 3. \tag{11.1}$$
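As a quick check of Eq. (11.1), the following sketch evaluates the gate-delay count for a given number of inputs. The function name `arbitration_gate_delays` is hypothetical, and the ceiling of the base-4 logarithm is computed with integer arithmetic to avoid floating-point rounding. For n = 256 it gives the 11 gate delays stated above: four on the way up, three at the root AR2, and four on the way down.

```python
def arbitration_gate_delays(n):
    """Evaluate T_n = 2 * ceil(log4(n / 2)) + 3 for an n-input arbiter."""
    if n < 2:
        raise ValueError("an arbiter needs at least two inputs")
    # Smallest integer `layers` with 4**layers >= n / 2, i.e. ceil(log4(n / 2)).
    layers = 0
    while 2 * 4 ** layers < n:
        layers += 1
    # One gate delay per layer going up, three at the root AR2, one per layer down.
    return 2 * layers + 3


for n in (4, 16, 64, 256):
    print(n, arbitration_gate_delays(n))
```

Under this formula each fourfold increase in the number of inputs adds two gate delays, one on the request path and one on the grant path.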

11.4.5.4 Priority PPA