2016 IEEE 17th International Conference on High Performance Switching and Routing
A Hardware-accelerated Infrastructure for Flexible
Sketch-based Network Traffic Monitoring
Theophilus Wellem∗‡ , Yu-Kuen Lai† , Chao-Yuan Huang† , and Wen-Yaw Chung∗
∗ Department of Electronic Engineering, † Department of Electrical Engineering
Chung Yuan Christian University, Zhongli 32023, Taiwan
Email: {g10202604, ylai, g10279008, eldanny}@cycu.edu.tw
‡ Department of Informatics
Satya Wacana Christian University, Salatiga 50711, Indonesia
Email: [email protected]
Abstract—Sketch-based data streaming algorithms are used in many network traffic monitoring applications to obtain accurate estimates of traffic flows. However, their flexibility is limited because a hardware implementation of sketch counters may not be reused for different measurement tasks. In this paper, we develop a generic hardware infrastructure for collecting flow statistics. The goal is to adopt various sketch-based algorithms with arbitrary flow aggregations for monitoring applications and measurement tasks in a flexible manner. A multiple-choice hashing scheme with linear probing is utilized for the high-speed counter update process. Simulation results based on real traffic traces for monitoring applications are presented. The proposed hardware infrastructure is implemented on the NetFPGA-10G platform. The system is capable of processing network traffic at 53 Gbps in the worst-case scenario of 64-byte minimum-sized Ethernet frames.
Index Terms—Network traffic monitoring; Sketch; 5-universal
hash; Hash table; Linear probing; NetFPGA-10G;
I. INTRODUCTION
Flow-based traffic monitoring is important for daily network operations. The measurement of flow-level statistics such as flow size distribution and active flow count can provide information for bandwidth usage analysis, capacity planning, accounting, and anomaly detection [1]. NetFlow [2] is the standard tool used by network devices for creating and exporting flow information. Per-flow traffic measurement needs to maintain a counter for each flow during an observation period. Due to the high traffic volume and number of flows on high-speed links (10 Gbps and 40/100 Gbps), packets are sampled and aggregated into flows in NetFlow. However, several studies have shown that packet sampling is not sufficient for fine-grained monitoring applications [3], [4].
Alternatively, sketch-based data streaming algorithms are widely used in many high-speed network monitoring applications (e.g., heavy hitter detection [5], change detection [6], flow size distribution [7], superspreader detection [8], and entropy estimation [9], [25]). These algorithms use a sketch data structure to summarize high-speed network traffic into a compact representation without packet sampling. With only a small amount of memory space, the sketch is capable of summarizing network traffic by updating its counters for each incoming packet at wire speed. Since the update process is the critical bottleneck, the sketch data structure is implemented in the fast memory (SRAM) of the network device line card.
Sketch-based data streaming algorithms are efficient and have provable memory-accuracy tradeoffs for traffic flow measurement. An accurate estimate for a specific traffic measurement task can be obtained by processing the sketch data structure. However, sketch-based methods are too application-specific and lack the generality to be implemented as primitive operations on network devices [10], [11]. Sketch-based methods require a separate instance of the sketch data structure for each flow key, which is defined according to the measurement task of interest. Therefore, the flexibility of using the same sketch for various monitoring applications is limited once it is implemented directly in hardware.
Motivated by the previous works on a "minimalist" approach to flow monitoring [11], [12] and OpenSketch [10], this paper extends our preliminary work [13] to create a generic hardware infrastructure for a flexible sketch-based traffic monitoring system. The FPGA-based fast data plane proposed in [13] is designed for flow-level data collection without packet sampling. It consists of two tables: a flow counter table and a flow key table. The flow counter table maintains packet counts and byte counts, while the flow key table stores 5-tuple keys. The contents of these tables are sent to host CPUs and processed. Therefore, as required by different monitoring applications, it is possible to update various sketch data structures with a flexible selection of flow keys.
In this work, we present the infrastructure design of a flow counter table that supports greater flexibility in utilizing sketch-based algorithms for traffic monitoring. We need to emphasize that, in contrast to many works on counter architectures [14]–[20], the flow counter table in this work is neither targeted at per-flow traffic measurement nor intended to minimize the required SRAM size. Instead, it aims to support the flexibility of sketch-based monitoring applications. Due to the requirement of updating flow counters at wire speed, a hash table with linear probing for collision resolution is used for the table implementation. Linear probing is chosen because of its simplicity of implementation, supporting very fast lookup and insertion with burst operations in QDRII SRAM. Recent works showed that, using a 5-universal hash function, linear probing achieves the expected performance [21], [22]. The average number of probes per update is a constant of no more than four [23]. Furthermore, a multiple-choice hashing scheme
with stash [24] is implemented to improve the table utilization.
The main contributions of this work are:
• We propose a flow counter table scheme using a hash table with linear probing that supports various sketch-based algorithms for traffic monitoring.
• We present the design details and performance evaluation of the FPGA prototype. The system is capable of supporting up to 2 million flow entries and achieving a throughput of 53 Gbps in the worst-case scenario of 64-byte minimum-sized Ethernet frames.
The rest of this paper is organized as follows. The sketch data structure and its applications for network traffic monitoring are outlined in Section II. Section III presents the system architecture for flexible sketch-based network traffic monitoring. This section also describes the flow counter table implementation and its evaluation in software. Trace-driven simulation results using a real traffic trace are presented. In Section IV, the hardware infrastructure of the flow counter table and its evaluation are explained. Section V discusses the system's limitations, scalability, and some design issues for implementing the table with stash in hardware. Section VI gives examples of applications that can utilize the hardware infrastructure designed in Section IV. Related work is briefly reviewed in Section VII. Finally, Section VIII concludes this paper.
II. BACKGROUND
A sketch is a data structure that represents a compact summary of a data stream. It can be queried to obtain statistics about the data stream. A sketch can be arranged as an array of counters C[i][j], 1 ≤ i ≤ d, 1 ≤ j ≤ w. It consists of d hash tables of size w. Each row is associated with a hash function selected from a universal class of hash functions. When an item a = (k, v) arrives, the key k is hashed by the d hash functions and the corresponding counters are updated by v. A query of the sketch can yield the estimated count for the key k. For example, a query of a key to the Count-Min sketch [26] returns the minimum value among all counters that correspond to that key. One of the important properties of the sketch data structure is its linearity, which allows sketches from different observation intervals to be combined using arithmetic operations.
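As an illustration of the update and query operations described above, the following minimal Python sketch implements a Count-Min sketch. The per-row hash functions here are salted built-in hashes used purely for illustration; they are not the universal hash families assumed by the formal error bounds or used in the actual design.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: d rows of w counters, one hash function per row."""
    def __init__(self, d=4, w=2048, seed=1):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.counters = [[0] * w for _ in range(d)]
        # One random salt per row stands in for a hash drawn from a universal family.
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _index(self, row, key):
        return hash((self.salts[row], key)) % self.w

    def update(self, key, value=1):
        # Update the counter the key maps to in every row (e.g., once per packet).
        for i in range(self.d):
            self.counters[i][self._index(i, key)] += value

    def query(self, key):
        # Return the minimum over the d counters the key maps to.
        return min(self.counters[i][self._index(i, key)] for i in range(self.d))

# Example: count packets per source IP.
cms = CountMinSketch()
for src in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.update(src, 1)
print(cms.query("10.0.0.1"))  # >= 2; overestimates only when other keys collide in every row
```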
To use a sketch for traffic measurement, the sketch data structure must be updated based on a pre-defined flow key, which is a combination of fields from the packet header. The update value can be the packet size or the packet count. For example, the heavy hitter detection application uses the source IP address and the packet count as its key and update value, respectively. In the change detection application [6], the sketch is updated using the source IP address and the packet size. In superspreader detection applications [8], [27], [28], the update process sets bits in a bitmap sketch based on the source IP address and the pair of source and destination IP addresses.
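To make the flexibility of flow key selection concrete, the short sketch below maps a packet's header fields to the (key, update value) pair each task would feed to its sketch; the packet record and field names are hypothetical and used only for illustration.

```python
# Hypothetical packet record; the field names are chosen only for illustration.
pkt = {"src_ip": "10.0.0.1", "dst_ip": "192.168.1.7",
       "src_port": 443, "dst_port": 52100, "proto": 6, "length": 1500}

def flow_key_and_value(task, pkt):
    """Return the (key, value) pair a given measurement task would feed to its sketch."""
    if task == "heavy_hitter":        # key: source IP, value: packet count
        return pkt["src_ip"], 1
    if task == "change_detection":    # key: source IP, value: packet size
        return pkt["src_ip"], pkt["length"]
    if task == "superspreader":       # key: (source IP, destination IP), value: presence bit
        return (pkt["src_ip"], pkt["dst_ip"]), 1
    if task == "flow_size":           # key: 5-tuple, value: packet count
        return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
                pkt["dst_port"], pkt["proto"]), 1
    raise ValueError(task)

print(flow_key_and_value("superspreader", pkt))
```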
III. SYSTEM ARCHITECTURE
A. Design Overview
Fig. 1. System architecture. Monitoring applications (e.g., entropy estimation, superspreader detection, change detection) update and query sketches in software; the data plane consists of a header parser (field extractor), a hash function module, buffers/controllers, a Bloom filter, the flow counter table, and the flow key table.

The proposed system for flexible implementation of sketch-based traffic monitoring is presented in Fig. 1. The upper part
is implemented in software on a multi-core CPU for parallel processing of sketches, while the bottom data plane is implemented in FPGA hardware. As highlighted in Fig. 1, the bottom part consists of two main tables. The flow counter table stores the flow ID, packet count, and byte count. The flow key table contains the 5-tuple flow information.
When a packet arrives, its 5-tuple (source IP, destination IP, source port, destination port, and protocol) header values are extracted as the flow key by the header parser module and hashed by the hash function module. It is important to keep the flow counter update within a fixed number of memory cycles for line-rate packet processing. Therefore, the flow counter table is constructed based on the scheme of hashing with linear probing. A 5-universal hash function is used to achieve a constant average number of probes per update [23]. By leveraging the four-word burst access of the QDRII SRAM, the shortest memory update latency can be achieved. A Bloom filter is utilized to check the presence of the flow in the flow key table. If the flow does not exist in the current observation interval, the 5-tuple information is added to the RLDRAM-based flow key table in a first-in first-out fashion. At the end of the observation interval, the values in both the flow counter table and the flow key table are moved to the host CPU. The goal is to adopt various sketch-based algorithms with arbitrary flow aggregations for monitoring applications and measurement tasks in a flexible manner. We focus on the implementation of the flow counter table in this paper.
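The following Python model is a behavioral sketch of the flow counter update path described above, assuming open addressing with linear probing and a placeholder hash in place of the 5-universal function; it is not the hardware implementation, but it mirrors how a packet either creates a new entry or increments an existing flow's packet and byte counts.

```python
class FlowCounterTable:
    """Behavioral model of the flow counter table: open addressing with linear probing.

    Each slot holds [flow_id, packet_count, byte_count]. In hardware, four
    consecutive slots are fetched in one QDRII burst and compared in parallel;
    here they are simply probed one after another.
    """
    def __init__(self, size=512 * 1024, max_probes=None):
        self.size = size
        self.slots = [None] * size
        self.max_probes = max_probes      # None models the one-table-without-stash scheme

    def _hash(self, flow_id):
        # Placeholder for the 5-universal hash used in the real design.
        return hash(flow_id) % self.size

    def update(self, flow_id, pkt_len):
        """Add one packet of pkt_len bytes to the flow. Returns the number of probes used,
        or -1 if the probe limit was reached (the flow would then go to the stash)."""
        idx = self._hash(flow_id)
        limit = self.max_probes if self.max_probes is not None else self.size
        for probes in range(1, limit + 1):
            slot = self.slots[idx]
            if slot is None:                      # empty slot: create a new flow entry
                self.slots[idx] = [flow_id, 1, pkt_len]
                return probes
            if slot[0] == flow_id:                # existing flow: update its counters
                slot[1] += 1
                slot[2] += pkt_len
                return probes
            idx = (idx + 1) % self.size           # collision: probe the next slot
        return -1

table = FlowCounterTable(size=8, max_probes=4)
print(table.update("flowA", 64), table.update("flowA", 1500))  # 1 1 (same slot reused)
```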
B. System Simulation and Evaluation
The proposed system model is first constructed in software to evaluate the performance of flow counter update operations. Three schemes with different numbers of tables are used for comparison: 1) one table without stash, 2) one table with stash, and 3) two tables with stash. Trace-driven simulations using CAIDA traces [29] are conducted to observe the table and stash load factors. The CAIDA traces were collected on a 10 Gbps Ethernet link of a Tier 1 ISP. The trace comprises 30.8 million packets and 1.4 million distinct 5-tuple flows over a duration of 60s. The observation interval is set to 15s in the
simulation. In this interval, the number of distinct 5-tuple flows ranges from 445k to 510k.
We first show that, by using a 5-universal hash function on one hash table without the stash scheme, most of the flows have a probe length less than or equal to four. Using a 512K-entry table with 5s and 10s observation intervals, the percentages of flows that have a probe length less than or equal to four are 98.97% and 92.99%, respectively. With an observation interval of 15s, 84.1% of the flows have a probe length less than or equal to four. The maximum probe length of 350 is observed for only one flow. After all flows were inserted into the table, the load factor is 85.13%. The probe length per flow key for different observation intervals is shown in Fig. 2a. The table load factor for different observation intervals and table sizes is shown in Fig. 2b.
For the one-table-with-stash scheme, the maximum probe length is limited to four. Fig. 3a shows the table and stash load factors for different observation intervals. In the 15s interval for the one-table-with-stash scheme, the table and stash load factors are 78.16% and 51%, respectively. The percentage of flows that are stored in the table and the stash is shown in Fig. 3c. As shown, about 86% of flows are stored in the table and the remaining 14% (≈ 66k flows) are placed into the stash. The stash therefore needs to accommodate more than 64K entries.
The two-tables-with-stash scheme (256K entries each) can be used for better table occupancy with fewer entries in the stash, as shown in Fig. 3b. In terms of flows, 92.6% of flows are stored in the two tables and only 7.4% (≈ 35k flows) are placed into the stash, as shown in Fig. 3d. These results demonstrate that by using two tables, the table occupancy increases while maintaining the same total table size of 512K entries, but with fewer than 64K entries for the stash. Therefore, compared to the one-table-with-stash scheme, this scheme can store 7% more flows in the tables and reduces the number of stash entries by 46.66%. The same trend is observed for the other observation interval. The average time per update (hash computation, lookup, and insertion) for 30.8 million keys is 1.43 µs on a 2.8 GHz AMD A8-3850 APU with 256KB L1 cache and 4MB L2 cache.

Fig. 2. (a): Number of flows vs. probe length for different observation intervals. The number of flows per interval: 5s = 208k, 10s = 346k, 15s = 446k. For the 10s and 15s observation intervals, the figure only shows the number of flows that have a probe length up to 24. (b): Table load factor using different table sizes for three different observation interval lengths.

Fig. 3. (a) and (b): Average table(s) and stash load factors. (c) and (d): Average number of flows (in percentage) in the table(s) and stash for different observation intervals. Probe length ≤ 4. The average numbers of flows in the 10s and 15s intervals are 349.3k and 476.5k, respectively.

IV. HARDWARE IMPLEMENTATION

Fig. 4. Block diagram of the hardware prototype.
The flow counter table hardware prototype is implemented on the NetFPGA-10G [30] platform, which contains a Virtex-5 XC5VTX240T FPGA, four 10Gb Ethernet SFP+ interfaces, three banks of 9 MB QDRII SRAM with a total capacity of 27 MB, and four 72 MB RLDRAM II memories. The board is hosted on a commodity workstation and attached to a PCIe slot. The prototype block diagram is shown in Fig. 4. The four 10Gb Ethernet ports, each with a 64-bit data path, are connected to four ingress queues which collect the incoming packets from the network. The input arbiter module takes packets from the four ingress queues using a round-robin policy and sends them to the next module in the data path. The output of the input arbiter module is a 256-bit data bus. The field extractor module extracts the 104-bit 5-tuple and the 16-bit packet length from each packet and sends them to the hash function module.
A. Hash module
This module utilizes a scheme that provides a collision-free mapping with high probability [23]. The scheme processes the 104-bit 5-tuple input to obtain a 32-bit flow ID using 2-universal tabulation-based hashing (TabChar). TabChar computes a hash value using table lookups and XOR operations, so the operation is fast and takes only one clock cycle. A small amount of on-chip memory is used to store the lookup tables. The 32-bit flow ID is then hashed again using a 5-universal hash function, which is computed as a degree-four polynomial over a prime field as follows:
h(x) = ( Σ_{i=0}^{4} a_i x^i ) mod p    (1)
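A minimal software sketch of the two hashing stages is shown below. It assumes a Mersenne prime p = 2^61 − 1 and randomly chosen character tables and coefficients; it follows the structure of tabulation hashing (table lookups and XORs) and of the degree-four polynomial in (1), not the exact constants, widths, or pipelining of the FPGA design.

```python
import random

rng = random.Random(42)
P = (1 << 61) - 1          # Mersenne prime assumed for the modular reduction
NUM_CHARS = 13             # 104-bit 5-tuple split into thirteen 8-bit characters

# TabChar-style tabulation hash: one random lookup table per character position.
char_tables = [[rng.getrandbits(32) for _ in range(256)] for _ in range(NUM_CHARS)]

def tabchar(five_tuple_104bit):
    """Tabulation hash: 104-bit key -> 32-bit flow ID via table lookups and XORs."""
    h = 0
    for i in range(NUM_CHARS):
        c = (five_tuple_104bit >> (8 * i)) & 0xFF
        h ^= char_tables[i][c]
    return h

# Degree-four polynomial hash, as in (1): h(x) = (sum_{i=0}^{4} a_i x^i) mod p.
coeffs = [rng.randrange(P) for _ in range(5)]   # a_0 .. a_4

def five_universal(x):
    acc = 0
    for a in reversed(coeffs):   # Horner's rule: (((a4*x + a3)*x + a2)*x + a1)*x + a0
        acc = (acc * x + a) % P
    return acc

flow_id = tabchar(0x0A000001_C0A80107_01BB_CB94_06)  # example packed 5-tuple
print(five_universal(flow_id))
```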
The 5-universal hash function is implemented in a pipeline fashion with seven embedded multipliers. The logic resources consumed by the TabChar and 5-universal hash are shown in Table I.

TABLE I
LOGIC RESOURCE CONSUMPTION FOR HASH FUNCTIONS

Hash Function    # of Registers    # of LUTs        # of BRAMs
TabChar          32 of 149760      97 of 149760     7 of 324
5-Univ Mult.     613 of 149760     984 of 149760    0 of 324
B. Read/Write Control module
Updating counters in the table requires a read-modify-write operation for each incoming packet. The Read/Write Control module manages the memory read and write processes. It also takes care of the read-after-write hazard that may occur during the counter update process. A read-after-write hazard occurs if an input ID enters the pipeline before the previous update on the same input has completed. The circuit compares the current input and address to the previous inputs and addresses (hash values), then updates the counters accordingly to ensure that the values written to memory are correct. In the READ state, the current input ID and its address are compared to the previous data stored in shift registers. If a match is found, the corresponding input ID's most recent data from the shift register are used. Otherwise, a memory read is issued to get the previous data of the current input ID from memory. The packet and byte counts are then updated and written back to memory in the WRITE state. This module also handles the register read/write process to communicate with software. The register infrastructure on the NetFPGA-10G provides access for software running on the host CPU to read data from hardware via the PCIe interface.
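The hazard handling can be summarized by the following behavioral sketch, which assumes a small fixed pipeline depth: when the incoming address matches one whose updated value is still in flight, the newest in-flight value is forwarded instead of the (stale) memory contents. It illustrates the forwarding idea rather than the actual RTL.

```python
from collections import deque

class HazardForwardingModel:
    """Behavioral model of read-after-write forwarding in the counter update pipeline."""
    def __init__(self, pipeline_depth=4):
        self.memory = {}                               # address -> (pkt_count, byte_count)
        self.in_flight = deque(maxlen=pipeline_depth)  # recent (address, value) writes

    def read(self, addr):
        # READ state: prefer the newest in-flight value for this address (forwarding).
        for a, value in reversed(self.in_flight):
            if a == addr:
                return value
        return self.memory.get(addr, (0, 0))           # otherwise issue a memory read

    def update(self, addr, pkt_len):
        pkts, nbytes = self.read(addr)
        new_value = (pkts + 1, nbytes + pkt_len)
        self.in_flight.append((addr, new_value))       # value enters the write pipeline
        self.memory[addr] = new_value                  # WRITE state (modeled immediately)

m = HazardForwardingModel()
m.update(7, 64)
m.update(7, 1500)    # back-to-back update to the same address: forwarded, not stale
print(m.read(7))     # (2, 1564)
```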
C. Flow Counter Table
Three QDRII SRAMs are used to store the flow counter table. Hash collisions are handled by comparing the input key to multiple entries in a bucket in parallel. The mechanism makes use of the four-word burst access of the QDRII SRAM. A read burst can retrieve four 36-bit words (144 bits) from one QDRII SRAM chip in only five memory clock cycles. Therefore, four IDs can be read and compared in parallel to the input ID with very short latency. If the input is a new ID, a flow entry is created and its counters are updated. Otherwise, only the counters are updated and then written back to memory.
Due to the resource restrictions of the NetFPGA-10G platform, only the one-table-without-stash scheme is implemented for fast prototyping. A 108-bit flow entry is stored across three banks of QDRII SRAM. It consists of a 16-bit ID, a 32-bit packet counter, and a 60-bit byte counter. The 16-bit ID is obtained by hashing the 32-bit flow ID (i.e., the result from TabChar) using a 2-universal hash function. The flow entry layout is shown in Fig. 5. The three QDRII SRAMs can accommodate up to 2 million flow entries.
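The 108-bit entry can be packed into three 36-bit words, one per SRAM bank, as in the sketch below; the field ordering inside the entry is an assumption made for illustration, since only the field widths are specified above.

```python
def pack_entry(flow_id16, pkt_count32, byte_count60):
    """Pack a 16-bit ID, 32-bit packet count, and 60-bit byte count into three 36-bit words.

    The field order (ID in the most significant bits) is assumed for illustration.
    """
    assert flow_id16 < (1 << 16) and pkt_count32 < (1 << 32) and byte_count60 < (1 << 60)
    entry = (flow_id16 << 92) | (pkt_count32 << 60) | byte_count60   # 108-bit value
    return [(entry >> shift) & ((1 << 36) - 1) for shift in (72, 36, 0)]

def unpack_entry(words):
    entry = (words[0] << 72) | (words[1] << 36) | words[2]
    return entry >> 92, (entry >> 60) & 0xFFFFFFFF, entry & ((1 << 60) - 1)

words = pack_entry(0xBEEF, 1000, 1_500_000)
print([hex(w) for w in words], unpack_entry(words))
```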
Fig. 5. Flow entry layout in QDRII SRAM. The figure shows eight different flow entries represented by different colors. A 108-bit flow entry is stored as 36-bit words across the three banks of QDRII SRAM.

D. Synthesis Result and Evaluation
TABLE II
LOGIC RESOURCES UTILIZATION

Resources          Utilization          %
# of Registers     61362 of 149760      40%
# of LUTs          57110 of 149760      38%
# of Block RAMs    144 of 324           44%
The hardware implementation has been synthesized using Xilinx ISE/EDK design tools and evaluated by sending packets to the four 10GbE interfaces. The system is able to
process one packet in two clock cycles at 210 MHz. Therefore,
it can process 105 million packets per second. Assuming the worst-case scenario of a minimum Ethernet frame size of 64 bytes, the throughput is 53 Gbps. The logic resource utilization for the whole system is shown in Table II. The prototype requires less than 50% of the FPGA logic resources.
These results show that our scheme is feasible for hardware
implementation.
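For reference, the throughput figure is consistent with simple arithmetic: 210 MHz divided by two clock cycles per packet gives 105 × 10^6 packets per second, and 105 × 10^6 packets/s × 512 bits per 64-byte frame ≈ 53.8 Gbps, in line with the 53 Gbps reported above.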
V. DISCUSSION
A. Limitations and Scalability
The current implementation of the system is limited to processing IPv4 flows. With 108 bits for each flow entry, approximately 2 million flow entries can be accommodated in the 27 MB of QDRII SRAM. Nevertheless, the flow counter table in the proposed system is scalable to store more than 2 million flow entries. The number of flow entries that can be stored in the flow counter table depends mainly on the SRAM size. Moreover, the SRAM space required for the flow counter table depends on the total width of the fields holding the flow ID, packet count, and byte count. More flow entries can be accommodated by using newer SRAM architectures such as larger QDR-IV SRAM. In order to support IPv6 flow monitoring, both the header parser and hash function modules need to be modified to handle the IPv6 header format and its flow label. For IPv6 flows, the 5-tuple is longer than the 104 bits of IPv4. However, this will only affect the DRAM space requirement because, in the system, the 5-tuple flow key is stored in DRAM. Since the available DRAM space is usually larger than the SRAM (e.g., the newer NetFPGA SUME board has 8 GB of DDR3 DRAM), storing IPv6 flows is straightforward. Compared to IPv4, processing IPv6 flows may take more time (clock cycles), affecting the system throughput.
B. Table with Stash Implementation
Implementing the table-with-stash scheme in hardware requires more logic and memory resources, since d hash functions, d tables, and a small table for the stash are required. The stash is utilized to store flows that cannot be accommodated in the table (i.e., those flows whose probe length is more than four in our implementation). The stash can be implemented using a content-addressable memory (CAM) in hardware. While designing a high-performance CAM on an FPGA is challenging, another issue is that the CAM size is limited and may only store a small number of flow entries. Assuming that we need to store up to 64K flow entries, the CAM size needed is 64K × 108 bits, which is approximately 7 Mbit in total. A larger stash can be implemented using a data structure such as a binary search tree (BST) based on the on-chip Block RAM in the FPGA. However, the Block RAM is also used by other modules (hash functions, Bloom filter, etc.) in the design. Therefore, a more detailed design consideration is needed. As mentioned previously, due to the resource restrictions of the NetFPGA-10G platform, only the one-table-without-stash scheme is implemented. The hardware implementation of the table-with-stash scheme will be addressed in future work.

VI. MONITORING APPLICATIONS
Monitoring applications that utilize sketch data structures can take advantage of our flow-level data collection infrastructure. In this section, stream-based entropy estimation and superspreader detection are presented based on the proposed infrastructure.
Entropy estimation: Entropy is a measure of the randomness or information content in the distribution of a random variable. It has been used in network traffic anomaly detection [9], [31]. The entropy value is zero if all items in the traffic are the same, and is maximum if all items in the traffic are distinct. The simulation of entropy norm estimation is conducted based on the contents provided by the flow counter table and the flow key table. The results are compared to the exact entropy norms. The accuracy is measured in terms of relative error. Fig. 6a shows the experimental results of entropy norm estimation for 5-tuple flows. Different Bloom filter sizes are used in the experiments. As shown, the application yields estimation errors in an acceptable range for anomaly detection. The estimation accuracy increases with the size of the Bloom filter. For Bloom filter sizes greater than 256KB, the estimation errors are less than 1%.
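As a reference point for what the application computes, the following minimal sketch derives the entropy norm and the empirical entropy directly from per-flow packet counts (as would be exported by the flow counter table); the deployed application instead estimates the entropy norm, so this exact computation is shown only for clarity.

```python
import math

def entropy_norm(flow_counts):
    """Entropy norm: sum of n_i * log2(n_i) over all flows with packet count n_i."""
    return sum(n * math.log2(n) for n in flow_counts if n > 0)

def empirical_entropy(flow_counts):
    """Empirical entropy H = log2(S) - (1/S) * sum n_i log2 n_i, where S = total packets.

    H is 0 when a single flow carries all packets and log2(S) when every packet is distinct.
    """
    s = sum(flow_counts)
    return math.log2(s) - entropy_norm(flow_counts) / s

counts = [5, 3, 3, 1, 1, 1]   # packet counts per flow in one observation interval
print(round(entropy_norm(counts), 3), round(empirical_entropy(counts), 3))
```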
Superspreader detection: The goal of superspreader detection is to find source IPs that contact a large number of destination IPs. In this case, the application can utilize the source-destination IP pair data from the flow key table. Here, a superspreader is defined as a source whose fan-out exceeds 200. The false negative ratio (FNR) and false positive ratio (FPR) are used to measure the superspreader identification accuracy, while the fan-out estimation accuracy is obtained from a scatter plot. The FNR is defined as the number of superspreaders that are not identified divided by the number of actual superspreaders, while the FPR is the number of non-superspreaders that are incorrectly identified as superspreaders
divided by the number of actual superspreaders. To identify the superspreaders and count their fan-outs, the application maintains two hash tables. One hash table is used to remove duplicate source-destination IP pairs, and the other hash table is used for counting the fan-out. This method is traditionally used for exact fan-out counting and is very accurate. The experimental results show that there are no false positives (i.e., the FPR values are zero) in all observation intervals, while the FNR values in both the second and third observation intervals are 0.01. The FNR is due to false positives of the Bloom filter when the 5-tuple is added to the flow key table. The accuracy of the fan-out estimation is shown in Fig. 6b. The closer the points are to the diagonal line, the more accurate the estimation. The average error in the fan-out estimation is 0.11%.

Fig. 6. (a): Entropy estimation results of 5-tuple flows for different Bloom filter sizes. (b): Exact vs. estimated fan-out in Interval 4. The number of actual superspreaders in this interval is 76.
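The two-hash-table method can be modeled in a few lines, with Python sets and dicts standing in for the actual hash tables: one table removes duplicate source-destination IP pairs, the other counts each source's fan-out, and sources whose fan-out exceeds the threshold are reported.

```python
def detect_superspreaders(src_dst_pairs, threshold=200):
    """Two-table fan-out counting: dedup (src, dst) pairs, then count distinct dsts per src."""
    seen_pairs = set()   # table 1: removes duplicate source-destination IP pairs
    fanout = {}          # table 2: fan-out counter per source IP
    for src, dst in src_dst_pairs:
        if (src, dst) in seen_pairs:
            continue
        seen_pairs.add((src, dst))
        fanout[src] = fanout.get(src, 0) + 1
    return {src: n for src, n in fanout.items() if n > threshold}

pairs = [("10.0.0.9", f"192.168.{i // 256}.{i % 256}") for i in range(300)]
pairs += [("10.0.0.1", "192.168.0.1")] * 5
print(detect_superspreaders(pairs))   # {'10.0.0.9': 300}
```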
VII. RELATED WORK
Several works have proposed a universal, minimalist, and flexible network flow monitoring system. In [11], Sekar et al. proposed a "minimalist" approach for flow monitoring. The flow-level data is collected by using a combination of flow sampling (FS), sample-and-hold (SH), and coordinated sampling (cSamp). The flow counters stored in the SRAM are divided between FS and SH. The goal of the "minimalist" approach is to minimize the complexity and resource requirements of flow monitoring implementations on network devices by using simpler flow-level data collection. Rather than each monitoring application possessing its own counters, the counter design in the "minimalist" approach can be used by all monitoring applications. Based on a trace-driven evaluation, the proposed scheme achieves a level of accuracy similar to that of application-specific approaches with the same resources. However, they only provide assumptions and justifications regarding hardware implementation feasibility, processing requirements, and memory consumption, without a real implementation.
In line with [11], Liu et al. [12] developed a universal monitoring architecture (UnivMon), where a simple and generic monitoring primitive runs on network devices. Using this simple primitive's data structure, highly accurate results for a broad spectrum of monitoring tasks can be obtained. The results are expected to be at least equivalent to, or better than, approaches that use custom sketches for each monitoring task. The UnivMon data plane utilizes parallel Count sketches [32] to collect flow-level data. The feasibility of a hardware implementation is not discussed in the paper.
OpenSketch [10] uses a hash-based measurement data plane and sketch data structures to count flows. The goal of OpenSketch is to provide a generic and efficient means of measuring network traffic by separating the measurement control plane from the data plane functions. It supports customized flow-level data collection such that the flows to collect and measure are selected using TCAM-based classification. After an observation interval, the sketch counters stored in SRAM are sent to the controller (the control plane, where the measurement libraries and monitoring applications reside) for further analysis. Compared to our flow counter table design, in OpenSketch the SRAM is divided into a list of logical tables because different sketches require different numbers and sizes of counters. This approach makes the counter access mechanism (addressing) more complex.
VIII. CONCLUSION AND FUTURE WORK
This work proposes a hardware-accelerated infrastructure which is capable of supporting various sketch-based network traffic measurement tasks with great flexibility. The implementation details of the flow counter table, based on a hash table with linear probing, are described. The proposed flow counter table has been evaluated in software simulation with a real-world traffic trace for network traffic entropy estimation and superspreader detection. The system is implemented on the NetFPGA-10G platform and is capable of processing network traffic at 53 Gbps in the worst-case scenario of 64-byte minimum-sized Ethernet frames. For future work, we plan to evaluate the table-with-stash scheme in hardware and to further optimize the hardware design to achieve higher performance.
ACKNOWLEDGMENT
This research was funded in part by the National Science Council, Taiwan, under contract numbers MOST 104-2221-E-033-007 and MOST 103-2221-E-033-030.
REFERENCES
[1] R. Hofstede, P. Celeda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, and A. Pras,
“Flow Monitoring Explained: From Packet Capture to Data Analysis With NetFlow
and IPFIX,” IEEE Communications Surveys Tutorials, vol. 16, no. 4, pp. 2037–
2064, 2014.
[2] “Cisco NetFlow,” http://www.cisco.com.
[3] J. Mai, C.-N. Chuah, A. Sridharan, T. Ye, and H. Zang, “Is sampled data sufficient
for anomaly detection?” in Proc. of the 6th ACM SIGCOMM Conference on
Internet Measurement, IMC ’06. New York, NY, USA: ACM, 2006, pp. 165–176.
[4] D. Brauckhoff, B. Tellenbach, A. Wagner, M. May, and A. Lakhina, “Impact
of packet sampling on anomaly detection metrics,” in Proc. of the 6th ACM
SIGCOMM Conference on Internet Measurement, IMC ’06. New York, NY,
USA: ACM, 2006, pp. 159–164.
[5] C. Estan and G. Varghese, “New directions in traffic measurement and accounting,”
in Proc. of the 2002 Conference on Applications, Technologies, Architectures, and
Protocols for Computer Communications, SIGCOMM ’02. New York, NY, USA:
ACM, 2002, pp. 323–336.
[6] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-based change detection:
Methods, evaluation, and applications,” in Proc. of the 3rd ACM SIGCOMM
Conference on Internet Measurement, IMC ’03. ACM, October 2003, pp. 234–
247.
[7] A. Kumar, M. Sung, J. J. Xu, and J. Wang, “Data streaming algorithms for efficient
and accurate estimation of flow size distribution,” in Proc. of the Joint International
Conference on Measurement and Modeling of Computer Systems, SIGMETRICS
’04/Performance ’04. New York, NY, USA: ACM, 2004, pp. 177–188.
[8] Q. Zhao, A. Kumar, and J. Xu, “Joint data streaming and sampling techniques for
detection of super sources and destinations,” in Proc. of the 5th ACM SIGCOMM
Conference on Internet Measurement, IMC ’05. Berkeley, CA, USA: USENIX
Association, 2005, pp. 77–90.
[9] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang, “Data streaming algorithms for
estimating entropy of network traffic,” SIGMETRICS Perform. Eval. Rev., vol. 34,
no. 1, pp. 145–156, 2006.
[10] M. Yu, L. Jose, and R. Miao, “Software defined traffic measurement with
OpenSketch,” in Proc. of the 10th USENIX Conference on Networked Systems
Design and Implementation, NSDI’13. Berkeley, CA, USA: USENIX Association,
2013, pp. 29–42.
[11] V. Sekar, M. K. Reiter, and H. Zhang, “Revisiting the case for a minimalist
approach for network flow monitoring,” in Proc. of the 10th ACM SIGCOMM
Conference on Internet Measurement, IMC ’10. New York, NY, USA: ACM,
2010, pp. 328–341.
[12] Z. Liu, G. Vorsanger, V. Braverman, and V. Sekar, “Enabling a ”RISC” Approach
for Software-Defined Monitoring Using Universal Streaming,” in Proc. of the 14th
ACM Workshop on Hot Topics in Networks, HotNets-XIV. New York, NY, USA:
ACM, 2015, pp. 21:1–21:7.
[13] T. Wellem, Y.-K. Lai, and W.-Y. Chung, “A Software Defined Sketch System for
Traffic Monitoring,” in Proc. of the 11th ACM/IEEE Symposium on Architectures
for Networking and Communications Systems, ANCS ’15. Oakland, CA, USA:
IEEE Computer Society, 2015, pp. 197–198.
[14] D. Shah, S. Iyer, B. Prahhakar, and N. McKeown, “Maintaining statistics counters
in router line cards,” IEEE Micro, vol. 22, no. 1, pp. 76–81, Jan. 2002.
[15] S. Ramabhadran and G. Varghese, “Efficient Implementation of a Statistics Counter
Architecture,” in Proc. of the 2003 ACM SIGMETRICS International Conference
on Measurement and Modeling of Computer Systems, SIGMETRICS ’03. New
York, NY, USA: ACM, 2003, pp. 261–271.
[16] Q. Zhao, J. Xu, and Z. Liu, “Design of a Novel Statistics Counter Architecture
with Optimal Space and Time Efficiency,” in Proc. of the Joint International
Conference on Measurement and Modeling of Computer Systems, SIGMETRICS
’06/Performance ’06. New York, NY, USA: ACM, 2006, pp. 323–334.
[17] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani, “Counter
braids: a novel counter architecture for per-flow measurement,” in Proc. of the
2008 ACM SIGMETRICS international conference on Measurement and modeling
of computer systems, SIGMETRICS ’08. New York, NY, USA: ACM, 2008, pp.
121–132.
[18] N. Hua, J. Xu, B. Lin, and H. Zhao, “BRICK: A Novel Exact Active Statistics
Counter Architecture,” IEEE/ACM Transactions on Networking, vol. 19, no. 3, pp.
670–682, Jun. 2011.
[19] T. Li, S. Chen, and Y. Ling, “Per-flow Traffic Measurement Through Randomized
Counter Sharing,” IEEE/ACM Trans. Netw., vol. 20, no. 5, pp. 1622–1634, Oct.
2012.
[20] C. Hu, B. Liu, H. Zhao, K. Chen, Y. Chen, Y. Cheng, and H. Wu, “Discount
Counting for Fast Flow Statistics on Flow Size and Flow Volume,” IEEE/ACM
Transactions on Networking, vol. 22, no. 3, pp. 970–981, Jun. 2014.
[21] A. Pagh and R. Pagh, “Linear probing with constant independence,” in Proc.
of the 39th annual ACM symposium on Theory of computing, STOC ’07. ACM
Press, 2007, pp. 318–327.
[22] M. Patrascu and M. Thorup, “On the k-independence required by linear probing
and minwise independence,” in Proc. of the 37th International Colloquium
Conference on Automata, Languages and Programming, ICALP’10.
Berlin,
Heidelberg: Springer-Verlag, 2010, pp. 715–726.
[23] M. Thorup and Y. Zhang, “Tabulation based 5-universal hashing and linear probing,” in Proc. of the 12th Workshop on Algorithm Engineering and Experiments
ALENEX, 2010, pp. 62–76.
[24] A. Kirsch and M. Mitzenmacher, “The power of one move: Hashing schemes for
hardware,” IEEE/ACM Trans. Netw., vol. 18, no. 6, pp. 1752–1765, Dec. 2010.
[25] Y.-K. Lai, T. Wellem, and H.-P. You, “Hardware-assisted estimation of entropy
norm for high-speed network traffic,” Electronics Letters, vol. 50, no. 24, pp.
1845–1847, 2014.
[26] G. Cormode and S. Muthukrishnan, “An improved data stream summary: The
count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp.
58–75, April 2005.
[27] T. Wellem, G.-W. Li, and Y.-K. Lai, “Superspreader Detection System on NetFPGA
Platform,” in Proc. of the Tenth ACM/IEEE Symposium on Architectures for
Networking and Communications Systems, ANCS ’14. New York, NY, USA:
ACM, 2014, pp. 247–248.
[28] M. Yoon, T. Li, S. Chen, and J.-K. Peir, “Fit a Compact Spread Estimator in Small
High-Speed Memory,” IEEE/ACM Transactions on Networking, vol. 19, no. 5, pp.
1253–1264, Oct. 2011.
[29] “CAIDA UCSD Anonymized Internet Traces,” 2012.
[30] “NetFPGA project site. http://netfpga.org.”
[31] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature
distributions,” SIGCOMM Comput. Commun. Rev., vol. 35, no. 4, pp. 217–228,
2005.
[32] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data
streams,” in Proc. of the 29th International Colloquium on Automata, Languages
and Programming. Springer-Verlag, July 2002, pp. 693–703.
A Hardware-accelerated Infrastructure for Flexible
Sketch-based Network Traffic Monitoring
Theophilus Wellem∗‡ , Yu-Kuen Lai† , Chao-Yuan Huang† , and Wen-Yaw Chung∗
∗ Department of Electronic Engineering, † Department of Electrical Engineering
Chung Yuan Christian University, Zhongli 32023, Taiwan
Email: {g10202604, ylai, g10279008, eldanny}@cycu.edu.tw
‡ Department of Informatics
Satya Wacana Christian University, Salatiga 50711, Indonesia
Email: [email protected]
Abstract—Sketch-based data streaming algorithms are used in
many network traffic monitoring applications to obtain accurate
estimates of traffic flow. However, the flexibility is limited as hardware implementation of sketch counters may not be re-used for
different measurement tasks. In this paper, we develop a generic
hardware infrastructure for collecting flow statistics. The purpose
is to achieve the goal of adopting various sketch-based algorithms
with arbitrary flow aggregations for monitoring applications and
measurement tasks in a flexible manner. Multiple-choice hashing
with linear probing scheme is utilized for high-speed counter
update process. Simulation results based on real traffic traces for
monitoring applications are presented. The proposed hardware
infrastructure is implemented on the NetFPGA-10G platform.
The system is capable of processing network traffic at 53 Gbps in
a worst-case scenario of 64-byte minimum-sized Ethernet frame.
Index Terms—Network traffic monitoring; Sketch; 5-universal
hash; Hash table; Linear probing; NetFPGA-10G;
I. I NTRODUCTION
Flow-based traffic monitoring is important for daily network
operations. The measurement of flow-level statistics such as
flow size distribution and active flow count can provide information for the purpose of bandwidth usage, capacity planning,
accounting, and anomaly detection [1]. NetFlow [2] is the
standard tool used by network devices for creating and exporting flow information. Per-flow traffic measurement needs to
maintain a counter for each flow during an observation period.
Due to high traffic volume and number of flows in high-speed
links (10 Gbps and 40/100 Gbps), packets are sampled and
aggregated into flows in NetFlow. However, several studies
have shown that packet sampling is not sufficient for finegrained monitoring applications [3], [4].
Alternatively, sketch-based data streaming algorithms are
widely used in many high-speed network monitoring applications (e.g. heavy hitters detection [5], change detection
[6], flow size distribution [7], superspreader detection [8],
entropy estimation [9], [25]). These algorithms use sketch
data structure to summarize high-speed network traffic into
a compact representation without packet sampling. With only
a small amount of memory space, the sketch is capable of
summarizing network traffic by updating its counters for each
incoming packet at wire speed. Since the update process is the
critical bottleneck, the sketch data structure is implemented by
utilizing fast memory (SRAM) of the network device line card.
978-1-4799-8950-8/16/$31.00 ©2016 IEEE
Sketch-based data streaming algorithms are efficient and
have provable memory-accuracy tradeoffs for traffic flow
measurement. Accurate estimation result of a specific traffic
measurement task can be obtained by processing the sketch
data structure. However, the sketch-based methods are too
application-specific and lack of generality to be implemented
as primitive operations on network devices [10], [11]. Sketchbased methods require a separate instance of sketch data
structure for each flow key, which is defined according to
the measurement tasks of interest. Therefore, the flexibility
of using the same sketch for various monitoring applications
is limited once it is implemented directly in hardware.
Motivated by the previous works of a ”minimalist” approach
for flow monitoring system [11], [12] and OpenSketch [10],
this paper extends our preliminary work [13] to create generic
hardware infrastructure for a flexible sketch-based traffic
monitoring system. The FPGA based fast data plane proposed
in [13] is designed for flow-level data collection without packet
sampling. It consists of two tables: flow counter table and flow
key table. The flow counter table maintains packet counts and
byte counts, while the flow key table stores 5-tuple keys. The
contents in these tables are sent to host CPUs and processed.
Therefore, as required by different monitoring applications,
it is possible to update various sketch data structures with
flexible selection of flow keys.
In this work, we present the infrastructure design of a flow
counter table supporting greater flexibility of utilizing sketchbased algorithms for traffic monitoring. We need to emphasize
that, in contrast to many works on counter architectures [14]–
[20], the flow counter table implementation in this work is
neither targeted for per-flow traffic measurement nor tried to
minimize the required SRAM size. Instead, it aims to support
the flexibility of sketch-based monitoring applications. Due to
the requirement of updating flow counters in wire speed, hash
table with linear probing for collision resolution is used for
the table implementation. Linear probing is chosen because of
the simplicity in implementation, supporting very fast lookup
and insertion with burst operations in QDRII SRAM. Recent
works showed that using a 5-universal hash function, linear
probing can achieve expected performance [21], [22]. The
average number of probes per update is a constant of no more
than four [23]. Furthermore, multiple-choice hashing scheme
162
2016 IEEE 17th International Conference on High Performance Switching and Routing
with stash [24] is implemented to improve the table utilization.
The main contributions of this work are:
• We propose a flow counter table scheme by using hash
table with linear probing supporting various sketch-based
algorithms for traffic monitoring.
• We present the design detail and performance evaluation
of the FPGA prototype. The system is capable of supporting up to 2 million flow entries and achieving the
throughput of 53 Gbps in the worst-case scenario of 64byte minimum-sized Ethernet frame.
The rest of this paper is organized as follows. Sketch data
structure and its applications for network traffic monitoring
are outlined in Section II. Section III presents the system
architecture for a flexible sketch-based network traffic monitoring. This section also describes the flow counter table
implementation and its evaluation in software. Trace-driven
simulation results by using real traffic trace are presented.
In Section IV, hardware infrastructure of the flow counter
table and its evaluation are explained. Section V discusses
the system’s limitations, scalability, and some design issues
for implementing table with stash in hardware. Section VI
gives the example of applications that can utilize the hardware
infrastructure designed in Section IV. Some related works
are briefly reviewed in Section VII. Finally, Section VIII
concludes this paper.
II. BACKGROUND
Sketch is a data structure representing the compact summaries of a data stream. It can be queried to obtain statistics
about the data stream. A sketch can be arranged as an array
of counters C [i] [j], 1 ≤ i ≤ d, 1 ≤ j ≤ w. It consists of
d hash tables of size w. Each row is associated to a hash
function selected from a universal class of hash functions.
When an item a = (k, v) arrives, the key k is hashed by d
hash functions and the corresponding counters are updated by
v. A query of the sketch can yield the estimated counts for the
key k. For example, query of a key to Count-Min sketch [26]
returns the minimum value among all counters that correspond
to that key. One of the important properties of sketch data
structure is its linearity. Therefore, allowing sketches from
different observation intervals be combined using arithmetic
operations.
To use sketch for traffic measurement, the sketch data
structure must be updated based on a pre-defined flow key,
which is a combination of tuples from the packet header.
The update value can be the packet size or packet count. For
example, the heavy hitters detection application uses source
IP address and the packet count as its key and updated value,
respectively. In change detection application [6], the sketch
is updated using the source IP address and packet size. In
superspreader detection applications [8], [27], [28], the update
process sets the bit in the bitmap sketch based on source IP
address and pair of source and destination IP addresses.
III. S YSTEM A RCHITECTURE
A. Design Overview
The proposed system for flexible implementation of sketchbased traffic monitoring is presented in Fig. 1. The upper part
Monitoring Applications
(Flow monitoring and statistics functions )
Monitoring App. 1
(Entropy estimation )
Monitoring App. 2
(Superspreader
detection)
...
Monitoring App. n
(Change detection )
Query /Estimate
Sketches
...
Sketches
Sketches
Update
Traffic
Header Parser
(Field Extractor)
Hash
Function
Buffer/
Controller
Buffer/
Controller
Flow counter
Table
Bloom Filter
Flow key
Table
Data plane
Fig. 1. System architecture.
is implemented in software on multi-core CPU for parallel
processing of sketches, while the bottom data plane is implemented in FPGA hardware. As highlighted in Fig. 1, the
bottom part consists of two main tables. The flow counter table
stores the flow ID, packet count, and byte count. The flow key
table contains the 5-tuple flow information.
When a packet arrives, its 5-tuple (source IP, destination
IP, source port, destination port, and protocol) header values
are extracted as the flow key by the header parser module,
and hashed by the hash function module. It is important to
maintain the flow counter update in a fixed number of memory
cycles for line-rate packet processing. Therefore, the flow
counter table is constructed based on the scheme of hashing
with linear probing. The 5-universal hash function is used
to achieve an average constant number of probes per update
[23]. Leverage on the four-word burst access of the QDRII
SRAM, the shortest memory update latency can be reached.
A Bloom filter is utilized to check the presence of the flow
in the flow key table. If the flow does not exist in current
observation interval, then the 5-tuple information is added
to the RLDRAM-based flow key table in a first-in first-out
fashion. At the end of observation interval, the values in both
flow counter table and flow key table are moved to the host
CPU. The goal is to adopt various sketch-based algorithms
with arbitrary flow aggregations for monitoring applications
and measurement tasks in a flexible manner. We focus on the
implementation of the flow counter table part in this paper.
B. System Simulation and Evaluation
The proposed system model is constructed first in software
to evaluate the performance of flow counter update operations.
Three schemes with different number of tables are used for
comparison: 1) One table without stash, 2) One table with
stash, and 3) Two tables with stash. Trace-driven simulations
using CAIDA traces [29] are conducted to observe the table
and stash load factors. The CAIDA traces are collected over
a 10Gbps Ethernet link of a Tier 1 ISP. It comprises of 30.8
million packets and 1.4 million distinct 5-tuple flows with a
duration of 60s. The observation interval is set as 15s in the
163
2016 IEEE 17th International Conference on High Performance Switching and Routing
Probe length per update
Table size = 512K
Load factor
Table size = 512K vs. 1M
100
Interval 5s
Interval 10s
Interval 15s
150000
Load factor (%)
Number of flows
200000
Read
to CPU
Ingress
Queues
100000
50000
Interval 5s
Interval 10s
Interval 15s
80
60
PCIe
Input
Arbiter
40
Hash
R/W
Control
FSM
0
0
4
8
12
16
20
24
512K
Probe length
Table size
(a)
(b)
Load factor (1 Table + Stash)
Table size=512K
100
100
80
60
60
40
40
20
20
0
0
Int.10s
Table
Int.15s
Table 1
Stash
(a)
Int.10s
Table 2
Int.15s
Stash
(b)
Number of flows (1 Table + Stash)
Table size=512K
100
Number of flows (2 Tables + Stash)
Table size=2x256K
100
80
80
60
60
40
40
20
20
0
Table
Int.15s
Stash
(c)
QDR
SRAM 2
SRAM
Controller 3
QDR
SRAM 3
stash size can accommodate more than 64K entries.
Two tables (256K entries each) with stash scheme can be
used for better table occupancy with fewer entries in the
stash, as shown in Fig. 3b. In term of flows, 92.6% flows are
stored in both tables and only 7.4% (≈ 35k flows) are placed
into the stash as shown in Fig. 3d. These results demonstrate
that by using two tables, the table occupancy increases while
maintaining the same total table size of 512K entries, but
with less than 64K entries for stash size. Therefore, comparing
to the one table with stash, this scheme can store 7% more
flows in the table and reduce 46.66% entries in the stash The
same trend is observed for the other observation interval. The
average time per update (hash computation, lookup, insertion)
for 30.8 million keys is 1.43 µs on a 2.8 GHz AMD A8-3850
APU with 256KB L1 cache and 4MB L2 cache.
IV. H ARDWARE I MPLEMENTATION
0
Int.10s
SRAM
Controller 2
Fig. 4. Block diagram of the hardware prototype
Load factor (2 Tables + Stash)
Table size=2x256K
80
QDR
SRAM 1
1M
Fig. 2. (a): Number of flows vs. probe length for different observation
intervals. The number of flows in this interval: 5s=208k, 10s=346k, 15s=446k.
For 10s and 15s observation intervals, this figure only shows the number of
flows that have probe length up to 24. (b): Table load factor using different
table sizes for three different length of observation intervals.
Load factor (%)
Field
Extractor
20
0
Percentage (%)
FIFO
SRAM
Controller 1
Table 1
Int.10s
Table 2
Int.15s
Stash
(d)
Fig. 3. (a) and (b): Average table(s) and stash load factors, (c) and (d):
Average number of flows (in percentage) in table(s) and stash for different
observation intervals. Probe length ≤ 4. The average number of flows in 10s
and 15s intervals are 349.3k and 476.5k, respectively.
simulation. In this interval, the number of distinct 5-tuple flows
is ranged from 445k to 510k.
We first show that by using 5-universal hash function on
one hash table without stash scheme, most of the flows have
probe length less or equal to four. Using a 512K-entry table
with 5s and 10s observation intervals, the percentage of flows
that have probe length less or equal to four are 98.97% and
92.99%, respectively. With observation interval of 15s, 84.1%
of the flows have probe length less or equal to four. The
maximum probe length of 350 is observed for only one flow.
After all flows were inserted into the table, the load factor is
85.13%. Probe length per flow key for different observation
intervals is shown in Fig. 2a. The table load factor for different
observation intervals and table sizes is shown in Fig. 2b.
For one table with stash scheme, the maximum probe length
is limited to four. Fig. 3a shows the table and stash load
factors for different observation intervals. In 15s interval for
one table with stash scheme, the table and stash load factors
are 78.16% and 51%, respectively. The percentage of flows
that are stored in table and stash is shown in Fig. 3c. As
shown, about 86% of flows are stored in the table and the
remaining 14% (≈ 66k flows) are placed into the stash. The
The flow counter table hardware prototype is implemented
on the NetFPGA-10G [30] platform which contains a Virtex5 XC5VTX240T FPGA, four 10Gb Ethernet SFP+ interfaces,
three banks of 9 MB QDRII SRAMs with a total capacity of
27 MB, and four 72 MB RLDRAM II memories. The board
is hosted on a commodity workstation and attached to the
PCIe slot. The prototype block diagram is shown in Fig. 4.
The four 10Gb Ethernet ports, each has a 64-bit data path, are
connected to four ingress queues which collect the incoming
packets from network. The Input arbiter module takes packets
from the four ingress queues using a round robin policy and
sends it to the next module in the data path. The output of the
input arbiter module is a 256-bit data bus. The field extractor
module extracts the 104-bit 5-tuple and 16-bit packet length
from each packet and sends it to the hash function module.
A. Hash module
This module utilizes the scheme of collision-free mapping
with high probability [23]. The scheme processes the 104-bit
5-tuple input to obtain its 32-bit flow ID using a 2-universal
tabulation-based hashing (TabChar). TabChar hashing computes a hash value by using table lookups and XOR operations
so the operation is fast and takes only one clock cycle. Small
amount of on-chip memory space is used to store the lookup
tables. The 32-bit flow ID is then hashed again using a 5universal hash function. It is computed based on the four
degree polynomial over some prime fields as follows.
4
i
h (x) =
mod p
(1)
ai x
164
i=0
2016 IEEE 17th International Conference on High Performance Switching and Routing
TABLE I
L OGIC R ESOURCE C ONSUMPTION FOR H ASH F UNCTIONS
Hash Function
TabChar
5-Univ Mult.
# of Registers
32 of 149760
613 of 149760
# of LUTs
97 of 149760
984 of 149760
144 bits
# of BRAMs
7 of 324
0 of 324
SRAM 3
Addr.
0
1
B. Read/Write Control module
Updating counters in the table requires a read-modify-write
operation for each incoming packet. The Read/Write Control
module manages the memory read and write processes. It also
takes care of the read-after-write hazard that may occur during
counter update process. A read-after-write hazard occurs if an
input ID enters the pipeline before previous update on the same
input has been completed. The circuit compares the current
input and address to previous input and address (hash value),
then updates the counters accordingly to ensure that values
written to memory are correct. In READ state, the current
input ID and its address are compared to previous data stored
in shift registers. If a match is found, then the corresponding
input ID’s most recent data from the shift register are used.
Otherwise, a memory read is issued to get the previous data
of the current input ID from memory. The packet and byte
counts are then updated and written back to memory in
WRITE state. This module also handles the registers read/write
process to communicate with software. Register infrastructure
on NetFPGA-10G provides access to software running on host
CPU to read data from hardware via the PCIe interface.
C. Flow Counter Table
Three QDRII SRAMs are used to store the flow counter
table. Hash collision is handled by comparing the input key
to multiple entries in a bucket in parallel. The mechanism
makes use of the four-word burst access of the QDRII SRAM.
A read burst can retrieves four 36-bit words (144 bits) from
one QDRII SRAM chip in only five memory clock cycles.
Therefore, four IDs can be read and compared in parallel to the
input ID with very short latency. If the input is a new ID, a flow
entry is created and its counters are updated. Otherwise, only
the counters are updated and then written back to memory.
Due to the resource restriction in NetFPGA-10G platform,
only one table without stash scheme is implemented for fast
prototyping. A flow entry of 108-bit data is stored in three
banks of QDRII SRAM. It consists of 16-bit ID, 32-bit packet
counter, and 60-bit byte counter. The 16-bit ID is obtained by
hashing the 32-bit flow ID (i.e., the result from TabChar) using
a 2-universal hash function. The flow entries layout is shown
in Fig. 5. The three QDRII SRAMs can accommodate up to
2 million flow entries.
Fig. 5. Flow entries layout in QDRII SRAM. The figure shows eight different flow entries represented by different colors. Each 108-bit flow entry is stored as 36-bit words across the three banks of QDRII SRAM.

TABLE II
LOGIC RESOURCES UTILIZATION

Resources          Utilization          %
# of Registers     61362 of 149760      40%
# of LUTs          57110 of 149760      38%
# of Block RAMs    144 of 324           44%

D. Synthesis Result and Evaluation

The hardware implementation has been synthesized using Xilinx ISE/EDK design tools and evaluated by sending packets to the four 10GbE interfaces. The system is able to
process one packet in two clock cycles at 210 MHz. Therefore,
it can process 105 million packets per second. Assuming the worst-case scenario of a minimum-sized Ethernet frame of 64 bytes, the throughput is 53 Gbps. The logic resource
utilization for the whole system is shown in Table II. The
prototype requires less than 50% logic resources of the FPGA.
These results show that our scheme is feasible for hardware
implementation.
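For completeness, the quoted throughput follows directly from the clock rate and the per-packet cycle count:

\[
\frac{210\ \text{MHz}}{2\ \text{cycles/packet}} = 105\ \text{Mpps}, \qquad
105 \times 10^{6}\ \tfrac{\text{packets}}{\text{s}} \times 64\ \text{B} \times 8\ \tfrac{\text{bits}}{\text{B}} \approx 53.8\ \text{Gbps}.
\]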
V. DISCUSSION
A. Limitations and Scalability
The current implementation of the system is limited to processing IPv4 flows. With 108 bits per flow entry, approximately 2 million flow entries can be accommodated in the 27 MB of QDRII SRAM. Nevertheless, the flow counter table in the proposed system is scalable to store more than 2 million flow entries. The number of flow entries that can be stored in the flow counter table depends mainly on the SRAM size. Moreover, the SRAM space required for the flow counter table depends on the total width of the fields holding the flow ID, packet count, and byte count. More flow entries can be accommodated by using newer SRAM architectures such as larger QDR-IV SRAM devices. To support IPv6 flow monitoring, both the header parser and the hash function modules need to be modified to handle the IPv6 header format and its flow label. For IPv6 flows, the 5-tuple is longer than the 104 bits of IPv4. However, this affects only the DRAM space requirement, because the 5-tuple flow key is stored in DRAM. Since the available DRAM space is usually much larger than the SRAM (e.g., the newer NetFPGA SUME board has 8 GB of DDR3 DRAM), storing IPv6 flow keys is straightforward. Compared to IPv4, processing IPv6 flows may take more time (clock cycles), affecting the system throughput.
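For reference, the 2-million-entry figure follows from the entry width and the SRAM size (assuming the 27 MB is interpreted as 27 MiB):

\[
\frac{27 \times 2^{20}\ \text{B} \times 8\ \tfrac{\text{bits}}{\text{B}}}{108\ \text{bits/entry}} = 2{,}097{,}152 \approx 2.1 \times 10^{6}\ \text{entries}.
\]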
B. Table with Stash Implementation

Implementing the table-with-stash scheme in hardware requires more logic and memory resources, since d hash functions, d tables, and a small table for the stash are needed. The stash stores flows that cannot be accommodated in the table (i.e., flows whose probe length exceeds four in our implementation). The stash can be implemented using a content-addressable memory (CAM) in hardware. However, designing a high-performance CAM on an FPGA is challenging, and the CAM size is limited, so it can only hold a small number of flow entries. Assuming we need to store up to 64K flow entries, the required CAM size is 64K × 108 bits, approximately 7 Mbit in total. A larger stash can be implemented with a data structure such as a binary search tree (BST) in the on-chip Block RAM of the FPGA. However, the Block RAM is also used by other modules in the design (hash functions, Bloom filter, etc.), so more detailed design consideration is needed. As mentioned previously, due to the resource restrictions of the NetFPGA-10G platform, only one table without the stash scheme is implemented. The hardware implementation of the table-with-stash scheme will be addressed in future work.

VI. MONITORING APPLICATIONS
Monitoring applications that utilize sketch data structures can take advantage of our flow-level data collection infrastructure. In this section, stream-based entropy estimation and superspreader detection are presented based on the proposed infrastructure.
Entropy estimation: Entropy is a measure of the randomness of the information content in the distribution of a random variable. It has been used in network traffic anomaly
detection [9], [31]. The entropy value is zero if all items in the
traffic are the same, and is maximum if all items in the traffic
are distinct. The simulation of entropy norm estimation is
conducted based on the contents provided by the flow counter
table and flow key table. The results are compared to the exact
entropy norms. The accuracy is measured in terms of relative
error. Fig. 6a shows the experiment results of entropy norm
estimation for 5-tuple flow. Different Bloom filter sizes are
used in the experiments. As shown, the application can yield
estimation errors in an acceptable range for anomaly detection.
The estimation accuracy increases with the size of the Bloom filter. For Bloom filter sizes greater than 256 KB, the estimation errors are less than 1%.
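For concreteness, the exact reference values can be computed directly from the per-flow packet counts. The sketch below is our own illustration (the counts are hypothetical), using the usual entropy-norm definition S = sum_i n_i log n_i and the empirical entropy H = log m - S/m, where m is the total packet count.

import math

def entropy_norm(flow_counts):
    """Exact entropy norm S = sum_i n_i * log(n_i) over per-flow packet counts."""
    return sum(n * math.log(n) for n in flow_counts if n > 0)

def empirical_entropy(flow_counts):
    """Empirical entropy H = log(m) - S/m, with m the total packet count."""
    m = sum(flow_counts)
    return math.log(m) - entropy_norm(flow_counts) / m

# Hypothetical per-flow packet counts read out of the flow counter table.
counts = [1200, 300, 300, 7, 1, 1]
print(entropy_norm(counts), empirical_entropy(counts))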
Superspreader detection: The goal of superspreader detection is to find source IPs that contact a large number of
destination IPs. In this case, the application can utilize the source-destination IP pair data from the flow key table.
Here, a superspreader is defined as a source whose fan-out
exceeds 200. False negative ratio (FNR) and false positive ratio
(FPR) are used to measure the superspreader identification
accuracy, while the fan-out estimation accuracy is obtained
from a scatter plot. The FNR is defined as the number of
superspreaders that are not identified divided by the number
of actual superspreaders, while the FPR is the number of non-superspreaders that are incorrectly identified as superspreaders divided by the number of actual superspreaders.

Fig. 6. (a) Entropy estimation results of 5-tuple flows for different Bloom filter sizes (512 KB and 1024 KB), in terms of relative error (%) per interval. (b) Exact vs. estimated source fan-out in Interval 4. The number of actual superspreaders in this interval is 76.

To identify
the superspreaders and count their fan-outs, the application
maintains two hash tables. One hash table is used to remove
duplicated source-destination IP pairs, and the other hash table
for counting the fan-out. This method is traditionally used for
exact fan-out counting and is very accurate. The experiment
results show that there are no false positives (i.e., the FPR
values are zero) in all observation intervals, while the FNR
values in both second and third observation intervals are 0.01.
The FNR is due to false positives of the Bloom filter when
the 5-tuple is added into the flow key table. The accuracy of
fan-out estimation is shown in Fig. 6b. The closer the points
to the diagonal line, the more accurate the estimation is. The
average error in the fan-out estimation is 0.11%.
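As an illustration of the two-table method described above, the following sketch is our own simplified software model (the pair stream, names, and threshold handling are assumptions): the first table removes duplicate source-destination pairs, and the second accumulates the per-source fan-out.

from collections import defaultdict

def detect_superspreaders(src_dst_pairs, fanout_threshold=200):
    """Two-table fan-out counting: 'seen' removes duplicate (src, dst) pairs,
    'fanout' counts distinct destinations per source."""
    seen = set()                        # table 1: duplicate-removal hash table
    fanout = defaultdict(int)           # table 2: per-source fan-out counter
    for src, dst in src_dst_pairs:
        if (src, dst) not in seen:
            seen.add((src, dst))
            fanout[src] += 1
    return {src: f for src, f in fanout.items() if f > fanout_threshold}

# Hypothetical stream of source-destination IP pairs from the flow key table.
pairs = [("10.0.0.1", f"192.168.0.{i}") for i in range(250)] + \
        [("10.0.0.2", "192.168.0.1")] * 5
print(detect_superspreaders(pairs))     # {'10.0.0.1': 250}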
VII. RELATED WORK
Several works have been proposed to create a universal, minimalist, and flexible network flow monitoring system. In [11], Sekar et al. proposed a "minimalist" approach for flow monitoring. The flow-level data is collected by using a combination of flow sampling (FS), sample-and-hold (SH), and coordinated sampling (cSamp). The flow counters stored in the SRAM are divided between FS and SH. The goal of the "minimalist" approach is to minimize the complexity and resource requirements of implementing flow monitoring on network devices by using simpler flow-level data collection. Rather than having an individual counter design for each monitoring application, the counter design in the "minimalist" approach can be used for all monitoring applications. Based on a trace-driven evaluation, the proposed scheme achieves a similar level of accuracy to application-specific approaches with the same resources. However, the authors only provide assumptions and justifications for hardware implementation feasibility, processing requirements, and memory consumption, without a real implementation.
In line with [11], Liu et al. [12] developed a universal monitoring architecture (UnivMon), in which a simple and generic monitoring primitive runs on the network device. Using this simple primitive's data structure, high-accuracy results can be obtained for a broad spectrum of monitoring tasks. The results are expected to be at least equivalent to, or better than, approaches that use custom sketches for each monitoring task. The UnivMon data plane utilizes parallel Count sketches [32] to collect flow-level data. The feasibility of a hardware implementation is not discussed in the paper.
OpenSketch [10] uses a hash-based measurement data plane
and sketch data structure to count the flows. The goal of
OpenSketch is to provide a generic and efficient means of measuring network traffic by separating the measurement control
plane from the data plane functions. It supports customized flow-level data collection, in which the flows to collect and measure are selected using TCAM-based classification. After an observation interval, the sketch counters stored in SRAM are sent to the controller (the control plane, where the measurement libraries and monitoring applications reside) for further analysis.
Compared to our flow counter table design, in OpenSketch, the
SRAM is divided into a list of logical tables because different
sketches require different numbers and sizes of counters. This
approach makes the counter access mechanism (addressing)
more complex.
VIII. CONCLUSION AND FUTURE WORK
This work proposes a hardware-accelerated infrastructure
which is capable of supporting various sketch-based network
traffic measurement tasks with great flexibility. The implementation details of the flow counter table, based on a hash table with linear probing, are described. The proposed flow counter table has
been evaluated in software simulation with real-world traffic
trace for network traffic entropy estimation and superspreader
detection. The system is implemented on the NetFPGA-10G
platform and is capable of processing network traffic at 53 Gbps
in a worst-case scenario of 64-byte minimum-sized Ethernet
frame. For future work, we plan to evaluate the table-with-stash scheme in hardware and further optimize the hardware
design to achieve higher performance.
ACKNOWLEDGMENT
This research was funded in part by the National Science
Council, Taiwan, under contract numbers MOST 104-2221-E-033-007 and MOST 103-2221-E-033-030.
REFERENCES
[1] R. Hofstede, P. Celeda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, and A. Pras,
“Flow Monitoring Explained: From Packet Capture to Data Analysis With NetFlow
and IPFIX,” IEEE Communications Surveys Tutorials, vol. 16, no. 4, pp. 2037–
2064, 2014.
[2] Cisco NetFlow, http://www.cisco.com.
[3] J. Mai, C.-N. Chuah, A. Sridharan, T. Ye, and H. Zang, “Is sampled data sufficient
for anomaly detection?” in Proc. of the 6th ACM SIGCOMM Conference on
Internet Measurement, IMC ’06. New York, NY, USA: ACM, 2006, pp. 165–176.
[4] D. Brauckhoff, B. Tellenbach, A. Wagner, M. May, and A. Lakhina, “Impact
of packet sampling on anomaly detection metrics,” in Proc. of the 6th ACM
SIGCOMM Conference on Internet Measurement, IMC ’06. New York, NY,
USA: ACM, 2006, pp. 159–164.
[5] C. Estan and G. Varghese, “New directions in traffic measurement and accounting,”
in Proc. of the 2002 Conference on Applications, Technologies, Architectures, and
Protocols for Computer Communications, SIGCOMM ’02. New York, NY, USA:
ACM, 2002, pp. 323–336.
[6] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-based change detection:
Methods, evaluation, and applications,” in Proc. of the 3rd ACM SIGCOMM
Conference on Internet Measurement, IMC ’03. ACM, October 2003, pp. 234–
247.
[7] A. Kumar, M. Sung, J. J. Xu, and J. Wang, “Data streaming algorithms for efficient
and accurate estimation of flow size distribution,” in Proc. of the Joint International
Conference on Measurement and Modeling of Computer Systems, SIGMETRICS
’04/Performance ’04. New York, NY, USA: ACM, 2004, pp. 177–188.
[8] Q. Zhao, A. Kumar, and J. Xu, “Joint data streaming and sampling techniques for
detection of super sources and destinations,” in Proc. of the 5th ACM SIGCOMM
Conference on Internet Measurement, IMC ’05. Berkeley, CA, USA: USENIX
Association, 2005, pp. 77–90.
[9] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang, “Data streaming algorithms for
estimating entropy of network traffic,” SIGMETRICS Perform. Eval. Rev., vol. 34,
no. 1, pp. 145–156, 2006.
[10] M. Yu, L. Jose, and R. Miao, “Software defined traffic measurement with
OpenSketch,” in Proc. of the 10th USENIX Conference on Networked Systems
Design and Implementation, NSDI’13. Berkeley, CA, USA: USENIX Association,
2013, pp. 29–42.
[11] V. Sekar, M. K. Reiter, and H. Zhang, “Revisiting the case for a minimalist
approach for network flow monitoring,” in Proc. of the 10th ACM SIGCOMM
Conference on Internet Measurement, IMC ’10. New York, NY, USA: ACM,
2010, pp. 328–341.
[12] Z. Liu, G. Vorsanger, V. Braverman, and V. Sekar, “Enabling a ”RISC” Approach
for Software-Defined Monitoring Using Universal Streaming,” in Proc. of the 14th
ACM Workshop on Hot Topics in Networks, HotNets-XIV. New York, NY, USA:
ACM, 2015, pp. 21:1–21:7.
[13] T. Wellem, Y.-K. Lai, and W.-Y. Chung, “A Software Defined Sketch System for
Traffic Monitoring,” in Proc. of the 11th ACM/IEEE Symposium on Architectures
for Networking and Communications Systems, ANCS ’15. Oakland, CA, USA:
IEEE Computer Society, 2015, pp. 197–198.
[14] D. Shah, S. Iyer, B. Prabhakar, and N. McKeown, “Maintaining statistics counters
in router line cards,” IEEE Micro, vol. 22, no. 1, pp. 76–81, Jan. 2002.
[15] S. Ramabhadran and G. Varghese, “Efficient Implementation of a Statistics Counter
Architecture,” in Proc. of the 2003 ACM SIGMETRICS International Conference
on Measurement and Modeling of Computer Systems, SIGMETRICS ’03. New
York, NY, USA: ACM, 2003, pp. 261–271.
[16] Q. Zhao, J. Xu, and Z. Liu, “Design of a Novel Statistics Counter Architecture
with Optimal Space and Time Efficiency,” in Proc. of the Joint International
Conference on Measurement and Modeling of Computer Systems, SIGMETRICS
’06/Performance ’06. New York, NY, USA: ACM, 2006, pp. 323–334.
[17] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani, “Counter
braids: a novel counter architecture for per-flow measurement,” in Proc. of the
2008 ACM SIGMETRICS international conference on Measurement and modeling
of computer systems, SIGMETRICS ’08. New York, NY, USA: ACM, 2008, pp.
121–132.
[18] N. Hua, J. Xu, B. Lin, and H. Zhao, “BRICK: A Novel Exact Active Statistics
Counter Architecture,” IEEE/ACM Transactions on Networking, vol. 19, no. 3, pp.
670–682, Jun. 2011.
[19] T. Li, S. Chen, and Y. Ling, “Per-flow Traffic Measurement Through Randomized
Counter Sharing,” IEEE/ACM Trans. Netw., vol. 20, no. 5, pp. 1622–1634, Oct.
2012.
[20] C. Hu, B. Liu, H. Zhao, K. Chen, Y. Chen, Y. Cheng, and H. Wu, “Discount
Counting for Fast Flow Statistics on Flow Size and Flow Volume,” IEEE/ACM
Transactions on Networking, vol. 22, no. 3, pp. 970–981, Jun. 2014.
[21] A. Pagh and R. Pagh, “Linear probing with constant independence,” in Proc.
of the 39th annual ACM symposium on Theory of computing, STOC ’07. ACM
Press, 2007, pp. 318–327.
[22] M. Patrascu and M. Thorup, “On the k-independence required by linear probing
and minwise independence,” in Proc. of the 37th International Colloquium
Conference on Automata, Languages and Programming, ICALP’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 715–726.
[23] M. Thorup and Y. Zhang, “Tabulation based 5-universal hashing and linear probing,” in Proc. of the 12th Workshop on Algorithm Engineering and Experiments
ALENEX, 2010, pp. 62–76.
[24] A. Kirsch and M. Mitzenmacher, “The power of one move: Hashing schemes for
hardware,” IEEE/ACM Trans. Netw., vol. 18, no. 6, pp. 1752–1765, Dec. 2010.
[25] Y.-K. Lai, T. Wellem, and H.-P. You, “Hardware-assisted estimation of entropy
norm for high-speed network traffic,” Electronics Letters, vol. 50, no. 24, pp.
1845–1847, 2014.
[26] G. Cormode and S. Muthukrishnan, “An improved data stream summary: The
count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp.
58–75, April 2005.
[27] T. Wellem, G.-W. Li, and Y.-K. Lai, “Superspreader Detection System on NetFPGA
Platform,” in Proc. of the Tenth ACM/IEEE Symposium on Architectures for
Networking and Communications Systems, ANCS ’14. New York, NY, USA:
ACM, 2014, pp. 247–248.
[28] M. Yoon, T. Li, S. Chen, and J.-K. Peir, “Fit a Compact Spread Estimator in Small
High-Speed Memory,” IEEE/ACM Transactions on Networking, vol. 19, no. 5, pp.
1253–1264, Oct. 2011.
[29] “CAIDA UCSD Anonymized Internet Traces,” 2012.
[30] NetFPGA project site, http://netfpga.org.
[31] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature
distributions,” SIGCOMM Comput. Commun. Rev., vol. 35, no. 4, pp. 217–228,
2005.
[32] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data
streams,” in Proc. of the 29th International Colloquium on Automata, Languages
and Programming. Springer-Verlag, July 2002, pp. 693–703.