
TCP SPLITTER: A TCP/IP FLOW MONITOR IN RECONFIGURABLE HARDWARE

THIS FLOW-MONITORING CIRCUIT DELIVERS AN ORDERED BYTE STREAM TO A CLIENT APPLICATION FOR EVERY TCP/IP CONNECTION IT PROCESSES. USING AN ACTIVE FLOW-PROCESSING ALGORITHM, TCP SPLITTER IS A LIGHTWEIGHT, EFFICIENT DESIGN THAT SUPPORTS THE MONITORING OF AN ALMOST UNLIMITED NUMBER OF FLOWS AT MULTIGIGABIT LINE RATES.

David V. Schuehler and John W. Lockwood
Washington University in St. Louis

High-speed network switches currently operate at OC-48 (2.5 gigabits per second) line rates, and faster OC-192 (10 Gbps) and OC-768 (40 Gbps) networks are on the horizon. At the same time, network traffic continues to increase.1 Studies have found that more than 85% of the packets traveling on the Internet are based on the Transmission Control Protocol/Internet Protocol (TCP/IP).2,3 The latest network-processing systems require scanning and processing of data in both headers and payloads of TCP/IP packets. To scan payloads at high rates, these systems need new methods of processing TCP/IP data in hardware. A hardware implementation of a full TCP/IP protocol stack acting as a communication end point would be useful. Unfortunately, several problems make the full implementation of a TCP/IP stack in hardware impractical; these include the need for many TCP timers, the need for large memories for reassembly buffers, and the need to support many connections. At Washington University's Applied Research Laboratory, we have developed a TCP-flow-monitoring circuit that provides client application systems with an ordered TCP data stream.

Instead of acting as a connection end point for a few TCP connections, this circuit, called TCP Splitter, monitors all TCP flows passing through the network hardware. This technique has many advantages over implementing a TCP end point. For reliable delivery of a data set to a client application, a TCP connection only needs to transit the device monitoring the data. The TCP end points, not the logic on the network hardware, manage the work for guaranteed delivery. Because the retransmission logic remains at the connection end points, not in the active network switch, the lightweight monitor does not require a complex protocol stack. (The "Related work" sidebar summarizes other TCP/IP-monitoring research.)

Background
Current-generation field-programmable gate arrays (FPGAs) have approximately the capacity of a million-gate application-specific IC (ASIC), a few hundred Kbytes of on-chip memory, and operation speeds ranging from 50 to 200 MHz. By placing FPGAs in the data path of a high-speed network switch, designers can implement network-processing functions without reducing the switch's overall throughput.


The Applied Research Laboratory developed the Washington University Gigabit Switch as a research platform for high-speed networks.4 We used this hardware, along with the Field-Programmable Port Extender (FPX), as the testbed for the TCP Splitter project.5 For TCP Splitter's foundation, we used components of the layered protocol wrappers developed for the FPX.6 These wrappers process high-level packets in reprogrammable logic. They include an asynchronous transfer mode (ATM) cell wrapper, an ATM adaptation layer 5 (AAL5) frame wrapper, an IP wrapper, and a user datagram protocol (UDP) wrapper. This set of wrappers lets a client application send and receive packets with FPGA hardware. We used the cell, frame, and IP wrappers as a framework in which to implement TCP Splitter.

Related work
Protocol analyzers and packet-capturing programs have been around as long as there have been protocols and networks to monitor. These tools provide a wide range of capabilities for capturing and saving network data. Programs such as tcpdump capture and store TCP packets.1 These tools work well for monitoring data at low bandwidth rates, but their performance is limited because they execute in software. With these tools, reconstructing TCP data streams requires postprocessing.

HTTPDUMP captures and stores Web-based hypertext transfer protocol (HTTP) traffic,2 but as a result of the extra filtering logic for processing HTTP traffic, this tool requires more processing and runs slower than tcpdump. PacketScope, developed at AT&T, monitors much larger volumes of network traffic but relies on tcpdump's capabilities to perform packet capturing.3 BLT (bilayer tracing) leverages the PacketScope monitor to perform HTTP monitoring of links with line speeds greater than 100 Mbps.4 This tool does not ensure the processing of all packets but instead attempts to obtain statistically relevant results by capturing a large portion of the HTTP traffic.

The Internet Protocol Scanning Engine is another software-based TCP/IP monitor.5 Typically, it captures only header information, which it writes to a log file, rather than TCP stream content. This program also has performance limitations that preclude it from monitoring high-bandwidth traffic. The Cluster-Based Online Monitoring System does a much better job of capturing data associated with Web requests.6 Multiple analysis engines working in parallel improve its performance over other systems; yet even with eight analysis engines, it does not consistently monitor traffic on a 100-Mbps network.

None of these solutions can operate in a high-speed active networking environment where data rates exceed 1 Gbps, nor can they guarantee the processing of every byte of data on the network.

Researchers at the Georgia Institute of Technology have developed a TCP state-tracking engine with buffer reassembly.7 This project focuses on detecting intrusion and tracking a single connection's TCP/IP processing state. The state-tracking engine also performs limited buffer reassembly. This solution uses a hardware environment similar to that of TCP Splitter and processes data at equal line rates. By instantiating multiple processing circuits, the engine monitors a maximum of 30 TCP/IP connections simultaneously on a single field-programmable gate array.

References
1. S. McCanne, C. Leres, and V. Jacobson, tcpdump, http://www.tcpdump.org/.
2. R. Wooster, S. Williams, and P. Brooks, HTTPDUMP Network HTTP Packet Snooper, http://citeseer.nj.nec.com/332269.html, 1996.
3. N. Anerousis et al., Using the AT&T Labs PacketScope for Internet Measurement, Design, and Performance Analysis, http://citeseer.nj.nec.com/477885.html, 1997.
4. A. Feldmann, BLT: Bi-Layer Tracing of HTTP and TCP/IP, WWW9/Computer Networks, vol. 33, no. 1-6, 2000, pp. 321-335; http://citeseer.nj.nec.com/feldmann00blt.html.
5. I. Goldberg, Internet Protocol Scanning Engine, http://www.cs.berkeley.edu/~iang/isaac/ipse.html.
6. Y. Mao et al., Cluster-Based Online Monitoring System of Web Traffic, Proc. 3rd Int'l Workshop Web Information and Data Management, ACM Press, 2001, pp. 47-53.
7. M. Necker, D. Contis, and D. Schimmel, TCP-Stream Reassembly and State Tracking in Hardware, Proc. 10th Ann. Symp. Field-Programmable Custom Computing Machines (FCCM 02), IEEE CS Press, 2002, pp. 286-287.

Design requirements
TCP Splitter is a lightweight, high-performance circuit that contains a simple client interface and can monitor an almost unlimited number of flows. To achieve this result within the practical bounds of today's hardware, we made design tradeoffs. Some of the issues we faced were handling dropped and reordered packets, maintaining state for numerous flows, processing data at line rates, and minimizing hardware gate count.

To overcome these challenges, we restricted the way data flows through the network switch: all packets associated with a monitored TCP/IP connection must pass through the networking node where monitoring takes place. It would be impossible to provide a client application with a consistent TCP byte stream from a connection if the switch performing the monitoring processed only a fraction of the TCP packets. Generally, edge routers satisfy this requirement, but interior Internet nodes do not. Private networks designed to pass traffic in a certain manner can also enforce it.

An active solution
A problem with attempting to monitor many TCP/IP flows is that such a system would require a large amount of memory. The maximum window scale factor that TCP supports is 2^14, which leads to a maximum window size of 1 Gbyte.7 In a worst-case scenario, reassembling packets in each direction of a TCP connection would require that much memory. A high-speed switch monitoring both directions of 128,000 connections would require 256 Tbytes of high-speed RAM. Even if we assume that TCP window size is limited to 1 Mbyte, the system still requires 256 Gbytes of memory to monitor both directions of the same 128,000 connections. This prohibitive quantity of memory led us to consider other lightweight designs.

We chose a design that eliminates the need for reassembly buffers: if all frames for a particular flow transit the switch in order, no reassembly is required. Because there is no guarantee that TCP frames will cross the network in order, some action must occur when packets are out of order. By actively dropping out-of-order packets, TCP Splitter can provide an ordered TCP byte stream to the client application without requiring reassembly buffers. If TCP Splitter detects a missing packet, it actively drops subsequent packets until the sender retransmits the missing packet. This ensures in-order packet flow through the switch.

This design feature forces the TCP connections into a go-back-n sliding-window mode when a packet is dropped upstream of the monitoring node. As it turns out, machines throughout the Internet use the go-back-n retransmission policy. Many TCP implementations, including those of Windows 98, FreeBSD 4.1, and Linux 2.4, use the go-back-n retransmission logic.8 Its benefit or detriment to throughput depends on the specific TCP implementations in use at the end points. In cases where the receiving TCP stack is performing go-back-n sliding-window behavior, actively dropping frames might improve overall network throughput by eliminating packets that the receiver will discard. On the other hand, when the end points use selective retransmission and the percentage of network data loss is high, the TCP Splitter operation can potentially exacerbate the dropped-packet problem and render the connection unusable.
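To make the active-drop algorithm concrete, the following C sketch shows the per-flow decision it implies. This is an illustration of the behavior described above, not the authors' circuit; the type and function names are our inventions.

    /* Minimal sketch of TCP Splitter's in-order enforcement for one flow. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t expected_seq;   /* next in-order TCP sequence number for this flow */
    } flow_state_t;

    typedef enum { FWD_DROP, FWD_DEST_ONLY, FWD_DEST_AND_CLIENT } fwd_t;

    /* TCP sequence numbers wrap, so compare them modulo 2^32. */
    static bool seq_before(uint32_t a, uint32_t b) { return (int32_t)(a - b) < 0; }

    fwd_t on_tcp_segment(flow_state_t *f, uint32_t seq, uint32_t payload_len)
    {
        if (seq == f->expected_seq) {          /* exactly in order: pass it on and */
            f->expected_seq += payload_len;    /* advance the expected position    */
            return FWD_DEST_AND_CLIENT;
        }
        if (seq_before(seq, f->expected_seq))  /* already-seen data: forward it so */
            return FWD_DEST_ONLY;              /* the end point can discard it     */
        return FWD_DROP;                       /* a gap: drop until the sender     */
    }                                          /* retransmits the missing packet   */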
Architecture
We implemented TCP Splitter in FPGA hardware; it fits within the FPX protocol wrapper framework. Figure 1 shows a high-level view of the data flow through TCP Splitter. IP frames enter TCP Splitter from the IP protocol layer contained in the protocol wrappers.6 TCP Splitter's name reflects the fact that the circuit splits the TCP byte stream into two separate flows. As Figure 1 shows, one flow goes to the client application on a local host, while the other goes to the destination.

[Figure 1. TCP Splitter's data flow. The TCP flow from source to destination passes through the protocol wrappers and TCP Splitter, which delivers a copy of each flow's byte stream to the client application.]

Figure 2 shows TCP Splitter's layout. Inbound frames enter TCP Splitter, which classifies, checksums, and caches them. Outbound IP frames go back to the IP wrapper and then to the next-hop router. TCP Splitter also delivers a TCP byte stream to the client application for each TCP flow in the network.

TCP Splitter consists of two logical sections. The first, TCP input, handles the ingress of IP frames. This section performs most of TCP Splitter's processing. The second section, TCP output, handles packet routing and frame delivery to the outbound IP stack and the client application.

[Figure 2. TCP Splitter's layout. In the TCP input section, a flow classifier, checksum engine, input state machine, control FIFO, and frame FIFO process frames from the IP input; an output state machine feeds the TCP output section, whose packet routing and frame delivery logic drives the IP output and the client application.]

Input processing
As Figure 2 shows, the TCP input section consists of six components. The flow classifier, the checksum engine, the input state machine, the control FIFO, and the frame FIFO all process IP packet data received from the IP protocol wrapper. The output state machine retrieves data from the control and frame FIFOs and sends it to the TCP output section.

IP frames enter the input section 32 bits at a time. The checksum engine computes the TCP checksum using the appropriate bits in each data word. The frame FIFO also stores the input data so that the output state machine can send the TCP checksum result to the output section along with the start of the IP packet.

Once the checksum engine computes the TCP checksum, it writes information about the current frame to the control FIFO. This data includes the checksum result (pass or fail), the flow identifier, the start- and end-of-flow signals, a TCP frame indicator, and a signal that indicates whether or not the output section should forward the frame only to the destination. The control FIFO holds state information for smaller frames while the output state machine is still retrieving preceding larger frames from the frame FIFO for output processing.

Upon detecting a nonempty control FIFO, the output state machine starts reading the next frame from the frame FIFO. It passes this frame data and the associated control signals from the control FIFO to the TCP output section.
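As a rough software analogue of the checksum engine's work, the following sketch folds the TCP checksum over 32-bit words the way the engine consumes each frame. The function names are ours; the hardware performs the equivalent 16-bit one's-complement arithmetic in its pipeline.

    /* Software sketch (not the synthesized circuit) of per-word checksum folding. */
    #include <stdint.h>
    #include <stddef.h>

    /* Add one 32-bit data word to the running sum as two 16-bit addends. */
    static uint32_t csum_add32(uint32_t sum, uint32_t word)
    {
        return sum + (word >> 16) + (word & 0xffffu);
    }

    /* words[] must cover the TCP pseudoheader, the TCP header (including the
       stored checksum field), and the zero-padded payload.  A result of 0
       means the checksum verifies. */
    static uint16_t tcp_checksum(const uint32_t *words, size_t nwords)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < nwords; i++)
            sum = csum_add32(sum, words[i]);
        while (sum >> 16)                       /* fold the carries back in */
            sum = (sum & 0xffffu) + (sum >> 16);
        return (uint16_t)~sum;                  /* 0 => pass, else => fail  */
    }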

Flow classification
TCP Splitter's simple flow classifier can operate at high speed and has minimal hardware complexity. The flow table is a 262,144-element array contained in a low-latency static-RAM chip. Each table entry contains 33 bits of state information. An 18-bit hash of the source IP address, destination IP address, source TCP port, and destination TCP port serves as the index into the flow table. The detection of a TCP FIN or RST flag signals the end of a TCP flow and clears the hash table entry for that flow. Currently, this flow classifier does not handle hash table collisions, which cause TCP Splitter to process packets from different flows as if they were a single connection.

There are many recent innovations in high-performance flow classifiers capable of operating at network line speeds. We could use many of these classifiers to identify traffic flows for TCP Splitter. Switchgen is a tool that transforms packet classification rules into a reconfigurable hardware-based circuit design.9 The recursive flow classification algorithm, another high-performance classification technique, optimizes rules by removing redundancy.10 Both of these research projects are developing flow classifiers that perform 30 million to 100 million classifications per second. The aggregate bit vector approach reduces the number of required memory lookups to achieve a high-performance classifier supporting large rule sets.11

Other researchers have proposed a packet classification solution that performs lookups using a series of pipelined SRAMs.12 This technology could support 1 billion packet classification lookups per second.




TCP Splitter imposes no restrictions on the flow classification technique and can use any flow classifier.
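The classifier's operation can be summarized in a few lines of C. The table geometry below comes from the text; the specific hash mix and the 33-bit entry layout are our assumptions, since the article does not give the hardware's hash function.

    /* Illustrative sketch of the flow-table lookup, not the circuit's hash. */
    #include <stdint.h>
    #include <stdbool.h>

    #define FLOW_TABLE_ENTRIES (1u << 18)       /* 262,144 entries, 18-bit index */

    typedef struct {             /* 33 bits of state per entry, e.g.:        */
        uint32_t expected_seq;   /* a 32-bit expected sequence number        */
        bool     active;         /* plus a flow-active bit                   */
    } flow_entry_t;

    static flow_entry_t flow_table[FLOW_TABLE_ENTRIES];

    /* Reduce the connection 4-tuple to an 18-bit table index.  Collisions are
       not resolved: two flows that collide are treated as one connection. */
    static uint32_t flow_index(uint32_t src_ip, uint32_t dst_ip,
                               uint16_t src_port, uint16_t dst_port)
    {
        uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port);
        h ^= h >> 18;
        return h & (FLOW_TABLE_ENTRIES - 1);
    }

    /* A TCP FIN or RST ends the flow and clears its entry. */
    static void flow_end(uint32_t src_ip, uint32_t dst_ip,
                         uint16_t src_port, uint16_t dst_port)
    {
        flow_table[flow_index(src_ip, dst_ip, src_port, dst_port)].active = false;
    }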

Output processing
TCP Splitter's output-processing section determines how a packet should be processed. There are three possible choices: packets can go to the outbound IP layer only, go to both the outbound IP layer and the client application, or be discarded. Packets containing sequence numbers less than the expected sequence number are packets that TCP Splitter has already processed; TCP Splitter forwards these packets to the destination to account for packets that were dropped between the monitor and the destination. The rules for processing packets are as follows (the sketch after this list renders the same logic in C):

• Send all non-TCP packets to the outbound IP stack.
• Drop all TCP packets with invalid checksums.
• Send all TCP packets with sequence numbers less than the current expected sequence number to the outbound IP stack.
• Drop all TCP packets with sequence numbers greater than the current expected sequence number.
• Send all SYN packets to the outbound IP stack.
• Send all other packets to both the outbound IP stack and the client application.
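In C, the rules reduce to a small decision function. The field and enum names are ours; sequence comparisons are modulo 2^32 because TCP sequence numbers wrap.

    /* Hedged rendering of the output-routing rules above. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     is_tcp, checksum_ok, is_syn;
        uint32_t seq;            /* segment's sequence number         */
        uint32_t expected_seq;   /* from this flow's table entry      */
    } seg_info_t;

    typedef enum { ROUTE_DROP, ROUTE_IP_ONLY, ROUTE_IP_AND_CLIENT } route_t;

    static bool seq_after(uint32_t a, uint32_t b) { return (int32_t)(a - b) > 0; }

    route_t route_segment(const seg_info_t *s)
    {
        if (!s->is_tcp)       return ROUTE_IP_ONLY;      /* non-TCP traffic   */
        if (!s->checksum_ok)  return ROUTE_DROP;         /* bad checksum      */
        if (s->is_syn)        return ROUTE_IP_ONLY;      /* connection setup  */
        if (seq_after(s->seq, s->expected_seq))
                              return ROUTE_DROP;         /* out of order      */
        if (seq_after(s->expected_seq, s->seq))
                              return ROUTE_IP_ONLY;      /* already processed */
        return ROUTE_IP_AND_CLIENT;                      /* in-order data     */
    }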

Client interface
The client interface provides a simple hardware interface for application circuits. It delivers only valid, checksummed, in-order TCP packet data for each flow, so the client processes nothing but the ordered byte stream of each TCP connection. The interface clocks all packet headers into the client application, along with a start-of-header signal, so that the client can extract information from the headers. This method eliminates the need to store header information but still gives the client access to this data. Because the client application is not in the network data path, it does not add delay to the packets crossing the network switch. Thus, the client application can be arbitrarily complex without affecting TCP Splitter's throughput rate.
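In software terms, the interface's behavior maps onto something like the following sketch. The signal names are our guesses at a plausible rendering, and the client shown mirrors the byte-counting sample application described in the Results section below.

    /* Assumed software rendering of the client interface: the hardware clocks
       in 32-bit words with framing strobes; this client just counts bytes. */
    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_FLOWS (1u << 18)

    typedef struct {
        uint32_t data;             /* 32-bit word presented this cycle          */
        uint32_t flow_id;          /* which TCP flow the word belongs to        */
        bool     start_of_header;  /* asserted on the first word of a header    */
        bool     payload_valid;    /* asserted only for in-order, checksummed   */
        uint8_t  valid_bytes;      /* payload bytes valid in this word (1..4)   */
    } client_word_t;

    static uint64_t byte_count[MAX_FLOWS];

    /* The client ignores headers (it could parse them on start_of_header) and
       simply tallies the ordered payload bytes delivered for each flow. */
    void client_consume(const client_word_t *w)
    {
        if (w->payload_valid)
            byte_count[w->flow_id] += w->valid_bytes;
    }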

Results
We implemented TCP Splitter as a module on a Xilinx Virtex XCV1000E-7 FPGA. We synthesized the circuit to provide processing at full OC-48 line speeds on the FPX platform. It operates at a post-place-and-route frequency of 101 MHz and has a corresponding throughput of 3.2 Gbps. The design's critical path includes the 16-bit arithmetic operations that compute the TCP checksum.

The TCP Splitter implementation is small; it uses only 2 percent of the FPGA. A complete solution, including TCP Splitter, the protocol wrappers, and a sample client application that simply counts TCP data bytes, requires 21 percent of the FPGA's resources. TCP Splitter has a pipeline delay of only seven clock cycles, which introduces a total data path delay of 70 ns. To avoid forwarding erroneous frames, TCP Splitter adds one store-and-forward delay, allowing time to compute and verify the TCP checksum.

Future work
We plan to increase TCP Splitter's throughput to support OC-768 line rates. To accomplish this goal, we will exploit additional pipeline stages and the parallelism available in the FPGA. In the current implementation, the flow classifier performs a maximum of two memory accesses for each packet. Given that TCP Splitter's input data width is 32 bits (4 bytes), and assuming minimum-length packets of 64 bytes, the smallest operation period is 16 clock cycles. In that time, eight TCP Splitter engines could run in parallel and perform one memory access on every clock cycle. By using both of the static-RAM modules on the FPX platform, we could design a solution with 16 TCP Splitter engines, each operating at 101 MHz, which would process data at 51 Gbps. This is sufficient bandwidth to monitor all TCP/IP flows at OC-768 line rates.

Another planned enhancement is the addition of a few packet-reassembly buffers. These buffers would support the reassembly of IP fragments and TCP packets to provide a passive monitoring solution. We also plan improvements to the flow classifier to eliminate the hash table collision problem.
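The scaling arithmetic is easy to check; this toy program recomputes the figures quoted above.

    /* Back-of-the-envelope check: 32-bit datapath at 101 MHz, 64-byte minimum
       packets, 16 engines (8 per SRAM module on the FPX). */
    #include <stdio.h>

    int main(void)
    {
        const double clock_hz      = 101e6;
        const int    width_bytes   = 4;                        /* 32-bit words */
        const int    min_pkt_bytes = 64;
        const int    engines       = 16;

        int    cycles_per_pkt  = min_pkt_bytes / width_bytes;  /* 16 cycles    */
        double per_engine_gbps = clock_hz * width_bytes * 8 / 1e9;  /* ~3.2    */

        printf("cycles per minimum packet: %d\n", cycles_per_pkt);
        printf("aggregate throughput: %.1f Gbps\n", engines * per_engine_gbps);
        return 0;   /* prints ~51.7 Gbps, matching the ~51 Gbps quoted above */
    }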

TCP Splitter differs from other TCP/IP network monitors because it

• is implemented in reconfigurable hardware,
• processes packets at line rates exceeding 3 Gbps,
• can monitor 256,000 TCP flows simultaneously,
• delivers a consistent byte stream for each TCP flow to a client application,
• processes data in real time, and
• eliminates the need for large reassembly buffers.

We have successfully tested a sample application in hardware, using simulated TCP data packets. Although we developed the circuit as a module on the FPX platform, TCP Splitter can easily be ported to other FPGA- or ASIC-based packet-processing systems.

References
1. L. Roberts, Internet Still Growing Dramatically Says Internet Founder, http://www.caspiannetworks.com/press/releases/08.15.01.shtml, Aug. 2001.
2. RFC 793: Transmission Control Protocol, http://www.faqs.org/rfcs/rfc793.html, 1981.
3. S. Shalunov and B. Teitelbaum, Bulk TCP Use and Performance on Internet2, http://www.internet2.edu/abilene/tcp/i2tcp.pdf, Aug. 2001.
4. T. Chaney et al., Design of a Gigabit ATM Switch, Proc. Infocom 97, IEEE CS Press, 1997, pp. 2-11.
5. J.W. Lockwood, An Open Platform for Development of Network Processing Modules in Reprogrammable Hardware, Proc. IEC DesignCon 01, Int'l Eng. Consortium, 2001, p. WB-19.
6. F. Braun, J.W. Lockwood, and M. Waldvogel, Layered Protocol Wrappers for Internet Packet Processing in Reconfigurable Hardware, Proc. Symp. High-Performance Interconnects (Hot Interconnects IX), IEEE CS Press, 2001, pp. 93-98.
7. V. Jacobson and R. Braden, RFC 1072: TCP Extensions for Long-Delay Paths, http://www.faqs.org/rfcs/rfc1072.html, 1988.
8. A. Gurtov, Effect of Delays on TCP Performance, Proc. IFIP Personal Wireless Communications 2001, Int'l Federation for Information Processing, 2001, pp. 87-108.
9. A. Johnson and K. Mackenzie, Pattern Matching in Reconfigurable Logic for Packet Classification, Proc. Int'l Conf. Compilers, Architectures and Synthesis for Embedded Systems (CASES 01), ACM Press, 2001, pp. 126-130.
10. P. Gupta and N. McKeown, Packet Classification on Multiple Fields, Proc. ACM Sigcomm, ACM Press, 1999, pp. 147-160.
11. F. Baboescu and G. Varghese, Scalable Packet Classification, Proc. ACM Sigcomm, ACM Press, 2001, pp. 199-210.
12. A. Prakash and A. Aziz, OC-3072 Packet Classification Using BDDs and Pipelined SRAMs, Proc. Symp. High-Performance Interconnects (Hot Interconnects IX), IEEE CS Press, 2001, pp. 15-20.

David V. Schuehler is a doctoral student in the Applied Research Laboratory of Washington University in St. Louis. He is also vice president of research and development for Reuters. His research interests include real-time processing, embedded systems, and high-speed networking. Schuehler has a BS in aeronautical and astronautical engineering from Ohio State University and an MS in computer science from the University of Missouri-Rolla. He is a member of the IEEE and the ACM.

The biography of John W. Lockwood appears on p. 9.

Direct questions and comments about this article to David V. Schuehler, Washington University, Applied Research Laboratory, Campus Box 1045, One Brookings Dr., St. Louis, MO 63130; dvs1@arl.wustl.edu.

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.
