Mapping Networking Applications To Multiprocessor-FPGA Configurable Computing Systems Siva Subramanian, Nortel Networks Clay S. Gloster, Jr., Howard University * Introduction Configurable Computing (CC) systems use Field Programmable Gate Arrays (FPGAs) to accelerate compute-intensive applications on general purpose processors. Networking applications are typically compute-intensive and require low initiation intervals in addition to small execution times. Network Processing Units (NPU) are integrated multiprocessors that have been optimized for networking applications. Due to their simplified bit-oriented architecture, NPUs exhibit reduced performance when applications require increased processing power per packet. Line-rate processing of complex-operation networking applications such as load-balancing, compression, application firewalls, or intrusion-detection requires a different approach to meet their high performance constraints. These applications can benefit from a methodology that combines the benefits of configurable computing with the multiprocessor features of network processors. Until recently, solutions using multiple processors and FPGA devices were impractical in terms of power, area, cost and development time. Recent advances in Very Large Scale Integration (VLSI) technology have resulted in new high-density FPGA architectures with multiple embedded processors. Such highly integrated architectures enable practical solutions for line-rate processing of complex networking applications. Existing methods of mapping applications to configurable computing systems are limited to architectures with a single general purpose processor and conventional FPGAs. Very little research has been published that addresses mapping of networking applications to multiprocessor FPGA systems. This paper addresses this problem by proposing a methodology for mapping networking applications to multiprocessor-embedded FPGA systems. It presents an innovative architecture that uses multiprocessor pipelining and interleaving concepts along with configurable computing concepts to create a Configurable Application Pipeline (CAP). CAPillary, an algorithm for generating CAP solutions for a given networking application, is presented along with examples and results that demonstrate the effectiveness of the proposed methodology. * Previous Research FPGA-based Custom Computing Machines (FCCMs) are highly flexible and offer extremely high performance for tasks such as image processing, pattern matching, object tracking, cryptography etc. [Vi98, Vu96, Ab94, At95, Ho93, Fi98]. The main concept of an FCCM is to map the computationally intensive portions of an application to FPGA resources and execute the remaining code on a general purpose processor. Line-rate processing of networking applications at high networking speeds (multi-gigabits-per-second) require the use of multiple hardware-assisted processing engines. The target architectures for this research, therefore, are multi-processor systems tightly coupled to FPGAs. Only recently, the research community has attempted the use of configurable computing concepts for networking applications [Le99, Sl00, Lo01, Ta01, Be02, Br02, Fr02, Ku02, Me02, Sp02]. A majority of these efforts focus on hardware-only implementations of networking functions [Le99, Lo01, Ta01, Be02, Fr02, Me02]. The research in [Ku02] is focused more on the networking platform architecture and applications and the work in [Sp02] looks at operation of the networking platform in a network system. The FPGA system used within the system described in [Ku02] is also presented in [Lo01, Ta01]. While some of these approaches [Sl00, Sp02, Br02] use both CPUs as well as FPGAs, they do not focus on the hardware-software partitioning problem nor do they propose a methodology that can be used for a range of applications. In short, they are point solutions to specific networking application performance problems. This is not surprising given that implementing networking applications on configurable computing systems is a relatively new area that has seen activity only in the last 2-4 years. Some of these projects are described in more detail below. The design of a high-performance dynamically extensible router is presented in [Lo01, Ta01, Ku02, Sp02]. This system architecture uses FPGA based port processors which are named the FPX (Field Programmable Port Extender) [Ta01]. Each of the line cards used in the dynamically extensible router has access to a port processor which consists of one or more CPUs on a smart port card (SPC) and an optional FPGA based FPX module. In [Fr02], Franklin et. al. tackle the specific payload processing networking application of Network Intrusion Detection. Their contribution is a FPGA soft hardware core designed to offload the string matching operation required in the intrusion detection application. Another effort at using a FPGA-based co-processor for payload processing networking applications can be found in [Me02]. The authors present a novel design for a generic hardware accelerator module that is capable of multiple tasks such as tree-lookup and pattern matching. Their results show up to 20x speedup over software implementations of these tasks. All of the above research focuses on design of specific hardware modules rather than the application mapping process. They do not deal with the system level application partitioning or mapping issues. Also the focus of these papers is limited to specific functions within an application that can be implemented in reconfigurable hardware. A slightly different approach is used by Brebner [Br02] to implement a multi-processor based mixed-version IP (Internet Protocol) router on a Xilinx Virtex-II Pro device. This is the only research effort that has taken advantage of the new multi-processing capabilities of SoC (System on a Chip) FPGA architectures. The goal of the work was specifically to demonstrate a complete system on an FPGA chip by using the features of the Xilinx Virtex-II Pro device. No effort is made to partition any of the tasks of the router application to hardware. Extensive background research presented in this paper demonstrates that the configurable computing (CC) research community has only recently begun to investigate the use of CC concepts into complex networking applications. Also, little research has been done based on multiprocessor configurable computing architectures. To the best of the authors knowledege, no research has been published that maps networking applications to multiprocessor-embedded FPGA systems. * Configurable Application-specific Pipelines: Architecture & Methodology The use of configurable computing, while providing a boost in performance compared to software implementation, often results in poor utilization of the FPGA. This is primarily due to uneven distribution of tasks on the CPU and FPGA and can vary based on the application and the tasks partitions.The performance and efficiency of the configurable computing CE module can be improved by applying some concepts found in network processing architectures. Pipelining & Interleaving: The throughput of a task can be increased using finer pipelining or interleaving. In pipelining, the doubling of performance is achieved by dividing task A into two parts, each with half the execution time of task A, and adding a register between them. In contrast, the interleaving method does not require alteration of the task itself. Interleaving achieves the doubling in performance by replicating the task. A demultiplexer is required at the input and a multiplexor is required at the output. Data-flow programming: Data-flow programming refers to the division of an application program into subtasks that are connected by data-flow paths. Each subtask can then be mapped to a CPU or FPGA device. The data-flow architectural concept is an old concept that has been applied to DSP problems in the past in synchronous data flow systems [Le287][Le387] and block data-flow systems [Al96]. Multiple IO buses: By decoupling the input and output buses used for inter-task communication, the throughput of the system can be greatly increased. This also enables pipelining of the functional units (CPUs and FPGAs). Use of the separate buses for inter-task communication also eliminates the need for scheduling as the bus is no longer a shared resource. The concepts described above can be used to improve performance of configurable computing systems for a range of applications. The overhead involved in doing so are the memory buffers between each stage of the pipeline which store the output of a stage until the next stage is ready to process the data. Advances in VLSI technology have made it possible to tightly integrate such a configurable architecture with memory onto a single device. FPGA devices with embedded hard processor cores, such as the Xilinx Virtex2Pro family, are ideally suited for such multiprocessor configurable computing pipelines. A pipelined implementation of an application using multiple processors and hardware modules in a FPGA-based configurable computing system is known as a Configurable Application-specific Pipeline (CAP). The focus of this research is a methodology for mapping networking applications to multiprocessor-embedded FPGA configurable computing systems. In particular, the research aims to take advantage of advanced features of next-generation FPGA devices with System-on-a-Chip (SoC) architectures. The target system is assumed to have an array of such FPGAs interconnected by high-speed buses or serial links. Multiple applications can execute on the system simultaneously. However, an FPGA device cannot be shared between two applications. The target system is characterized by several attributes including the number of FPGA devices, the number of CPUs per FPGA device, the number of serial links between FPGA devices, and the number of logic blocks within each FPGA device. It is assumed that all FPGA devices on the target system are of the same type i.e. have the same attributes. The methodology enables system designers to map a networking application on to a multiprocessor-embedded FPGA system. The algorithm presented in this paper, specfically targets configurable computing systems based on multiprocessor-embedded FPGAs. The algorithm is designed to take advantage of hard processor cores embedded within such FPGAs as well as soft processor cores which leads to better utilization of the configurable logic within the FPGA devices. Given a coarse-grained acyclic data flow graph for a networking application, a hardware library, a software library, a performance target and architectural constraints for the multiprocessor-embedded FPGA system, the proposed methodology determines the hardware-software selection on a per-task basis, the number of hardware resources required, the number and type of software resources required and configuration of the CAP such that, the desired performance is achieved, the number of FPGAs in the solution is minimized and the architectural constraints of the target system are satisfied. The proposed methodology transforms the application DFG into a CAP by assigning each task to a physical resource. It then iteratively modifies the CAP to meet the target performance and reduce the resource utilization. It is important to note that the methodology only pipelines the application tasks but does not partition the tasks. The designer is expected to partition the tasks into modules that will meet the timing requirements of the CAP. Thus each task occupies a stage of the pipeline and accepts inputs from preceding tasks and generates outputs for its succeeding tasks. The fundamental premise of the CAP is that each node represents one or more functions that can be invoked whenever input data is available. This concept has been used in data flow architectures for digital signal processing applications in [Al96]. Thus, execution of the CAP is controlled by the availability of data, i.e the CAP is data-driven. A task assigned to a resource executes only when data is available and there is sufficient room in its output buffers. Each node of the CAP is assumed to have sufficient output buffers to store the output data of a processed block. If the output buffers are full, then the pipeline stalls and each preceding stage of the pipeline stops functioning until the congestion is cleared. The CAP application, thus executes on a block-by-block basis. Block processing is a characteristic attribute of networking applications which makes them amenable for CAP implementations. * Results & Conclusion The CAPillary algorithm, which is part of the proposed methodology is aimed at harnessing the power of future generation FPGA architectures, the first of which are appearing in the industry today. Specifically, the algorithm has been designed to take advantage of hard processor cores embedded within FPGA devices as well as new soft processor cores which have been enabled due to the feasibility of multi-million gate FPGA devices. In addition to the methodology and algorithm this paper presents an innovative architecture that combines features of pipelined, interleaved multiprocessor systems with configurable computing concepts towards the creation of a Configurable Application Pipeline (CAP). The CAP architecture can be used to implement networking applications on multiprocessor-embedded FPGA devices. It allows designers to effectively utilize the hard processor cores within a pipeline. Without the proposed innovative architecture, methodology and algorithm, the designer is left with the choices of manually exploring ad-hoc solutions for mapping applications to multiprocessor-embedded FPGA devices or using conventional methods with a single processor and traditional FPGA logic. The former requires significant amount of expertise and development effort on part of the designer and the latter has been shown to be inefficient for networking applications at high throughputs. Use of the CAP architecture and the proposed methodology leads to reduced development effort, rapid design exploration and efficient utilization of multiprocessor-embedded FPGA devices in implementation of networking applications. The effectiveness of the CAPillary algorithm in generating solutions using the CAP architecture has been demonstrated by two real world applications, Network Image Compression and Network Address Translation, using the methodology to implement real networking applications. The solutions derived from using the proposed methodology use the CAP architecture. The CAP-based solutions are compared with software-only solutions and also conventional configurable computing solutions for a range of throughputs. Conventional configurable computing techniques, which employ a single processor and one or more FPGA devices, are inefficient in utilization of the devices as well as in the number of devices required to meet the given throughput goals. In comparison, the CAP solutions require low device counts and provide better utilization of the processor and FPGA devices. The results for the two real-world applications show that for high throughput goals, CAP-based solutions can result in up to 50% reduction in the number of devices compared to conventional configurable computing methods. * References [Ab94] A. L. Abbott et al., Finding lines and building pyramids with Splash 2, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1994, pp. 155-161. [Al96] W. E. Alexander, D. Reeves and C. Gloster Jr., Parallel Image Processing with the block data parallel architecture, Proceedings of the IEEE, vol. 84, no. 7, 1996, pp. 947-968. [At95] P. M. Athanas and A. L. Abbott, Real-time image processing on a custom computing platform, IEEE Computer, 28(2), Feb 1995, pp. 16-24. [Be02] P. Bellows, V. Bhaskaran, J. Flidr, T. Lehman, B. Schott and K. D. Underwood, GRIP: A Reconfigurable Architecture for Host-Based Gigabit-Rate Packet Processing, IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’02), April 2002. [Br02] G. Brebner, Single-chip Gigabit Mized-version IP Router on Virtex-II Pro, Proceedings of the 10th Annual IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’02), April 2002. [Fi98] M. Figuiredo and C. Gloster, Implementation of a probabilistic neural network for multi-spectral image classification on an FPGA based custom computing machine, Proceedings of the Vth Brazilian Symposium on Neural Networks, pp.174 -179, Dec 1998. [Fr02] R. Franklin, D. Carver and B. L. Hutchings, Assisting Network Intrusion Detection with Reconfigurable Hardware, Proceedings of the 10th Annual IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’02), April 2002. [Ho93] D. T. Hoang, Searching genetic databases on Splash 2, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Apr 1993, pp. 185-191. [Ku02] F. Kuhns et.al., Design of a High Performance Dynamically Extensible Router, Proceedings of the DARPA Active Networks Conference, DANCE 2002. [Le287] E. A. Lee and D. G. Messerschmitt, Pipeline Interleaved Programmable DSPs: Architecture IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-35, No. 9, pp. 1334-1345, September 1987. [Le387] E. A. Lee and D. G. Messerschmitt Synchronous Data Flow, Proceedings of the IEEE, Vol. 75, No. 9, pp. 1235-1243, September 1987. [Le99] D. C. Lee, S. J. Harper, P. M. Athanas and S. F. Midkiff, A Stream-based Reconfigurable Router Prototype, Proceedings of the IEEE Symposium on Field Programmable Gate Arrays for Custom Computing Machines (FCCM), April 1999. [Lo01] J. W. Lockwood, N. Naufel, J. S. Turner and D. E. Taylor, Reprogrammable Network Packet Processing on the Field Programmable Port Extender(FPX), Proceedings of FPGA 2001, ACM, February 2001. [Me02] G. Memik, S. O. Memik and W. H. Mangione-Smith, Design and Analysis of a Layer Seven Network Processor Accelerator Using Reconfigurable Logic, IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’02), April 2002. [Sl00] F. Slomka, M. Dorfel, R. Munzenberger and R. Hoffman, Hardware/Software Codesign and Rapid Prototyping of Embedded Systems, IEEE Design and Test of Computers, pp. 28-38, April-June 2000. [Sp02] T. S. Sproull, J. W. Lockwood and D.E. Taylor, Control and Configuration Software for a Reconfigurable Networking Hardware Platform, IEEE Symposium on Field Programmable Custom Computing Machines (FCCM’02), April 2002. [Ta01] D. E. Taylor, J. S. Turner and J. W. Lockwood, Dynamic Hardware Plugins (DHP): Exploiting Reconfiurable Hardware for High-Performance Programmable Routers, IEEE Openarch 2001, pp. 25-34. [Vi98] J. Villasenor and B. Hutchings, The Flexibility of Configurable Computing, IEEE Signal Processing Magazine, September 1998, pp. 67-84. [Vu96] J. Vullemin, et.al. Programmable Active Memories: Reconfigurable systems come of age, IEEE Transactions on VLSI Systems, Vol. 4, No. 1, pp.56-69, 1996.