FEASIBILITY OF FLOATING-POINT ARITHMETIC IN RECONFIGURABLE COMPUTING SYSTEMS

Ibrahim Sahin, Clay Gloster, and Christopher Doss
ECE Department
NC State University

1. Abstract

Reconfigurable Computing (RC) has emerged as a viable computing solution for computationally intensive applications. Applications mapped to RC systems include image processing algorithms, pattern recognition in high energy physics, and genetic optimization algorithms. Due to the hardware complexity of floating point modules and the limited resources available in prior RC systems, applications that required floating point operations were either not mapped to RC systems or converted to fixed point before the RC implementation was developed. Recent advances in Field Programmable Gate Array (FPGA) technology offer the user more hardware resources on a single FPGA device and thus greater potential to develop complex RC systems. In this paper, the feasibility of mapping applications containing floating point operations to RC systems is presented. Three floating point modules (vector addition, subtraction, and multiplication) were modeled using VHDL and mapped to a Xilinx XC4044XL FPGA device. These modules are highly pipelined and optimized for both speed and area. Our results verify that floating point applications are feasible and that significant speedup can be obtained when mapping these applications to RC systems.

2. Introduction

Adaptive computing, also known as reconfigurable computing (RC), refers to hardware/software data processing platforms that include a general purpose processor and one or more FPGA devices. These RC systems combine the flexibility of general purpose processors with the speed of application specific processors. In a typical reconfigurable computer, computationally intensive portions of algorithms are executed on FPGA devices for enhanced performance.
A well designed and utilized adaptive computer could yield 10X to 1000X improvement in execution time over conventional general purpose processor based "software only" computers.

Several applications have been mapped to reconfigurable computers to demonstrate the viability of RC systems. Applications mapped to these systems include image processing algorithms, genetic and VLSI optimization algorithms, and multimedia applications. In most cases, the reconfigurable computing system provided the smallest published execution time for these applications.

The hardware resources available on the FPGAs used in RC are limited; therefore, not all applications can be efficiently mapped to these systems. This is especially true of applications in which floating point arithmetic operations are needed, due to the large amount of resources required by floating point units. As a result, application developers either avoid implementing these applications in RC systems, or convert the floating point operations to fixed point operations in order to reduce the amount of hardware resources required.

Recent advances in FPGA technology have opened new doors for system developers. The size and clock speed of FPGA devices have increased significantly. With today's technology, 3 million logic gates can be implemented on a single FPGA device and clocked at speeds greater than 300 MHz. These improvements give us the opportunity to implement more complex applications, including those that require floating point arithmetic.

In a recent study, we implemented several floating point modules in VHDL and mapped them to a Xilinx XC4044XL FPGA to perform IEEE Standard 754 floating point addition, subtraction, and multiplication. We used the Annapolis Micro Systems Wildforce reconfigurable computing system to test the modules. Results demonstrate that floating point applications can be implemented on RC systems, with significant speedup over a general purpose processor implementation.
These systems are also easier to debug since conversions between floating point and integer formats are not required.

3. Floating Point Modules

All of our modules are designed to work on two equal length vectors. Each module is able to handle instructions to process one or more vector pairs. Each instruction corresponds to a single floating point vector operation. A standard instruction includes three addresses. The first address is the starting address of the input vector pair. Elements of the vector pair are interleaved, that is, stored in consecutive memory locations as follows: A0, B0, A1, B1, A2, B2, ... AN, BN. The second address is the starting address of the output vector, and the third address marks the end of the input vector data.

The modules were tested on the Annapolis Micro Systems Wildforce board. This board includes five Xilinx XC4044XL FPGA devices, or Processing Elements (PEs). Each PE has its own dual ported local memory (1M Byte). Both the host computer and the PE have read and write access to the local memory. We partitioned the PE memory into two sections, instruction and data. The instruction section always starts from the address $00000 and ends with the HALT instruction ($FFFFF). The remaining memory that is not used for instructions is used for data.

Once a module configuration has been loaded into a PE and the local memory has been initialized, the module waits for the reset signal to be asserted. When this occurs, the module reads the first instruction from memory location $00000 and begins executing it. When the current instruction has completed, the module reads the next instruction from the instruction memory. This process continues until the module reads a HALT instruction ($FFFFF) from the instruction memory. When this value is read, the module sends an interrupt signal to the host computer and stops processing. Modules are able to produce one result every 3 clock cycles.

4. Design Structure of the Modules

The modules that we developed in this study are of a fixed structure. Each module has a standard controller and a standard data path containing a unique core unit for each floating point operation. All modules have the same latency and can be clocked at the maximum frequency allowed by the RC system board we used.

The data path includes several registers to hold both instructions and data. The most important component of the data path is the floating point arithmetic core. For each module, a unique core is instantiated in the data path. The vector addition and subtraction units share the same core, while the vector multiplier uses a different core. To facilitate future automation of core synthesis, a standard interface was used for all cores. That is, all cores have the same input and output terminals, with equal pipeline depths. All floating point operations are divided into 9 pipeline stages in order to obtain 100% utilization of the memory bus.

One unique feature of the cores is the standard interface control signals. Each core has two inputs and one output for control and interfacing. When each data value is read from the memory, the controller asserts the INPUT_DATA_READY signal corresponding to the core input that has valid data. When both inputs to the core have valid data, the core begins the floating point operation. When the core is finished processing data, it asserts the OUTPUT_DATA_READY signal. The controller then stores the result in memory.

Use of the standard interface control signals serves two purposes. The main purpose is to reduce controller complexity and to increase controller flexibility. Hence, a single controller can handle future cores with arbitrary latencies. The other purpose is to facilitate the incorporation of complex cores into the system. The use of the standard interface control signals makes it easy to form larger cores by simply linking existing cores together.
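
The handshake described above can be illustrated with a small behavioral model. This is only a sketch in Python, not the VHDL used in the study: the names Core, load_a, load_b, clock, and run_vector_op are hypothetical, the memory bus is not modeled, and one input word is assumed to arrive per clock. It is meant to show why a controller that only watches the ready signals does not need to know the latency of the core.

```python
from collections import deque


class Core:
    """Behavioral model of a pipelined core with the standard interface.

    Hypothetical sketch: `latency` is the number of clocks between issuing
    an operation and OUTPUT_DATA_READY being asserted for its result.
    """

    def __init__(self, op, latency):
        self.op = op
        self.pipe = deque([None] * latency)  # None marks an empty stage
        self.a = None
        self.b = None
        self.output_data_ready = False
        self.output = None

    def load_a(self, value):
        """Controller asserts INPUT_DATA_READY for input A."""
        self.a = value

    def load_b(self, value):
        """Controller asserts INPUT_DATA_READY for input B."""
        self.b = value

    def clock(self):
        """Advance one clock; issue an operation only when both inputs are valid."""
        issued = None
        if self.a is not None and self.b is not None:
            issued = self.op(self.a, self.b)
            self.a = self.b = None
        self.pipe.append(issued)
        done = self.pipe.popleft()
        self.output_data_ready = done is not None
        self.output = done


def run_vector_op(core, interleaved):
    """Controller: feed A0, B0, A1, B1, ... and store results when ready."""
    words = deque(interleaved)
    expected = len(interleaved) // 2
    results = []
    next_is_a = True
    while len(results) < expected:
        if words:
            word = words.popleft()
            if next_is_a:
                core.load_a(word)
            else:
                core.load_b(word)
            next_is_a = not next_is_a
        core.clock()
        if core.output_data_ready:  # no knowledge of core latency needed
            results.append(core.output)
    return results


# The same controller drives cores with different (assumed) latencies.
adder = Core(lambda a, b: a + b, latency=9)
multiplier = Core(lambda a, b: a * b, latency=3)
print(run_vector_op(adder, [1.0, 2.0, 3.0, 4.0]))       # [3.0, 7.0]
print(run_vector_op(multiplier, [1.0, 2.0, 3.0, 4.0]))  # [2.0, 12.0]
```

Because run_vector_op reacts only to output_data_ready, changing the latency argument alters when results appear but not the controller logic, which mirrors the flexibility argument made in the text.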
A single controller is used for all vector operations presented. It does not control each stage of the core. Instead, it uses the interface signals to signal the core that the input data is ready. It also uses the OUTPUT_DATA_READY signal produced by the core to determine when the result is ready. This simplification in the controller saves control states, logic gates, and future application development time.

5. Experimental Results

5.1 Device utilization

Table 1 shows the resulting device utilization and maximum clock speed for each module. These values were collected after module placement and routing had been completed for the Xilinx XC4044XL FPGA device.

+----------------------+-----------+--------+-----------------+
| Module Name          | CLB Util. | % Util | Max Speed (MHz) |
+----------------------+-----------+--------+-----------------+
| FP Adder/Sub         |    451    |   28   |      29.70      |
|   Data Path Only     |    378    |   24   |      42.00      |
|   Add/Sub Core Only  |    316    |   20   |      42.90      |
| FP Multiplier        |    969    |   60   |      27.35      |
|   Data Path Only     |    893    |   56   |      33.00      |
|   Mult. Core Only    |    834    |   53   |      34.50      |
| Controller           |  73..76   |  ~4.6  |      --.--      |
+----------------------+-----------+--------+-----------------+
Table 1: Device utilization and maximum clock speeds.

The adder and the subtractor modules use 28% of an FPGA device. This means that three adder/subtractor modules can fit into one FPGA device. On the other hand, since the adder/subtractor core itself takes only 20% of the device, five cores can fit into one FPGA device. Since the board that we are using has 5 FPGA devices, a total of 25 adder/subtractor cores can be utilized on the board. The complete multiplier module requires 60% of an FPGA device, with the core requiring 53% of an FPGA device. Only one multiplier module can fit into one FPGA. Therefore, a total of five multipliers can be utilized on the board simultaneously.

5.2. Clock frequency and execution time

The clock frequencies shown in Table 1 are the values indicated by the design tools.
The modules were tested at these speeds and behaved as expected. We then over-clocked the modules to 50 MHz, the maximum clock speed supported by the FPGA board. All the modules worked properly at 50 MHz.

Table 2 shows the execution times of the modules, along with those of regular C++ implementations running on a Pentium II 300 MHz processor based PC. In these experiments, the modules were clocked at 50 MHz. The length of each input vector was 131,000, requiring 232,000 words of memory for storage. Hence a total of 131,000 floating point operations were performed. Since all the modules have exactly the same latency, the execution time is identical for all modules.

+------------------+--------------------+-------------------+
| Operation Type   | Exec. time on the  | On the host       |
|                  | board with modules | computer C++      |
|                  | (msec)             | impl. (msec)      |
+------------------+--------------------+-------------------+
| 1 Add/Sub Module |       10.80        |       10.05       |
| 1 Multip. Module |       10.80        |       10.36       |
+------------------+--------------------+-------------------+
Table 2: Comparison of the execution times.

+--------------+------------+
| # of modules | Exec. Time |
| used         | (msec)     |
+--------------+------------+
| 1 module     |   10.80    |
| 2 modules    |    5.40    |
| 3 modules    |    3.60    |
| 4 modules    |    2.70    |
| 5 modules    |    2.16    |
+--------------+------------+
Table 3: Execution times when multiple modules are used.

Table 3 shows the execution times when more than one module is utilized simultaneously. As the number of modules used increases, the execution time decreases. When five modules are utilized, the modules perform the same number of floating point operations almost five times faster than a Pentium II 300 MHz processor.

7. Conclusions

In this study, we investigated the feasibility of using floating point arithmetic in RC systems and presented results comparing the system to a general purpose processor. CLB utilization of the modules demonstrated that applications including floating point operations can be mapped to current RC systems.
Future RC systems will only have increased FPGA resources and hence can accommodate many more floating point units. The results indicate that floating point modules can achieve speedups of approximately a factor of 5 over a typical desktop computer when the modules are utilized in parallel. Results of this study will be used in the development of future design automation tools with the goal of facilitating RC system development while maintaining enhanced performance.
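
As a check on the speedup figure, the arithmetic behind it can be reproduced from the values in Tables 2 and 3. The sketch below is illustrative only; the function names board_time_ms and speedup are hypothetical. It relies on the observation that the Table 3 entries equal the single-module time of 10.80 msec divided by the number of modules.

```python
# Measured values taken from Tables 2 and 3 (msec).
BOARD_MS_ONE_MODULE = 10.80
HOST_MS = {"add/sub": 10.05, "multiply": 10.36}


def board_time_ms(modules):
    """Board execution time when the work is split evenly across modules."""
    return BOARD_MS_ONE_MODULE / modules


def speedup(host_ms, modules):
    """Ratio of host execution time to board execution time."""
    return host_ms / board_time_ms(modules)


for name, host_ms in HOST_MS.items():
    for modules in (1, 5):
        print(f"{name}: {modules} module(s), speedup {speedup(host_ms, modules):.2f}")
```

With these numbers, a single module is slightly slower than the host (a ratio just below 1), while five modules give ratios of roughly 4.65 for addition/subtraction and 4.80 for multiplication, which is the "almost five times" reported above.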