"Fault Tolerant Techniques for I/O Bound High Performance Systolic Arrays on SRAM FPGA"
Jong-Ho Byun, Ravi Karanam, Arun Ravindran, Arindam Mukherjee and Bharat Joshi
University of North Carolina at Charlotte
Field Programmable Gate Array (FPGA) based reconfigurable computing is increasingly used in data and compute intensive mission critical military and aerospace applications. Execution of signal processing, image processing and bioinformatics codes places substantial demands on FPGA resources, both in terms of logic and memory for computation and in terms of bus bandwidth for data transfer. In radar signal processing, for example, Space Time Adaptive Processing (STAP) is compute intensive, while Synthetic Aperture Radar (SAR) applications are data intensive. Image processing applications such as motion estimation are both compute and data intensive. Bioinformatics has not traditionally been considered an application domain for mission critical military and aerospace systems. However, the rapid development of technologies such as lab-on-a-chip opens the possibility of detecting biological agents on the battlefield and in extraterrestrial environments. The accompanying bioinformatics computations, such as gene and protein sequence matching, are characterized by computations on large databases.
Dependability is a central issue in mission critical systems. Failures in such systems can result in overall system damage, compromise safety, and even lead to loss of life. Commercially available SRAM based FPGAs are prone to Single Event Upsets (SEUs) due to ionizing radiation, in addition to noise induced transient faults. The continued scaling of transistors to sub-100 nm devices further exacerbates this problem. Upsets can occur both in the FPGA configuration data and in the on-chip user memory structures, causing changes in logic and data, respectively. Previous studies have investigated the use of hardware redundancy to mitigate SEUs in SRAM FPGAs. However, hardware replication techniques such as triple modular redundancy are expensive in terms of logic and memory resources and hence are not practical for the compute and data intensive systems described above.
In this paper, we present fault tolerant techniques for I/O bound high performance computing on SRAM FPGAs. Systolic arrays of processors are widely used in implementing signal processing and bioinformatics algorithms on FPGAs. We propose a resource efficient temporal triple modular redundancy scheme for fault detection. The results of the first two computing iterations are compared to detect transient faults. If a transient fault is detected, we use a checkpointing scheme to restart computation from the previously saved checkpoint. If the first two computation results match, we execute the third computation iteration on a different combination of processing elements, taking advantage of their identical nature in the systolic array. A disagreement between the results of the first two iterations and the third is an indication of a logic error due to an SEU in the SRAM configuration bits. Fault recovery is achieved by reloading the configuration bits. The ability to partially reconfigure Xilinx Virtex FPGAs is utilized to minimize the reconfiguration time. Checkpoints are also used for recovery from both data and logic errors caused by SEU-induced logic faults. The insertion of checkpoints is optimized as a trade-off between the checkpoint rollback interval and the storage requirements on the FPGA Block Random Access Memories (BRAMs). We take advantage of the long output latencies in I/O bound applications by overlapping data transfer out of the BRAMs at the end of the first computation iteration with the subsequent computation iterations used to detect faults. The proposed techniques are illustrated through systolic array implementations of a gene sequence matching algorithm and a matrix multiplication algorithm on a PCI based Virtex-II Pro FPGA. An on-chip fault generator is used to simulate fault models. Future work will include extending the proposed fault tolerant strategies to address real time computation constraints.
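The detection and recovery flow described above can be sketched in software. The following Python fragment is only an illustration of the control flow, not the authors' FPGA implementation: the names `run_with_temporal_tmr`, `SimulatedArray`, `pe_orders` and `max_retries`, the accumulator workload, and the two fault models (a one-off bit flip and a stuck-at-1 bit in one processing element) are all assumptions made for this example. Each segment is the work between two checkpoints. Two runs on the same processing element mapping expose transient faults, and a third run on a rotated mapping exposes a persistent fault, which here stands in for a configuration upset.

```python
from typing import List, Sequence, Tuple


class SimulatedArray:
    """Toy stand-in for a systolic array (hypothetical, not real hardware)."""

    def __init__(self, num_pes: int) -> None:
        self.num_pes = num_pes
        self.stuck_pe = None          # PE index with a persistent fault, if any
        self.transient_calls = set()  # call numbers hit by a one-off upset
        self.calls = 0

    def compute(self, state: int, segment: Sequence[int],
                pe_order: Sequence[int]) -> int:
        """Accumulate a segment; element i is handled by pe_order[i % len]."""
        self.calls += 1
        total = state
        for i, value in enumerate(segment):
            if pe_order[i % len(pe_order)] == self.stuck_pe:
                value |= 1            # assumed fault model: LSB stuck at 1
            total += value
        if self.calls in self.transient_calls:
            total ^= 0x10             # assumed fault model: single bit flip
        return total

    def reconfigure(self) -> None:
        """Stand-in for reloading the configuration bits."""
        self.stuck_pe = None


def run_with_temporal_tmr(array: SimulatedArray,
                          segments: Sequence[Sequence[int]],
                          pe_orders: Sequence[Sequence[int]],
                          initial_state: int = 0,
                          max_retries: int = 4) -> Tuple[int, List[Tuple[int, str]]]:
    """Run each segment three times in sequence and roll back on disagreement."""
    checkpoint = initial_state
    log: List[Tuple[int, str]] = []
    for index, segment in enumerate(segments):
        for _ in range(max_retries):
            # Iterations 1 and 2 use the same PE mapping.
            first = array.compute(checkpoint, segment, pe_orders[0])
            second = array.compute(checkpoint, segment, pe_orders[0])
            if first != second:
                log.append((index, "transient"))
                continue              # restart from the saved checkpoint
            # Iteration 3 uses a different PE mapping.
            third = array.compute(checkpoint, segment, pe_orders[1])
            if third != first:
                log.append((index, "configuration"))
                array.reconfigure()   # reload configuration, then roll back
                continue
            checkpoint = first        # all three agree: commit a new checkpoint
            break
        else:
            raise RuntimeError(f"segment {index} failed after {max_retries} tries")
    return checkpoint, log


if __name__ == "__main__":
    array = SimulatedArray(num_pes=4)
    array.stuck_pe = 0
    array.transient_calls = {1}
    result, events = run_with_temporal_tmr(
        array, [[2, 3, 4, 5], [6, 7, 8, 9]], [(0, 1, 2, 3), (1, 2, 3, 0)])
    print(result, events)
```

In this toy model the persistent fault is only caught because the rotated mapping makes the faulty element process different data. Whether a given remapping exposes a given fault depends on the workload and the fault, so the sketch should not be read as a coverage guarantee. The sketch also runs the three iterations strictly in sequence and does not model the overlap of BRAM data transfer with the later iterations.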
K. Regester, J. Byun, A. Mukherjee and A. Ravindran, “Implementing bioinformatics algorithms on Nallatech-configurable multi-FPGA systems”, XCell Journal, Issue 53, pp. 100 – 103, Second Quarter 2005.
P. Graham, M. Caffrey, J. Zimmerman, D.E. Johnson, P. Sundararajan, and C. Patterson, “Consequences and categories of SRAM FPGA configuration SEUs,” MAPLD, 2003.
P.K. Samudrala, J. Ramos, and S. Katkoori, “Selective triple modular redundancy for SEU mitigation in FPGAs,” MAPLD, 2003.
2006 MAPLD International Conference Home Page