Low Overhead Fault-Tolerant FPGA Systems
John Lach, William H. Mangione-Smith, Miodrag Potkonjak
jlach@icsl.ucla.edu, billms@icsl.ucla.edu, miodrag@cs.ucla.edu
56-125B Engineering IV
University of California
Los Angeles, CA 90095
Abstract
Fault tolerance is an important system metric for many operating environments, from automotive to space exploration. The conventional technique for improving system reliability is component replication, which usually comes at significant cost: increased design time, testing, power consumption, volume, and weight. We have developed a new fault-tolerance approach that capitalizes on the unique reconfiguration capabilities of FPGAs. The physical design is partitioned into a set of tiles. In response to a component failure, the affected tile is replaced by a functionally equivalent tile that does not rely on the faulty component. Unlike ASIC and microprocessor design methods, which result in fixed structures, this technique allows a single physical component to provide redundant backup for several types of components. Experiments conducted on a subset of the MCNC benchmarks demonstrate a high level of reliability with low timing and hardware overhead.
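The tile replacement step described above can be illustrated with a minimal sketch. It assumes that each tile has a set of precompiled, functionally equivalent configurations and that each configuration records which CLBs it uses. The names `select_configuration` and `configs`, and the 2x2 tile geometry, are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, FrozenSet, Optional, Tuple

CLB = Tuple[int, int]  # (column, row) coordinate of a logic block


def select_configuration(
    configs: Dict[str, FrozenSet[CLB]], faulty: CLB
) -> Optional[str]:
    """Return the name of the first configuration that avoids the faulty CLB.

    `configs` maps a configuration name to the set of CLBs it occupies.
    Returns None if every configuration relies on the faulty CLB.
    """
    for name, used in configs.items():
        if faulty not in used:
            return name
    return None


# Hypothetical 2x2 tile: each configuration leaves a different CLB spare.
tile_clbs = [(0, 0), (0, 1), (1, 0), (1, 1)]
configs = {
    f"spare_{c}_{r}": frozenset(tile_clbs) - {(c, r)} for c, r in tile_clbs
}

# A fault at CLB (1, 0) selects the configuration that leaves (1, 0) unused.
chosen = select_configuration(configs, faulty=(1, 0))
```

Only the affected tile is reconfigured in this sketch; the rest of the design is untouched, which is the property that keeps the overhead local to a tile.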
List of Figures
Figure 1. Motivational Example
Figure 2. A 6x6 CLB design partitioned into four 3x3 tiles
Figure 3. Initial floorplan for PREP 5 benchmark
Figure 4. PREP 5 after tiling with one AFTB identified
Figure 5. System at runtime after swapping the AFTB in tile B due to fault at (20,3)
Figure 6. Reliability of traditional methods vs. tiled methods for a hypothetical 5000 CLB FPGA
List of Tables
Table 1. Timing bounds due to routing variation among AFTBs for each tile
Table 2. Variation of resources used among AFTBs for each tile
Table 3. Reliability of the original vs. tiled designs against CLB reliability
Table 4. Reliability of original and tiled designs using Stappers correlated failure model with CLB reliability of 90%/99%
Table 5. Comparison of reliability and overhead for the original design with complete redundancy (i.e., 100% overhead) vs. tiled design for CLB reliability of 90% and µ = 20
Table 6. Reliability of traditional design methods vs. tiled approach against CLB reliability for large FPGAs
Conclusions
Fault tolerance has recently emerged as an important design consideration for FPGA-based systems due to the rapid progress in FPGA integration and the growing market for these devices. To address this need, we have developed the first fault-tolerance approach to work at the level of physical design. Our hierarchical fault-tolerance technique partitions designs into tiles and atomic fault-tolerant blocks (AFTBs). The approach scales systematically through an exploration of the design solution space at the physical level. It consists of four phases: design partitioning, tile partitioning and ordering, AFTB partitioning and ordering, and reliability calculation.
Experiments conducted on a subset of the MCNC benchmarks for large CLB FPGAs indicate that the technique is effective and incurs low hardware overhead.
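The reliability calculation phase can be illustrated with a simplified sketch. It is not the paper's model: it assumes independent CLB failures with a uniform per-CLB reliability `r`, assumes that an untiled design fails if any CLB it uses fails, and assumes that a tile survives when at most one of its CLBs has failed (that is, an alternative configuration exists for every single fault). The function names are hypothetical.

```python
def original_reliability(r: float, used_clbs: int) -> float:
    """Untiled design: every used CLB must work (assumed model)."""
    return r ** used_clbs


def tile_reliability(r: float, n: int) -> float:
    """Tile of n CLBs that tolerates any single CLB failure (assumed model).

    P(no failures) + P(exactly one failure) under independent failures.
    """
    return r ** n + n * r ** (n - 1) * (1 - r)


def tiled_reliability(r: float, tiles: int, n: int) -> float:
    """Tiled design: every tile must survive independently."""
    return tile_reliability(r, n) ** tiles


# Geometry borrowed from Figure 2: a 6x6 CLB array split into four 3x3 tiles.
# Assumption: each tile keeps one spare CLB, so the design uses 32 CLBs.
r = 0.99
untiled = original_reliability(r, used_clbs=32)
tiled = tiled_reliability(r, tiles=4, n=9)
```

Under these assumptions the tiled design is more reliable than the untiled one for any `r` strictly between 0 and 1, because each tile gains the extra probability mass of the single-failure case. Correlated failure models, such as the Stappers model named in Table 4, would require a different calculation.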
Last Revised February 03, 2010
Digital Engineering Institute
Web Designer: Richard Katz