PROGRAMMABLE TECHNOLOGIES WEB SITE

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.

 

Efficiently Supporting Fault-Tolerance in FPGAs


John Lach
UCLA EE Department
56-125B Engineering IV
Los Angeles, CA 90095
(310) 794-1630
jlach@icsl.ucla.edu

William H. Mangione-Smith
UCLA EE Department
56-125B Engineering IV
Los Angeles, CA 90095
(310) 206-4195
billms@icsl.ucla.edu

Miodrag Potkonjak
UCLA CS Department
4532K Boelter Hall
Los Angeles, CA 90095
(310) 825-0790
miodrag@cs.ucla.edu

Abstract

While system reliability is conventionally achieved through component replication, we have developed a fault-tolerance approach for FPGA-based systems that comes at a reduced cost in terms of design time, volume, and weight. We partition the physical design into a set of tiles. In response to a component failure, we capitalize on the unique reconfiguration capabilities of FPGAs and replace the affected tile with a functionally equivalent tile that does not rely on the faulty component. Unlike fixed-structure fault-tolerance techniques for ASICs and microprocessors, this approach allows a single physical component to provide redundant backup for several types of components. Experiments conducted on a subset of the MCNC benchmarks demonstrate a high level of reliability with low timing and hardware overhead.
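As an illustration only, the tile replacement described above can be sketched in Python. This is a minimal model under stated assumptions, not the authors' implementation: each tile is assumed to have a set of precomputed, functionally equivalent configurations, each leaving a different CLB unused, and recovery amounts to picking a configuration that avoids every known-faulty CLB. The names `TileConfig`, `select_config`, and `make_single_spare_configs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """One precomputed, functionally equivalent implementation of a tile (hypothetical model)."""
    name: str
    used_clbs: frozenset  # (row, col) coordinates of the CLBs this configuration occupies


def select_config(configs, faulty_clbs):
    """Return the first configuration that uses no known-faulty CLB, or None if none exists."""
    faulty = set(faulty_clbs)
    for cfg in configs:
        if cfg.used_clbs.isdisjoint(faulty):
            return cfg
    return None


def make_single_spare_configs(origin, size):
    """Assumption: one configuration per CLB of a size x size tile, each leaving that CLB unused."""
    r0, c0 = origin
    cells = [(r0 + r, c0 + c) for r in range(size) for c in range(size)]
    return [TileConfig(f"spare_at_{cell}", frozenset(cells) - {cell}) for cell in cells]


# Usage: a 3x3 tile with one spare CLB and a fault reported at CLB (1, 2).
configs = make_single_spare_configs((0, 0), 3)
chosen = select_config(configs, [(1, 2)])
```

In this sketch a tile with a single spare CLB can absorb one fault; a second fault in the same tile leaves no usable configuration and `select_config` returns `None`. Only the affected tile is reconfigured, so the other tiles keep their placement.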

Table of Contents

    1. Abstract
      1. Keywords
    2. Introduction
      1. Motivation
      2. Motivational Example
      3. Paper Organization
    3. Preliminaries
      1. FPGA Architecture Model
      2. Fault Model, Testing and Diagnosis
    4. Related Work
    5. Approach
      1. Tiles and Atomic Fault-Tolerant Blocks
      2. Synthesis Methods
      3. Enforcing Fault-Tolerance at Run-time
    6. Reliability Calculation
      1. Independent Uniformly Distributed Faults
      2. Stapper’s Fault Model
    7. Experimental Results
    8. Future Work
    9. Conclusions
    10. Acknowledgments
    11. References

List of Figures

Figure 1. Motivational Example

Figure 2. A 6x6 CLB design partitioned into 4 3x3 tiles

Figure 3. Initial floorplan for PREP 5 benchmark

Figure 4. PREP benchmark 5 after tiling with one AFTB

Figure 5. System at runtime after swapping the AFTB in tile B due to fault at (20,3)

Figure 6. Reliability of traditional methods vs. tiled methods for a hypothetical 5000 CLB FPGA

List of Tables

Table 1. Timing bounds due to routing variation among AFTBs for each tile

Table 2. Variation of resources used among AFTBs for each tile

Table 3. Reliability of the original vs. tiled designs against CLB reliability

Table 4. Reliability of original and tiled designs using Stapper’s correlated failure model with CLB reliability of 90%/99%

Table 5. Comparison of reliability and overhead for original design with complete redundancy (i.e., 100% overhead) vs. tiled design for CLB reliability of 90% and µ = 20

Table 6. Reliability of traditional design methods vs. tiled approach against CLB reliability for large FPGAs

Conclusions

Fault tolerance has recently emerged as an important design consideration for FPGA-based systems due to the rapid progress in FPGA integration and the growing market for these devices. To address this need, we have developed the first fault-tolerance approach to work at the level of physical design. Our hierarchical fault-tolerance technique partitions designs into tiles and atomic fault-tolerant blocks (AFTBs). The approach scales systematically through an exploration of the design solution space at the physical level. It consists of four phases: design partitioning, tile partitioning and ordering, AFTB partitioning and ordering, and reliability calculation.
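To make the reliability calculation phase concrete, the following Python sketch compares an untiled design with a tiled one under independent, uniformly distributed CLB faults. It is a simplified model and not the paper's exact formulation: it assumes a tile survives whenever its number of faulty CLBs does not exceed its number of spare CLBs, and that the design survives only if every tile does. The function names and the example sizes (four 3x3 tiles with one spare each, versus 32 unprotected CLBs) are hypothetical.

```python
from math import comb


def original_reliability(p, used_clbs):
    """Untiled design: every used CLB must work (independent faults, CLB reliability p)."""
    return p ** used_clbs


def tile_reliability(p, tile_clbs, spares=1):
    """Assumed model: a tile survives if at most `spares` of its `tile_clbs` CLBs are faulty."""
    q = 1.0 - p
    return sum(
        comb(tile_clbs, k) * q ** k * p ** (tile_clbs - k)
        for k in range(spares + 1)
    )


def tiled_reliability(p, tile_clbs, num_tiles, spares=1):
    """The tiled design survives only if every tile survives."""
    return tile_reliability(p, tile_clbs, spares) ** num_tiles


# Usage: CLB reliability of 99%, 32 used CLBs untiled vs. four 9-CLB tiles with one spare each.
untiled = original_reliability(0.99, 32)
tiled = tiled_reliability(0.99, 9, 4)
```

Under these assumptions the tiled design is more reliable than the untiled one at a cost of 4 extra CLBs out of 36, which is the kind of trade-off between overhead and reliability that the tables summarize.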

Experiments conducted on a subset of the MCNC benchmarks for large CLB FPGAs indicate that the technique achieves high reliability with low hardware overhead.


Last Revised January 09, 2002
Digital Engineering Institute
Web Designer: Richard Katz