A scientific study of the problems
of digital engineering for space flight systems,
with a view to their practical solution.
Presented at the 1999 MAPLD International Conference
September, 1999
Laurel, MD
R. Katz¹, R. Barto², and K. Erickson³
¹NASA Goddard Space Flight Center, Greenbelt, MD 20771
²Spacecraft Digital Electronics, El Paso, TX 79904
³Jet Propulsion Laboratory, Pasadena, CA 91109
Abstract
Logic design errors have been observed in space flight missions and in the final stages of ground test. The technologies used by designers and their design/analysis methodologies will be analyzed to give insight into the root causes of the failures. These technologies include discrete integrated circuit based systems, systems based on field and mask programmable logic, and the use of computer aided engineering (CAE) systems. State-of-the-art (SOTA) design tools and methodologies will be analyzed with respect to high-reliability spacecraft design, and potential pitfalls will be discussed. Case studies of faults, from large expensive programs to "smaller, faster, cheaper" missions, will be used to explore the fundamental reasons for logic design problems.
Table of Contents
I. Introduction
II. Overview of the Paper
III. Low-level Hardware
A. Special Pins and Pin Terminations
1. Actel MODE Pin
2. Actel SDI, DCLK Pins
3. Actel VPP Pin
4. Xilinx MODE Pins
5. Unused Inputs
6. IEEE JTAG 1149.1 Pins
B. Clock Skew
1. Definitions and Discussion of Terms
2. Circuit Analysis with Clock Skew
3. Chip-to-Chip Applications
C. Start-up
1. FPGA Outputs
2. Power-on Reset (POR) Detector
3. Synchronous Reset
D. Metastable States
1. Introduction
2. Example - MTBF Calculation
3. Example - Inadequate Synchronizer Circuit
E. Asynchronous Circuits
1. Introduction
2. Asynchronous Clear
3. Asynchronous Decoding - Terminal Count
4. Asynchronous Decoding - Binary Counter
F. Interfacing Logic Blocks
G. Interfacing Voltage Margin
H. TMR Structure
I. Races
IV. Low-level Software
A. Clock Skew Revisited
1. Elimination of Buffers
2. CAE Software Generating a Clock Tree
B. VHDL "Interfaces"
C. High-level Design Flow
1. Case Study 1 - Memory Controller
2. Case Study 2 - Motor Controller
D. Replication and Timing Optimization
E. Robust Design and Lockup States
F. Correct by Construction
G. Delay Generation
V. High Level Principles, Technology
A. Specifications - General Principles
B. Specifications - Gate Array Operation Incorrect
C. Specifications - FPGA Updates Delays Projects
D. Reliance on Simulators - General Principles
E. Simulators - Case Study 1
F. Simulators - Case Study 2
VI. Conclusion
List of Figures
- Quad-redundant AND gate from the Orbiting Astronomical Observatory. An inverter used 8 transistors.
- Overview of a typical Actel device showing array and special registers. Holding the MODE pin low keeps the special registers in a safe, operational state.
- Close up view of a typical Actel device showing logic modules, routing resources, antifuses, and pass transistors used for test and programming operations.
- Charge pump and isolation FETs for an A1020 FPGA. The isolation FETs protect the transistors from damage during the programming operation, when high voltage is applied to the selected antifuse. The charge pump puts out voltage to bias the nFETs on during normal operation. During the startup transient, the device is not guaranteed to follow its truth table. For the case of an I/O module, pins configured as inputs or outputs can source current during the transient.
- User circuit with a "JTAG"-compliant device without the optional TRST* pin. An external clock is used as input to TCK to ensure that the TAP controller enters the TEST-LOGIC-RESET state after five TCK cycles. Logically correct, this circuit has several flaws.
- Block diagram of the control of a JTAG-compliant I/O module. The JTAG data path is a shift register under the control of the TAP controller. If the TAP controller is taken out of the TEST-LOGIC-RESET state by a heavy ion, garbage values may be applied to the control of the I/O module. This may result in changing a system-level input pin into an output pin.
- Shift register clocked by a local routing resource. The use of a high-skew clock can result in circuit failure from hold time violations.
- Long shift register with clock load driven by a clock tree. This structure in an FPGA adds clock skew from different delays to the buffers, buffer delays, and buffer loads. Buffer loads include not only the clock inputs, but the resistance of the connection elements and the amount of capacitance from a variable length routing segment.
- Circuit used for analysis of clock skew timing using local routing. A late arriving clock at FF2 can cause a hold time violation and circuit failure. A model for the circuit topology shown in Figure I would include the skew introduced by the extra buffer and the routing of the signal to each of the buffers.
- ONO antifuse resistance distribution for a 5 mA programming current. The worst-case analysis must take into account the range of antifuse resistance when doing min-max calculations. Antifuse resistance varies from fuse to fuse and does not "track" over a lot of parts.
- Amorphous silicon antifuse resistance distribution for a 0.65 µm technology.
- Attempt to eliminate clock skew problem. Careful timing analysis is needed to ensure that the added buffer will have sufficient delay to guarantee adequate hold time. Circuits using this configuration have failed because of excessive clock skew.
- Clock skew between chips. Depending on the device selected, the design, and the place and route, this circuit may not guarantee proper hold times. Some architectures use dedicated I/O flip-flops and clocks to ensure hold times are met while others achieve this by circuit design. The architecture may have a programmable delay element (Xilinx) in the data path that is enabled for sequential inputs for reliability and disabled for high-speed combinational inputs. Lastly, with some devices, parallel clocking may not be reliable and the opposite clock edge, or a different clock, should be used.
- FPGA output start-up transient interfacing to a critical system. The behavior of an FPGA during the start-up transient can be either controlled or uncontrolled, depending on the architecture of the device and specific circuit design. A transient output can cause critical events to happen, such as switching relays, firing pyrotechnic devices, etc.
- Startup characteristics of an A1020 FPGA with a power supply rise time of 20 msec (10% to 90%) after being powered off for 24 hours. The horizontal scale is 4 msec per division. Cover and Arm are outputs at 2V/Div and are driving high in this test run. VCC is scaled at 5V/Div.
- A typical Power-on reset circuit. Considerations for good circuit design include a time constant long enough to allow crystal oscillators and FPGAs to start, protection against capacitor discharge through the ESD protection diodes, fast response to voltage drops, transition time limitations of the inputs, and knowledge that some FPGAs may source current during the start-up transient. Additionally, many FPGA types will not respond to their inputs while they are being configured or are "starting."
- Synchronous reset topology. This circuit structure minimizes sensitivity to noise and glitches by avoiding the use of asynchronous inputs on the flip-flop. It will not start to assert the reset until both the FPGA and the oscillator have "started."
- Flight oscillator start time samples. The effect of power supply rise time on the start time of these Class S oscillators is easily seen.
- Summary of start time performance of a Class S flight 200 kHz oscillator as a function of power supply rise time. For this oscillator, there is a linear relationship between power supply rise time and start time.
- MTBF as a function of available slack time. System reliability is an exponential function of the settling time made available for metastable state resolution.
- Inadequate synchronizer circuit design. EVENT is asynchronous to SYSCLK resulting in possible metastable behavior. The output of the synchronizing flip-flop (FF2) may clear the latching flip-flop (FF1) before FF2 is stable, resulting in a loss of the incoming event.
- Improper use of asynchronous clears. The output of this circuit has a pulse width determined by propagation delays. Additionally, the pulse may have insufficient width to guarantee that both flip-flops are reliably cleared.
- High level view of a manufacturer supplied macro. The use of the terminal count as a clock may result in circuit malfunction since it may "glitch."
- Asynchronous decoding of a synchronous counter.
- Interfacing logic blocks powered by individual power supplies. Most CMOS devices present a low-impedance on the I/O lines when powered off. Some new devices incorporate special circuitry for this situation.
- TMR configuration without scrubbing. This register is loaded with no provision made for scrubbing SEUs either in hardware with a free running clock and feedback or by software, by reloading the register at sufficiently fast intervals.
- Example of a logic race. Signals B and C both originate synchronously from signal A, a 2 MHz clock. The two signals, however, have a race and tSU and tH can not be guaranteed to be met.
- Logic synthesizer can generate circuits with excessive clock skew. Depending on the synthesizer and its settings, the CAE software can generate logical structures that are unacceptable. This circuit fragment is part of a shift register generated from standard VHDL code.
- High-level design flow. An FPGA was designed solely using high-level design tools, without knowledge of the underlying architecture or of the radiation characteristics of different structures. The software mapped the logical design onto hardware structures that could not meet the SEU requirement.
- Hardwired flip-flop in a commercial FPGA architecture. The CAE tools chose to map flip-flops onto a hard-wired flip-flop in the Act 2 architecture for compactness and speed. This flip-flop is SEU soft.
- Figure 30B. Flip-flop made from combinational logic resources. A standard macro in the Act 2 architecture, this flip-flop construction is often selected for its radiation-tolerant characteristics. The feedback for each latch goes through antifuses and the routing network.
- SEU data summary for two members of the Act 2 family. The SEU performance of two types of flip-flops is shown, with the higher cross-section and lower LET for the "hard-wired" storage elements. Flip-flops made from combinational logic using the routing network perform substantially better. Neither of the two flip-flop structures could meet the project SEU requirements.
- Design flow using netlist translation. An Altera design was mapped to an Act 1 device using CAE software. The software created logical structures unsuitable for the space flight environment.
- Flight design, before and after netlist translation. The translation software took a single flip-flop in the Altera design and mapped it into two flip-flops in the A1020 design. This circuit, although logically correct, is unacceptable since it is used both as a synchronizer and for control of high motor currents. Differing values in each of the two flip-flops resulted in an over-current condition.
- Logic replication to improve performance. CAE software can create multiple copies of flip-flops to improve system timing. This has been observed both in synthesis and in back end software.
- Simple finite state machine. CAE software may implement logic that has lockup states.
- One-hot implementation of a simple FSM. This circuit, generated by a VHDL synthesizer, produces a state machine implementation with lockup states.
- Modified one-hot implementation. This circuit is more robust than the implementation shown in Figure S9. Results vary as a function of synthesizer, revision level, and settings. Note that this implementation uses the all 0's state as a legal state.
- Delay generation. This circuit (not a particularly good one) failed to achieve the delay intended by the designer. A recent revision of CAE software changed its default behavior and the optimizer eliminated the logically unneeded gates.
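The exponential relationship noted in the MTBF figure caption above can be illustrated with the commonly used first-order metastability model, MTBF = exp(t_slack / tau) / (T0 * f_clk * f_data). The short Python sketch below is not taken from the paper. The function name and the values chosen for tau, T0, and the two frequencies are hypothetical and serve only to show the trend; real values must come from characterization data for the specific device.

```python
import math


def synchronizer_mtbf(slack_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """Return synchronizer MTBF in seconds for a first-order model.

    MTBF = exp(slack / tau) / (T0 * f_clk * f_data)

    slack_s   : settling time available for metastable resolution (s)
    tau_s     : resolution time constant of the flip-flop (s)
    t0_s      : metastability window parameter of the flip-flop (s)
    f_clk_hz  : sampling clock frequency (Hz)
    f_data_hz : average rate of asynchronous input transitions (Hz)
    """
    if min(tau_s, t0_s, f_clk_hz, f_data_hz) <= 0:
        raise ValueError("tau, T0, and both rates must be positive")
    return math.exp(slack_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)


# Hypothetical, illustrative parameters only (not from any data sheet).
TAU = 0.5e-9    # s
T0 = 1.0e-9     # s
F_CLK = 10e6    # Hz
F_DATA = 1e6    # Hz

if __name__ == "__main__":
    for slack_ns in (5, 10, 20):
        mtbf = synchronizer_mtbf(slack_ns * 1e-9, TAU, T0, F_CLK, F_DATA)
        print(f"slack = {slack_ns:2d} ns  ->  MTBF = {mtbf:.3e} s")
```

Under this model, each additional tau of slack multiplies the MTBF by e, which is why a modest increase in settling time can change the predicted reliability by many orders of magnitude.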
Conclusions
This paper has discussed and analyzed a large number of failures in spacecraft systems. Clearly, the majority of these errors should have been caught earlier in the design cycle. Although not an exhaustive list of failures, we have included a select set to illustrate general principles and share experiences with designers of future logic systems.
It is noted that the space industry has seen large numbers of failures over the past few years resulting in losses of billions of dollars. Are these industry failures related to each other in any way? Is there a common root cause to these failures? Perhaps the failures are simply a statistical "blip." This is an open question and various committees and study groups are exploring the issue. It will be interesting to see what parallels, if any, exist in the failures of the space community in general and the logic design community in particular.
In discussing logic design, we grouped the failures into three general categories. These were high level principles and technology, low-level hardware, and low-level software. Many of these errors seem obvious; yet, they are seen frequently and should be discussed. Some errors are a result of the injection of the latest commercial technology, originally not designed for high reliability systems.
At the highest level, we saw many "Mom and apple pie" rules violated. For example, the lack of stable specifications, continuity of personnel, and detailed reviews remains. Reviews in particular are worthy of additional mention. Sometimes the detailed design review is skipped; other times, the review consists of perhaps an hour or so of discussion, with the reviewers being forced to "sight-read" the material in "real-time." This class of review is sufficient only for checking off a so-called action item that needs to be done. This can not replace a solid, detailed review. Lastly, for errors of this class, it is obvious that simulations and testing can not replace proper design and analysis. As was shown, some circuits can not be verified by test. In addition, it is difficult to cover all cases with simulations. Furthermore, the simulators and models are limited in their fidelity and often aren't a true model of the physical circuit and all of its characteristics. Understanding the limits of a simulation is critical to the proper use of this tool.
At the lower levels, both hardware and software, most of the errors have a common root cause. These errors are typically relatively simple, when properly explained. The circuits in question are not overly complex and there is little mathematics involved. Indeed, by penetrating the design, going to yet lower levels, the flaws become rather obvious. One technique for dealing with design complexity is abstraction, with the implementation of lower levels being a "solved problem." Not understanding the underlying technology is the root cause for most of the errors analyzed in this paper.
There are a number of reasons that we can postulate for these problems, which seem to be of almost epidemic proportions.
First, the logic industry tends to try to shield the engineer from the lower level technology details. This, in principle, enhances productivity and to no small extent is a function of marketing departments. For example, we see one manufacturer who claims on their world wide web site that their parts "are instantly operational on power-up." This is most definitely not the case. Another vendor marks their parts as "radiation-hardened" although they are considered radiation-soft, with the MTBF for upsets in control bits measured in hours. Synthesizer vendors promote device independence if the engineer designs his logic using an HDL and lets the synthesis tool generate the logic for him. The algorithms that the synthesis tool uses are not published or controlled. Additionally, for some newer models of FPGAs, vendors are no longer providing libraries for schematic tools; therefore engineers lose direct control of the hardware, as schematic capture is not supported or is being discontinued. As a result, one must code in an HDL and let a third-party synthesis tool generate the logic. CAE tools do not always produce good, robust circuits for high-reliability applications.
Management plays a role in today's logic design in a number of ways. First, by observation, the line manager is often neither technically active nor up to date on the technology used in flight programs. Consequently, supervisory personnel often can not provide proper detailed guidance, nor can they quickly spot flaws and point out solutions. This has also been noted in other technical fields [20]. Many managers today are not promoted for technical and leadership skills. In a welding shop, one will find that the foreman is by far the most skilled welder and can quickly decide on how to solve a particular problem; for logic design, that is not the case. The leader is more than likely to be an administrator.
Virtually all of the designs that we are aware of have been subjected to a number of design reviews such as the traditional PDR and CDR and sometimes a "peer review." Short presentations have replaced true independent detailed reviews. While this type of review can catch some errors, it is often insufficient, as the details of a design can not be adequately reviewed in a presentation lasting perhaps 15 minutes. Another factor is the training of logic designers. Engineers are frequently assigned positions not based on skill and experience. As a result, there are inadequately supervised engineers who are not aware of basic concepts such as metastable states and the SEU performance of the devices that they are using.
Engineers must probe farther and deeper into the technologies that they are using. The devices in use today have far more power than the devices used 30 or 40 years ago. While early digital electronics was constructed from simple components, that is not the case today, with device complexity orders of magnitude higher. Understanding the technologies and, just as importantly, the tools, is fundamental to the design and construction of reliable systems. The ability to penetrate the technologies and understand how they work, not just how to use them, is also a fundamental skill that needs to be developed. A reliance on readily available and seemingly powerful tools to understand and manage the lower levels of the technology will result in failure.
Lastly, there has been a lot of discussion of the effects of strategic policies such as "faster, better, cheaper" or "FBC." One can make a case that taking more time, having more thorough reviews, and running more tests will help reduce the frequency of errors. However, it is noted that large, expensive programs also have had problems. Indeed, the examples used for case studies in this paper came from the most expensive programs as well as FBC programs. Additionally, the designs came from industry, government, and academic institutions. It is felt that the problem, or the "disease," is widespread. No correlation has been found linking logic design problems to the cost of a program or the organization managing it.
Last Revised January 09, 2002
Digital Engineering Institute
Web Designer: Richard Katz