Update: March 26, 2004: Added reference 18, NASA Advisory.
Update: February 27, 2004: Added references 16 and 17.
Update: November 24, 2003: A NASA Advisory has been written based on OLD News #14 and is currently in the release cycle.
Date: November 19, 2003
This is the fourteenth in a series of OLD News articles.
Introduction
Modern microelectronics for space are rapidly progressing to ever smaller feature sizes and operating voltages. Feature sizes are 0.25 µm and below with operating voltages of 2.5 V and below. These small, low voltage transistors give the designer high performance and lower power; they also present engineers with challenges for the safe handling and operation of the devices. OLD News #11 discussed ESD and interface components. This note will focus on the challenges presented by modern FPGAs, a discussion of some recent failures both on boards and in upscreening tests, and "dos and don'ts" for the testing and application of these components.
Discussion
The reliability of modern VLSI components, based on published reliability data, exceeds the reliability of smaller and far simpler SSI, MSI, and LSI parts that were considered "hi-rel" a generation ago. A fundamental question arises: What testing, if any, should the user perform following receipt and programming? Will such testing improve system reliability by "screening out" defective devices that would escape board-level testing or will it decrease system reliability by overstressing the devices through the extra handling and testing in various test fixtures? Even though some improvement can theoretically be attained through additional tests and screens, it must be balanced against the potentially substantial decrease in reliability introduced through mishandling and faulty testing techniques and methods.
Field Reliability: In general, FPGAs in the field have been highly reliable in commercial, military, and aerospace applications. The vendors of the devices used for space applications have shipped millions of parts, providing a good quantity of real-world data. Recently, however, an aerospace contractor has reported clusters of unexplained failures which cannot be ignored. Another contractor reported a cluster of failures during "upscreening" tests. While the manufacturer's analysis concluded that these devices failed as a result of electrical overstress, the contractors dispute these claims and, regrettably, do not permit the necessary access to the test environments, data, analysis, or reports, thus excluding the possibility of an independent review.
The Reliability Objectives: In designing a testing and screening program, the driving issue is the reliability goal for a program in general and the FPGAs in particular. What increase in reliability of a device as a result of the proposed additional testing is needed and expected? Will the proposed testing and screening regime demonstrate the increased level of reliability and is the regime well designed? Surprisingly, advocates of extensive third-party testing do not have quantitative or analytical answers for these questions. The upscreening is "justified" by a "that's what we have always done" or a "we want to make it better" argument and is not based on a need for improvement or a defensible engineering analysis.
The Vendor Position: Two vendors of aerospace FPGAs, Actel and Xilinx, both state unequivocally that upscreening or any other extra testing, such as that performed on a VLSI tester or in a burn-in chamber, performed by a user or a third party voids their warranty. They base this position on the complexity of the devices, the required knowledge of the device's internal circuits and architecture, the sensitivity of the devices to inappropriate testing and/or screening, their experience gained over the years in developing test programs and methods, and analysis of third-party testing. Based on our examination of third-party facilities, techniques, and procedures, as documented below, this policy is well-justified.
Examination of Test Facilities, Procedures, and Techniques: Over the past several years, test methods, equipment, and procedures have been examined at various third-party test facilities. Not a single facility was able to test the part to flight standards; that is, had the test equipment been subject to a normal design review, it would have been rejected as unsafe for the device under test. The personnel designing the test, writing the test procedures, and running the test facilities were not able to answer fundamental questions with regard to either the devices they were testing or the application of the electrical test equipment they were using, thus representing an unknown level of risk to the safety of the flight devices.
Examples of issues found during examination of test facilities, procedures, and techniques include:
- Failure to understand and measure critical parameters. The test engineer did not pay attention to the dual-supply voltage nature of the part and only reported the VCCI currents, which are not representative of the health of the logic array. As a result, the "setup part" used to ensure safety of all fixtures had major damage that went undetected.
- The specification value for input transition time was exceeded by approximately an order of magnitude.
- Devices were not stressed appropriately for the failure mechanism that was to be accelerated, e.g., the wrong temperature was used.
- Lack of knowledge of the characteristics of device pins leading to improper test fixture design:
  - Control signals held at non-logic levels or left floating.
  - Unused inputs left floating, resulting in excess device currents (totem pole).
- Lack of knowledge of device architecture:
  - A dynamic burn-in test only exercised test circuits and not the flight circuits.
  - Device parametric failures wrongly attributed to design-specific loading on a clock network that is, in fact, fixed and cannot be altered.
- Long cables for power/ground runs with no capacitance on the device under test (DUT) card likely resulting in an overvoltage condition during the turn-on transient.
- Long cables from a signal generator to the DUT's clock pins. The test engineers were not familiar with either the concept of terminating transmission lines or the characteristics of the signal generator. Damage and out of specification performance of the clock pins were subsequently reported. The manufacturer's analysis determined the device was electrically overstressed.
- Chairs striking the test equipment -- a.k.a. "chair hits" -- caused an ESD event with an overvoltage transient on the DUT's supply pins and part "resetting."
- Pseudo-random stimulus vectors input into the device for a post-programming burn-in test had no known phase relationship to the clock, which was generated by an external signal generator.
- Improper control of test vectors for bidirectional buses, with no "break-before-make" technique employed for bus direction switching; this results in contention between the DUT and the pin drivers of the test equipment and, of course, potential overstress.
- Devices mechanically damaged when loaded onto test boards.
- Devices electrically damaged when transients from the facility's temperature chamber resulted in large spikes on the DUT's power pins.
- Post-programming burn-in test improperly designed. Pseudo-random signals drive all of the DUT's pins, including any resets, resulting in inadequate dynamic stimulus to internal device nodes.
- DUT improperly mated to socket resulting in floating inputs and excess device currents.
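The break-before-make item in the list above lends itself to a mechanical check on the test vectors before they ever reach a flight part. What follows is a minimal sketch in Python, not a description of any particular tester's software. It assumes the vector set can be reduced to one pair of flags per cycle saying whether the tester pin drivers and the DUT outputs are enabled on a shared bidirectional bus; the names `find_bus_hazards`, `tester_drives`, and `dut_drives` are hypothetical.

```python
from typing import List, Tuple

# One entry per vector cycle: (tester_drives, dut_drives).
# Both names are hypothetical flags for who is enabled onto the shared bus.
Cycle = Tuple[bool, bool]


def find_bus_hazards(cycles: List[Cycle]) -> List[str]:
    """Return a description of every cycle that risks bus contention."""
    hazards = []
    prev = (False, False)  # assume the bus starts released by both sides
    for i, (tester_drives, dut_drives) in enumerate(cycles):
        if tester_drives and dut_drives:
            # Both sides enabled at once: direct contention.
            hazards.append(f"cycle {i}: tester and DUT both driving")
        elif tester_drives != dut_drives and prev == (dut_drives, tester_drives):
            # Exactly one side drives now and the other side drove alone in
            # the previous cycle: the bus turned around with no idle cycle.
            hazards.append(f"cycle {i}: direction switched with no idle cycle")
        prev = (tester_drives, dut_drives)
    return hazards


if __name__ == "__main__":
    # Break-before-make: an idle cycle separates every change of direction.
    safe = [(True, False), (False, False), (False, True), (False, False), (True, False)]
    # No idle cycle on the turn-around, then outright contention.
    unsafe = [(True, False), (False, True), (True, True)]
    print(find_bus_hazards(safe))
    for line in find_bus_hazards(unsafe):
        print(line)
```

A check of this kind only covers what the vectors request. It says nothing about the actual enable and disable delays of the tester or the DUT, which still have to be taken from the data sheets and the fixture design.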
Conclusion and Recommendations
Several clusters of failures of FPGAs at aerospace contractors have recently been reported. While the manufacturer's analysis concluded that the failures were a result of electrical overstress, the contractors dispute this but will not permit a complete independent review of the data, analysis, test environment, and reports. The limited data available shows significant problems including an unsafe environment as well as basic design and test errors that potentially compromise the integrity of flight hardware. Modern devices should be handled carefully in a controlled environment and ESD rules should be followed religiously. Additionally, shorting plugs on power lines and installing capacitors prior to microcircuit installation can help eliminate problems.
All failures, from all phases of test, should be reported and diagnosed, with the NASA Office of Logic Design providing a resource for this purpose. This will permit full data sets and trends to be properly analyzed.
Flight electronics designs go through a series of reviews and qualification testing; hardware that is used to test modern flight microelectronics must meet the same design standards. Any handling or testing of flight microelectronics must be justified, performed to flight standards, and meet all manufacturers' specifications and well-established good engineering practices. If these conditions have not been met and the testing and/or screening have not been proven safe, then the devices should not be considered acceptable for flight. As has been seen in the examination of a number of test facilities, these standards are often violated, presenting a "clear and present danger" to the integrity of the flight hardware.
References
1. "Post Programming Burn In (PPBI) for RT54SX-S and A54SX-A Actel FPGAs," Minal Sawant, Dan Elftmann, John McCollum, Werner van den Abeelen, Solomon Wolday, and Jonathan Alexander, 2002 MAPLD International Conference, September 9-11, 2002, Laurel, MD.
2. "OLD News #11: Interface Components and ESD," May 28, 2003.
3. "How Burn-in can Reduce Quality and Reliability," J. Jordan, M. Pecht, and J. Fink, The International Journal of Microcircuits and Electronic Packaging, Vol. 20, No. 1, pp. 36-40, First Quarter, 1997.
4. "A Physics-of-Failure Approach to IC Burn-In," M. Pecht and P. Lall, Proceedings 1992 Joint ASME/JSME Conference on Electronic Packaging: Advances in Electronic Packaging, April 9-12, 1992; also 21st Joint Hybrid Microelectronics Symposium, ISHM, Cherry Hill, NJ, May 27-28, 1992.
5. "Reliability Report," Actel Corporation, Q2 CY2003, August 11, 2003.
6. "Xilinx Reliability Report," January, 2002.
7. Influence of Temperature on Microelectronics and System Reliability, Chapter 6, "A Physics-of-Failure Approach to IC Burn-In," Pradeep Lall, Michael G. Pecht, Edward B. Hakim.
8. "Actel Corporation COTS and Up-Screening Policy," Dr. Esmat Hamdy, Senior Vice President, Technology and Operations, Actel Corporation.
9. Xilinx Upscreening Policy, Joseph J. Fabula, Director, Quality Assurance.
10. "Summary of October 8, 2003 Meeting on Actel FPGA Failures," R. Katz, M. Fraeman, and J. Boldt to S. Scott (NASA) and E. Hoffman (JHU/APL).
11. "Reliability," from Advanced Design: Designing for Reliability, presented at the 2001 MAPLD International Conference, Laurel, MD, September 10, 2001.
12. "Conducting Filament of the Programmed Metal Electrode Amorphous Silicon Antifuse," R. Wong and K. Gordon, International Electron Devices Meeting, December 1993.
13. "On-State Reliability of Amorphous Silicon Antifuses," G. Zhang, Y. King, S. Eltoukhy, E. Hamdy, T. Jing, P. Yu, and C. Hu, International Electron Devices Meeting, 1995, Washington, DC, pp. 551-554.
14. "Characterization and Modeling of a Highly Reliable Metal-to-Metal Antifuse for High-Performance and High-Density Field Programmable Gate Arrays,"
15. "Time Dependent Reliability of the Programmed Metal Electrode Antifuse," R. Wong, K. Gordon, and A. Chan, International Reliability and Physics Symposium, April 1996.
16. "The First Summary Report on the Independent Review of SX-S FPGA Reliability on NASA Space Flight Missions," February 11, 2004.
17. "Brief Notes on Recent FPGA Failures," January 2004.
18. "NASA Advisory: Actel RTSX-S and SX-A Programmed Antifuses," March 26, 2004.
Notes and Additional Recommendations
Actel Reliability Summary:
- Programmed Dynamic Burn-in at Actel [Sawant 2002]
  - 632,000 device-hours of burn-in at HTOL
  - 354,000 device-hours of burn-in at LTOL
  - No antifuse-related failures reported.
- < 10 FITs for 0.25 µm, MEC FPGAs [Actel 2003]
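As a rough illustration of what zero failures over a given number of device-hours can and cannot demonstrate, the sketch below computes the standard one-sided upper bound on a constant failure rate, -ln(1 - CL) / T. It rests on several assumptions that are not in the Actel data: that the HTOL and LTOL hours may simply be summed, that the failure rate is constant, and that a 60% confidence level is wanted. The function name `zero_failure_fit_bound` and the `acceleration` parameter are hypothetical; no acceleration factor is given in this note, so it defaults to 1 and the result applies to the burn-in conditions only.

```python
import math


def zero_failure_fit_bound(device_hours: float,
                           confidence: float = 0.60,
                           acceleration: float = 1.0) -> float:
    """Upper bound, in FITs, on a constant failure rate after zero failures.

    device_hours : total test time accumulated with no failures
    confidence   : one-sided confidence level (assumed value, not from the source)
    acceleration : assumed factor relating test hours to use hours (1.0 = none)
    """
    equivalent_hours = device_hours * acceleration
    # With zero observed failures the chi-square bound reduces to -ln(1 - CL) / T.
    rate_per_hour = -math.log(1.0 - confidence) / equivalent_hours
    return rate_per_hour * 1e9  # 1 FIT = 1 failure per 1e9 device-hours


if __name__ == "__main__":
    total_hours = 632_000 + 354_000  # HTOL + LTOL hours, summed as an assumption
    print(f"{zero_failure_fit_bound(total_hours):.0f} FITs at burn-in conditions")
```

Under these assumptions the bound is on the order of several hundred FITs at the burn-in conditions. Translating it to use conditions requires an acceleration factor from the vendor's reliability report, which is why the raw hours and a quoted FIT rate should not be compared directly.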
Xilinx Reliability Summary (FITs): [Xilinx 2002]
- 0.15 µm: 12
- 0.18 µm: 22
- 0.22/0.18 µm: 5
- 0.22 µm: 26
- 0.25 µm: 6
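For readers less used to FIT figures, one FIT is one failure per 10^9 device-hours. The sketch below converts a FIT rate into an MTBF and into the probability of at least one failure over a mission, assuming a constant failure rate and independent devices. The mission parameters in the example (10 devices, 5 years) and the function names are hypothetical, chosen only to show the arithmetic.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def mtbf_hours(fits: float) -> float:
    """Mean time between failures for one device, in hours."""
    return 1e9 / fits


def mission_failure_probability(fits: float, devices: int, years: float) -> float:
    """Probability of at least one failure among `devices` parts over the mission.

    Assumes a constant failure rate and independent devices.
    """
    expected_failures = fits * 1e-9 * devices * years * HOURS_PER_YEAR
    return 1.0 - math.exp(-expected_failures)


if __name__ == "__main__":
    # Hypothetical mission: 10 devices flown for 5 years at the 26 FIT figure above.
    p = mission_failure_probability(fits=26, devices=10, years=5)
    print(f"MTBF per device: {mtbf_hours(26):.3g} hours")
    print(f"P(at least one failure): {p:.4f}")
```

For this hypothetical mission the probability comes out near one percent, which gives a sense of how small a margin any additional screening would have to improve upon without adding handling risk of its own.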
As flight electronics designers we should remember to: [Design Guidelines and Criteria]
- Use proper ESD procedures when handling these parts;
- Ensure that our designs properly handle unused inputs including global pins such as clocks;
- Ensure that there are no bus fights on bidirectional signals, as the drivers are quite strong and the busses are getting wider in our applications;
- Verify that any auxiliary pins such as TRST* are hard grounded;
- Ensure there are no significant transients on either of the power supplies and that they are well bypassed;
- Ensure there is no significant overshoot on input signals.
In my new OLD (Office of Logic Design) position, I am now making some of my informal e-mail lists semi-formal. These mailings will have pointers to technical tips that can [hopefully] proactively prevent errors from getting into flight designs or make things go faster and smoother. I have included an array of people from a number of organizations; different NASA Centers, ESA, etc., as you all may distribute to people in your own organizations and other colleagues. Please let me know if you are on this list in error or if someone should be added to it. This list is targeted towards those that either will design or review space flight digital electronics. Feel free to suggest topics for discussion and research or to contribute news items. [Note for this web-based release: to become a recipient on this mailing list, please send e-mail to: richard.b.katz@nasa.gov.]
All application notes are uploaded onto my www site. New additions are noted on the what's new page. I will give these mailings from time to time; too much and they will be filtered and ignored - too little and not enough information flows. So I'll try and hit a good balance.
Best regards,
-- rk
Home - NASA Office of Logic Design
Last Revised: March 26, 2004
Digital Engineering Institute
Web Grunt: Richard Katz
