|
in a special issue of Reliability Engineering & System Safety |
Scope (excerpt) This special issue is intended to provide a forum for the exchange of views about accident and incident investigation, modeling, and reporting across many different application domains, including but not limited to chemical process industry, healthcare, the aviation, rail, marine and offshore industries, and nuclear applications. |
|
Song Peng and Rajit Manohar |
Abstract This paper presents a systematic method for the design of a reconfigurable self-healing asynchronous adder. We propose a graph-based model for the design of a fault-tolerant linear array with external inputs and outputs with a minimum number of spare resources. A K-fault-tolerant asynchronous adder design is presented based on this analysis, together with the necessary support logic for dynamic self-reconfiguration. Experimental evaluations show that our method incurs both low hardware cost and small performance overhead compared to traditional approaches to fault-tolerance. |
|
Song Peng and Rajit Manohar |
Abstract This paper presents an efficient concurrent failure detection method for pipelined asynchronous circuits. We first validate permanent and transient fault modeling in clockless systems. By augmenting the rails to each data channel and adding extra logic to each circuit module, we make pipelined asynchronous circuits achieve fail-stop with respect to hard or soft errors. The experimental evaluations show this method incurs both reasonable hardware cost and low performance overhead. |
|
Christopher LaFrieda and Rajit Manohar |
Abstract This paper presents a novel circuit fault detection and isolation technique for quasi delay-insensitive asynchronous circuits. We achieve fault isolation by a combination of physical layout and circuit techniques. The asynchronous nature of quasi delay-insensitive circuits combined with layout techniques makes the design tolerant to delay faults. Circuit techniques are used to make sections of the design robust to non-delay faults. The combination of these is an asynchronous defect-tolerant circuit where a large class of faults are tolerated, and the remaining faults can be both detected easily and isolated to a small region of the design. |
|
Andrew Kostic |
Introduction Legislation has been passed in Europe to remove lead from electronics assemblies. But, in addition to component surface finishes, solder containing lead is also prohibited. Lead-free tin-based solders are still relatively new and are untested in widespread applications. Many of their characteristics have yet to be fully defined. One area in which there has been little research is in their behavior during long periods at low temperatures. This paper addresses a characteristic of high tin content lead-free solders that may not be familiar to many people, tin plague. It is the disintegration of pure tin into powder as it loses its crystalline structure at low temperatures. It is called a “plague” because it appears to spread like a disease. The phenomenon is also known as "tin disease" or "tin pest." |
|
Jaume Segura, Ali Keshavarzi†, Jerry Soden††, Charles Hawkins†††
University of the Balearic Islands International Test Conference, Paper 4.2 |
Abstract Defect-based test studies have thoroughly characterized CMOS IC hard bridge and open defects while less is known about a third class called parametric failures. These are more difficult to detect, and their presence is growing in CMOS IC nanoelectronics. The objective of this work is to present data that encompass the electronic properties of parametric failures that affect our ability to test present and future CMOS ICs. While parametric failures are widely reported, we seek to classify these failures with supporting data. Solutions to this complex test problem require that we structure and formalize their behaviors. Data indicate that multiparameter test strategies have the best match to some of the failures while good test strategies do not exist for others. |
|
Philippe David and Claude Guidal |
Abstract This paper presents the Fault Tolerant Computer System that has been developed and tested by Matra Marconi Space in the framework of the European space shuttle HERMES project. This system has been designed to cope with high safety and reliability requirements (FO/FS and less than 10^-6 for the probability of a catastrophic event induced by a system failure). The system is composed of 4 tightly synchronized computers implementing a fault masking concept based on a bit-to-bit vote. The paper presents the major requirements and the rationale that led to the actual architecture. It provides a detailed technical description of the system, addressing functional, hardware, and software aspects. It provides information about the development activities and presents the results and lessons learned. |
Frison, S.G.; Wensley, J.H. IEEE 1982 |
Abstract It is well known that in a TMR system it is not possible to guarantee correct operation in situations where correctly functioning processors can legitimately have differing results and where one faulty processor carries out a pattern of activity that confuses the two correctly functioning processors. The paper analyzes this general problem as it applies to certain design issues that arise in building a fault tolerant TMR system. The issues addressed are synchronization, the processing of analog data, and the handling of interrupts. It is shown that certain simplistic solutions that have been applied in the past fail to provide a guarantee of complete coverage of faults. Approaches are described that, while not guaranteeing absolute fault tolerance, can be shown to provide fault tolerance except against an extremely unlikely pattern of behavior of a faulty processor. |
How Burn-in can Reduce Quality and Reliability J. Jordan, M. Pecht, and J. Fink |
Abstract Burn-in is an accelerated screen, performed to precipitate defects in microelectronic parts, so that defect-related failures do not occur in the field. The data presented in this paper show that burn-in does not precipitate a significant amount of failures and has, in fact, the high potential to cause problems that are not detected during post burn-in examinations. This has serious safety implications especially in terms of military electronics and commercial avionics where burn-in is generally recommended. Further, the results suggest that the philosophy of burn-in elimination (i.e. burn-in until there exists few problems) is no longer appropriate. |
|
M. Pecht and P. Lall Proceedings 1992 Joint ASME/JSME Conference on Electronic Packaging:
Advances in Electronic Packaging |
Abstract Over the years, the process of burn-in has deteriorated into an insurance policy to check reliability or satisfy customer imposed requirements. Burn-in procedures are often conducted without any prior verification of the nature of the defects to be precipitated, the failure mechanisms active in the device, their sensitivity to steady state temperature stress, or any quantitative evidence of the improvement achieved by the process. In fact, current failure data indicates that burn-in prior to usage does not remove many failures and on the contrary may cause failures due to additional handling. This paper examines the problems in the existing burn-in methods and presents a physics of failure approach to burn-in. |
C.F. Larry Heimann American Political Science Review |
Abstract The destruction of the space shuttle Challenger was a tremendous blow to American space policy. To what extent was this loss the result of organizational factors at the National Aeronautics and Space Administration? To discuss this question analytically, we need a theory of organizational reliability and agency behavior. Martin Landau's work on redundancy and administrative performance provides a good starting point for such an effort. Expanding on Landau's work, I formulate a more comprehensive theory of organizational reliability that incorporates both type I and type II errors. These principles are then applied in a study of NASA and its administrative behavior before and after the Challenger accident. |
Joseph M. Benedetto |
Abstract |
J.E. Tomayko Journal of the British Interplanetary Society |
Abstract Computers are a key component onboard manned spacecraft. Gemini, Apollo, Skylab and the Space Shuttle all carried computer systems of increasing functionality and complexity. All the computer hardware involved in those systems was rated at 95 per cent reliability or better; yet in no case was a computer system implemented without some alternative method of performing critical functions so that crew safety was assured. How the National Aeronautics and Space Administration (NASA) gained the last five per cent of near total reliability is the story of the evolution of the concept of "backup" to the concept of "redundancy." Success of this evolution is epitomized by the Shuttle, which did what no manned spacecraft had ever done: carry men on its first test flight. The main factor in enabling NASA to take such a risk was the redundancy built into the Orbiter. |
H. Hecht Journal of Spacecraft and Rockets |
Introduction Fault-tolerance techniques for spacecraft computers have been under investigation since the early 1960's, motivated by the desire for greater reliability than can be provided in conventional (simplex) computers. Yet, to date, fault tolerance techniques have gained only modest acceptance in operational spacecraft. The obstacles have been primarily the large development cost and the onboard resources required for such a computer. The computational needs have been met either by relegating most of the data processing to the ground or by using several onboard computers with the capability of switching from one to the other by ground command. The obstacles to employment of fault-tolerant computers on spacecraft have now been reduced considerably because of advances in computer architecture and coding theory and particularly due to the tremendous progress in semiconductor technology which permits the required logic functions to be realized at much lower weight, power consumption, and cost. At the same time, there is a greater need for fault-tolerant computers due to longer mission durations, much more demanding mission objectives (in spacecraft management and in payload data processing), and the desire for greater autonomy. Thus, the results of previous research and development efforts on fault-tolerant computers for spacecraft applications are now bearing fruit. Fault-tolerant computing for general applications is today a well-established discipline served by annual symposia.1 A number of excellent survey documents are available.2-4 Various techniques that can be applied to achieve fault tolerance in a digital component are described below. This is followed by a summary of some major attempts to design or develop fault-tolerant computers for space applications. 
At present, the furthest progress along the road to hardware realization has been achieved in the Fault-Tolerant Spaceborne Computer being developed for SAMSO, and this is described in a later section in some detail, together with an outline of how the reliability of such an essential component can be demonstrated at reasonable resource expenditure. The final section of this paper delineates current developments in reliability estimation, microprocessor applications, and fault-tolerant software which may affect future work in the spacecraft computer field. |
C. Michael Holloway Conference Proceedings of the 17th International System Safety Conference |
Abstract Although differences exist between building software systems and building physical structures such as bridges and rockets, enough similarities exist that software engineers can learn lessons from failures in traditional engineering disciplines. This paper draws lessons from two well-known failures--the collapse of the Tacoma Narrows Bridge in 1940 and the destruction of the space shuttle Challenger in 1986--and applies these lessons to software system development. The following specific applications are made: (1) the verification and validation of a software system should not be based on a single method, or a single style of methods; (2) the tendency to embrace the latest fad should be overcome; and (3) the introduction of software control into safety-critical systems should be done cautiously. |
Eugene Rygwalski 1967 Annual Symposium on Reliability |
In reviewing the past failures and successes experienced on the U. S. Space Programs and extrapolating this experience to future systems, it is shown that future program reliabilities can be estimated through determination and the adaptation of complexity factor ratings. In these studies a subjective approach was utilized in developing complexity factors. While this technique is considered adequate for providing relative comparison and criteria, efforts should be expended to establish quantitative complexity factors to more accurately identify the system effectiveness tradeoff parameters. No systematic effort of utilizing failure experience from past space programs was indicated in any of the discussions with the personal contacts or in the literature items reviewed. The most probable reason for this is in the difficulty of developing the absolute complexity factors measurement between programs, in spite of the similarity between subsystems. By utilizing the past space program flight reliability history and by extrapolating this data using these complexity factors, the following conclusions can be made for long duration earth orbital space missions:
|
Frank A. Barta 1967 Annual Symposium on Reliability |
Abstract This paper presents reliability's role in influencing the design of hardware for two major Hughes Aircraft Company programs: the lunar soft-landing spacecraft, Surveyor (developed for NASA/JPL) and the communications satellites: Syncoms 1, 2, and 3, the Applications Technology Satellites (developed for NASA), Early Bird, and four Intelsat IIs (developed for Comsat). Since an overview of approximately 5 years of the programs' operation (or a combined total of more than 10 years) is covered, only a selected number of reliability items are presented. Some of the results obtained early in the programs, such as the evolution of the parts program during the various phases of design, are reviewed. The savings resulting from elimination of parts failures during system tests, Hughes' derating policy with previously unpublished derating curves for high reliability operation, and levels of parts acceptance are also reviewed. Included are management controls involving Trouble and Failure Reports, necessary steps to ensure corrective action, and methods of transmitting pertinent information to key management personnel. Operation of the consent-to-ship and consent-to-launch procedures and the review of actions taken at lower organizational levels by top-management committees are described. (Acceptance or rejection of the committees' findings determines whether or not a spacecraft is shipped or launched.) In addition, a brief status report of all operational hardware, data on hardware approaching operational readiness, and data affecting failure rates are presented. |
W.T. Sumerlin 1967 Annual Symposium on Reliability |
Abstract Certain reliability management techniques believed important to the reliability achievements of Projects Mercury, Gemini, and the F-4 Phantom II aircraft are examined for similarity to those employed in four earlier successful projects with which the author was connected. The case history method of consideration is employed. |
George S. Gordon 1967 Annual Symposium on Reliability |
Summary This paper contains a description of the "Failure Reporting, Analysis, and Correction System" as practiced by the Astro-Electronics Division of RCA and a description of the supporting analysis facilities. Several case histories are presented, demonstrating the typical depth and breadth of investigation often required in support of space programs. The system is dedicated to inhibit failure recurrence and to promote the cross breeding of corrections to the benefit of all programs. Conclusion [Very good paper, good case studies, and good principles. Still relevant. -- rk] |
| "The Effects of Space Environment on Spacecraft Reliability" John B. Singletary 1967 Annual Symposium on Reliability |
Introduction Because designers of early spacecraft were uncertain of the effects (then largely unknown) of the space environment upon the properties of materials intended for orbiting vehicles, they were forced to consider many possible adverse environmental effects upon vehicle reliability. The dearth of accurate information about these environmental effects was reflected in a conservative approach to design which led, in turn, to unexpectedly striking performances by certain early spacecraft as, for example, the long life of the Vanguard 1 satellite. During the past few years, much information on the nature of the space environment and its effects on spacecraft materials reliability has been accumulated. The effect of these new data has been to relegate to lesser importance some of the problems which initially were thought to be troublesome; at the same time new and different problem areas have arisen. In particular, the planning of programs for longer space missions (1 year or more), manned space stations, and manned and unmanned planetary missions have imposed more exacting demands upon the reliability of the properties of materials for spacecraft applications. |
The Impact of the Space Environment on Space Systems July 20, 1999 Prepared for: Aerospace Report No. TR-99(1670)-1 TR-99-1670-1.pdf
(Full report) |
Abstract We have undertaken a study to determine the impact of the space environment on space systems. Known impacts include mission outages, mission degradation and mission failure, launch delays, redesign and retest, anomaly analyses, and the ultimate cost for each of the preceding. We are attempting to quantify these impacts whenever possible. This task is made difficult because impacts are rarely formally documented. We reviewed a variety of sources for anomaly impact information. These sources include anomaly reports from the archives of the Space Sciences Department of The Aerospace Corporation, and contractor reports and published documents relating to spacecraft anomalies. The study provides a good indication of the quality and quantity of the data available. It also shows the degree to which it is possible to obtain impact information for historical anomalies. We summarize the results of the study, and emphasize those causes for which it may be possible to provide predictive information such as surface charging, internal charging, and the single-event upsets that accompany solar proton events. |
Kathryn A. Weiss, Nancy Leveson, Kristina Lundqvist, Nida Farid and Margaret Stringfellow, Software Engineering Research Laboratory, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA Presented at Space 2001 |
Abstract |
John C. Knight and Nancy G. Leveson IEEE Transactions on Software Engineering |
Abstract N-version programming has been proposed as a method of incorporating fault tolerance into software. Multiple versions of a program (i.e. N) are prepared and executed in parallel. Their outputs are collected and examined by a voter, and, if they are not identical, it is assumed that the majority is correct. This method depends for its reliability improvement on the assumption that programs that have been developed independently will fail independently. In this paper an experiment is described in which the fundamental axiom is tested. A total of twenty seven versions of a program were prepared independently from the same specification at two universities and then subjected to one million tests. The results of the tests revealed that the programs were individually extremely reliable but that the number of tests in which more than one program failed was substantially more than expected. The results of these tests are presented along with an analysis of some of the faults that were found in the programs. Background information on the programmers used is also summarized. The conclusion from this experiment is that N-version programming must be used with care and that analysis of its reliability must include the effect of dependent errors. |
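The voting step described in the abstract above can be sketched in a few lines of Python. This is only an illustration, not the voter used in the experiment: the function name `majority_vote` and the sample outputs are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of versions, else None.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not outputs:
        return None
    # most_common(1) yields the single most frequent output and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# One faulty version is outvoted by two agreeing versions.
print(majority_vote([42, 42, 17]))
# No majority: the voter cannot pick a result.
print(majority_vote([1, 2, 3]))
# Dependent errors: if the correct answer is 42 but two versions fail
# identically, the voter confidently returns the wrong value.
print(majority_vote([42, 17, 17]))
```

The last call shows why the independence assumption matters: a voter of this kind has no way to tell a correct majority from a majority of versions that share the same fault.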
John C. Knight and Nancy G. Leveson ACM Software Engineering Notes, January 1990 |
Introduction (excerpt) In July 1985, we presented a paper at the Fifteenth International Symposium on Fault-Tolerant Computing [KNI85] describing the results of an experiment that we performed examining an hypothesis about one aspect of N-version programming, i.e., the statistical independence of version failure. A longer journal paper on that research appeared in the IEEE Transactions on Software Engineering in January 1986 [KNI86]. Since our original paper appeared, some proponents of N-version programming have criticized us and our papers, making inaccurate statements about what we have done and what we have concluded. We have spoken and written to them privately attempting to explain their misunderstandings about our work. Unfortunately subsequent papers and public pronouncements by these individuals have contained the same misrepresentations. |
Susan Brilliant, John Knight, and Nancy Leveson. IEEE Trans. on Software Engineering Vol. SE-16, No. 2, February 1990 [couldn't open the .pdf file] |
Introduction More details about the actual errors found in the multiple version programs and an explanation of why they caused correlated failures. |
Susan S. Brilliant, John C. Knight, Nancy G. Leveson IEEE Trans. on Software Engineering |
Abstract We have identified a difficulty in the implementation of N-version programming. The problem, which we call the Consistent Comparison Problem, arises for applications in which decisions are based on the results of comparisons of finite-precision numbers. We show that when versions make comparisons involving the results of finite-precision calculations, it is impossible to guarantee the consistency of their results. It is therefore possible that correct versions may arrive at completely different outputs for an application that does not apparently have multiple correct solutions. If this problem is not dealt with explicitly, an N-version system may be unable to reach a consensus even when none of its component versions fail. |
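A small Python sketch can make the Consistent Comparison Problem concrete. It is an assumed example, not one from the paper: two hypothetical versions sum the same readings in different orders, both defensible implementations of the same specification, and then compare the result against the same threshold. The names `version_a`, `version_b`, `THRESHOLD`, and the sample readings are all illustrative.

```python
THRESHOLD = 0.6  # hypothetical decision threshold from a shared specification


def version_a(readings):
    """Sum the readings first to last, then compare with the threshold."""
    total = 0.0
    for r in readings:
        total += r
    return total > THRESHOLD


def version_b(readings):
    """Sum the readings last to first, then compare with the threshold."""
    total = 0.0
    for r in reversed(readings):
        total += r
    return total > THRESHOLD


readings = [0.1, 0.2, 0.3]
# Under IEEE 754 double precision the two summation orders round differently,
# so the versions land on opposite sides of the comparison.
print(version_a(readings), version_b(readings))
```

Neither version contains a fault in the usual sense, yet they disagree on the decision, so any later outputs that depend on this branch can diverge completely. Comparing within a tolerance only moves the boundary at which the disagreement occurs; it does not remove it.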
THE USE OF SELF CHECKS AND VOTING IN SOFTWARE ERROR DETECTION: AN EMPIRICAL STUDY Nancy G. Leveson1, Stephen S. Cha1, John C. Knight2, and Timothy Shimeall3 1Information & Computer Science Dept. 2Computer Science Dept. 3Computer Science Dept. IEEE Trans. on Software Engineering |
Abstract This paper presents the results of an empirical study of software error detection using self checks and n-version voting. A total of twenty-four graduate students in computer science at the University of Virginia and the University of California, Irvine, were hired as programmers. Working independently, each first prepared a set of self checks using just the requirements specification of an aerospace application, and then each added self checks to an existing implementation of that specification. The modified programs were executed to measure the error-detection performance of the checks and to compare this with error detection using simple voting among multiple versions. The goal of this study was to learn more about the effectiveness of such checks. The analysis of the checks revealed that there are great differences in the ability of individual programmers to design effective checks. We found that some checks that might have been effective failed to detect a fault because they were badly placed, and there were numerous instances of checks signaling non-existent errors. In general, specification-based checks alone were not as effective as combining them with code-based checks. Faults were detected by the self checks that had not been detected previously by voting 28 versions of the program over a million randomly-generated inputs. This appeared to result from the fact that the self checks could examine the internal state of the executing program whereas voting examines only final results of computations. If internal states had to be identical in n-version voting systems, then there would be no reason to write multiple versions. The programs were executed on 100,000 new randomly-generated input cases in order to compare error detection by self-checks and by 2-version and 3-version voting. 
Both self-checks and voting techniques found the same number of faults (18) for this input, although only 10 of these faults were in common, i.e., both found 8 faults that the other technique did not find. Furthermore, whereas the effective self checks detected all occurrences of errors caused by particular faults, n-version voting triples and pairs were only partially effective at detecting the failures caused by particular faults, due to correlated failures. Finally, checking the internal state with self-checks also resulted in finding faults that did not cause failures for the particular input cases executed. This has important implications for the use of back-to-back testing. |
Timothy Shimeall1 and Nancy Leveson2 1Computer Science Dept. 2Information & Computer Science Dept. IEEE Trans. on Software Engineering |
Abstract Reliability is an important concern in the development of software for modern systems. The authors have performed a study that compares two major approaches to the improvement of software - software fault elimination and software fault tolerance - by examination of the fault detection (and tolerance where applicable) of five techniques: run-time assertions, multi-version voting, functional testing augmented by structural testing, code reading by stepwise abstraction, and static data-flow analysis. The study focused on characterizing the sets of faults detected by the techniques and on characterizing the relationships between those sets of faults. Two categories of questions were investigated: (1) comparisons between fault-elimination and fault-tolerance techniques and (2) comparisons among various testing techniques. The results provide information useful for making decisions about the allocation of project resources, point out strengths and weaknesses of the techniques studied, and suggest directions for future research. |
An Empirical Evaluation of the MC/DC Coverage Criterion on the HETE-2 Satellite Software Arnaud Dupuy, Alcatel, France Digital Avionics Systems Conference (DASC) |
Abstract In order to be certified by the FAA, airborne software must comply with the DO-178B standard. For the unit testing of safety-critical software, this standard requires the testing process to meet a source code coverage criterion called Modified Condition/Decision Coverage. This part of the standard is controversial in the aviation community, partially because of perceived high cost and low effectiveness. Arguments have been made that the criterion is unrelated to the safety of the software and does not find errors that are not detected by functional testing. In this paper, we present the results of an empirical study that compared functional testing and functional testing augmented with test cases to satisfy MC/DC coverage. The evaluation was performed during the testing of the attitude control software for the HETE-2 (High Energy Transient Explorer) scientific satellite (since that time, the software has been modified). We found in our study that the test cases generated to satisfy the MC/DC coverage requirement detected important errors not detectable by functional testing. We also found that although MC/DC coverage testing took a considerable amount of resources (about 40% of the total testing time), it was not significantly more difficult than satisfying condition/decision coverage and it found errors that could not have been found with that lower level of structural coverage. |
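To illustrate how MC/DC differs from condition/decision coverage, the sketch below checks a test set against the unique-cause form of MC/DC for one small decision. The decision `a and (b or c)`, the function names, and the test sets are assumptions chosen for the example; they are not taken from the HETE-2 software or from DO-178B.

```python
from itertools import combinations


def decision(a, b, c):
    """Hypothetical three-condition decision used only for illustration."""
    return a and (b or c)


def mcdc_covered(tests, n_conditions=3):
    """Check unique-cause MC/DC for `decision` over a set of test vectors.

    A condition counts as covered when two tests differ in that condition
    alone and the decision outcome differs between them.
    """
    covered = set()
    for t1, t2 in combinations(tests, 2):
        diff = [i for i in range(n_conditions) if t1[i] != t2[i]]
        if len(diff) == 1 and decision(*t1) != decision(*t2):
            covered.add(diff[0])
    return covered == set(range(n_conditions))


# Every condition and the decision take both values here, which satisfies
# condition/decision coverage, but no condition is shown to act alone.
cd_only = [(True, True, True), (False, False, False)]

# Four tests (n + 1 for n = 3 conditions) that demonstrate the independent
# effect of a, b, and c in turn.
mcdc_set = [
    (True, True, False),
    (False, True, False),
    (True, False, False),
    (True, False, True),
]

print(mcdc_covered(cd_only), mcdc_covered(mcdc_set))
```

The two-test set passes the weaker criterion while leaving the effect of each individual condition unexercised, which is the kind of gap the additional MC/DC test cases are meant to close.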
J. Visser, JPL 1967 Annual Symposium on Reliability |
Summary This report documents some of the preliminary results of the electronic parts sterilization program at the Jet Propulsion Laboratory (JPL). The program is geared to reflect current NASA sterilization policy. The primary objective of the electronic part sterilization programs is to establish an approved list of sterilizable electronic parts. The major effort of the current JPL program is concerned with heat sterilization studies on representative part types from each major part category, specifically in relationship to the reliability of the devices. |
| Preventing the Forward Contamination of EuropaXL |
Commission on Physical Sciences, Mathematics, and Applications, National Research Council NOTICE
|
Last Revised: January 09, 2006
Web Grunt: Richard Katz
