NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.

NASA SP-504: Space Shuttle Avionics System

Section 3 Redundancy Management

The Space Shuttle Program pioneered the development of modern redundancy management techniques and concepts. Although previous space programs used backup systems, they were usually dissimilar and generally degraded in performance with respect to the prime system. The mission dynamics for the vehicles in these programs were such that active/standby operation with manual switching was adequate. Virtually all system functional assessment was performed on the ground using telemetry data. Only information required for immediate switchover decisions or other such actions was presented to the crew. The Space Shuttle system, however, presented a much different situation to the designer. The FO/FS requirement, the drive toward onboard autonomy, and the rapid reaction times which prohibited manual assessment and switching were factors that had never before been seriously considered. In addition, the avionics system was required, for the first time, to assess the performance and operational status of, and to manage the redundancy included in, nonavionics subsystems such as propulsion, environmental control, and power generation. As might be expected in such a situation, numerous design issues arose, a number of false design starts had to be overcome, and a process thought initially to be relatively simple proved to be extremely complex and troublesome. Many of these issues are discussed in other sections as part of the treatment of individual subsystems and functions. Only the more general, comprehensive topics are included here.

The initial concept for managing redundant units was simply to compare redundant data, discard any input which diverged beyond an acceptable threshold, and select the middle value if there were three good inputs (or the average if only two were available). The keyword in this sentence is "simply," for virtually nothing proved to be simple or straightforward in this process. First, measures had to be taken to ensure that the data set to be compared was time homogeneous, that each value was valid from a data bus communication aspect, and that the data were valid in the sense of a tactical air navigation (tacan) lockon. The selection process had to be capable of correctly handling four, three, two, or even single inputs, and of notifying the user modules or programs of the validity of the resultant output. The fault-detection process had to minimize the probability of false alarms while maximizing the probability of detecting a faulty signal; these two totally contradictory and conflicting requirements made the selection of the threshold of failure extremely difficult. The fault isolation and recovery logic had to be capable of identifying a faulty unit over the complete dynamic range to be experienced in the data, of accounting for any expected unique or peculiar behavior, and of using BITE when faulted down to the dual-redundancy level. Finally, the system had to accommodate transients, degrade as harmlessly as possible, and provide for crew visibility and intervention as appropriate.
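
The initial "simple" concept described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the flight software logic: the function name `select`, the use of the median as the reference for discarding divergent inputs, and the handling of four inputs (averaging the two middle values) are choices made for this sketch and are not taken from the source. Time homogeneity, bus validity checks, and BITE are deliberately left out.

```python
from statistics import median


def select(values, threshold):
    """Return (selected_value, valid_flag, surviving_inputs).

    Hypothetical sketch: 'values' holds the readings from one to four
    redundant units, assumed already time homogeneous and bus-valid.
    """
    good = list(values)

    if len(good) >= 3:
        # Discard any input that diverges from the middle of the set by
        # more than the threshold (assumed reference: the median).
        mid = median(good)
        good = [v for v in good if abs(v - mid) <= threshold]
    elif len(good) == 2 and abs(good[0] - good[1]) > threshold:
        # Two inputs disagree: comparison alone cannot tell which unit is
        # faulty, so report the average but flag the output as not valid.
        return (good[0] + good[1]) / 2, False, good

    if not good:
        return None, False, good

    if len(good) >= 3:
        selected = median(good)           # middle value (mean of two middles for four)
    elif len(good) == 2:
        selected = (good[0] + good[1]) / 2  # average of the two remaining inputs
    else:
        selected = good[0]                # single input passed straight through
    return selected, True, good


# Four inputs with one divergent unit: the outlier is dropped and the
# middle value of the remaining three is selected.
print(select([1.0, 1.1, 0.9, 9.0], threshold=0.5))
```

Even this toy version has to decide what to report when two remaining inputs disagree, which hints at why the real design became far more involved.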

It soon became apparent that each LRU, subsystem, and function would have unique redundancy management requirements and would therefore have to be treated individually. It also became apparent that, to provide the required emphasis and expertise, redundancy management would have to be treated as a function and assigned to a design group with systemwide responsibility in the area. Some of the more difficult design issues faced by this group are explored in the following paragraphs.

As indicated previously, the selection of thresholds at which to declare a device disabled proved to be a very difficult process. In an attempt to minimize false alarms, performance within 3σ of normal was established as the allowable threshold level for a parameter and √2 × 3σ as the allowable difference between compared parameters. In most cases, however, the standard deviation σ had to be derived analytically either because of insufficient test data or because the hardware test program was not structured to produce the required information. In some other cases, the system performance requirements precluded operation with an input at the 3σ level and the tolerance had to be reduced, always at the risk of increasing the false alarm rate.
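
The √2 factor follows from comparing two units: if each reading carries an independent error with standard deviation σ, their difference has standard deviation √2 × σ, so a 3σ limit on the difference becomes √2 × 3σ. The sketch below computes both thresholds and the corresponding false alarm probability. It assumes zero-mean, independent Gaussian errors with equal σ, which the source does not state; the function names and the sample value of σ are hypothetical.

```python
import math


def thresholds(sigma, k=3.0):
    """Return (single-parameter limit, pairwise-difference limit)."""
    single = k * sigma                  # k-sigma limit on one parameter
    pair = math.sqrt(2.0) * k * sigma   # same k-sigma limit applied to a difference
    return single, pair


def two_sided_tail(k):
    """P(|z| > k) for a standard normal variable (assumed error model)."""
    return math.erfc(k / math.sqrt(2.0))


single, pair = thresholds(sigma=0.1)    # sigma value is illustrative only
print(f"single limit {single:.3f}, pair limit {pair:.3f}")
print(f"false alarm probability per check at 3 sigma: {two_sided_tail(3.0):.4f}")
print(f"false alarm probability per check at 2 sigma: {two_sided_tail(2.0):.4f}")
```

Under these assumptions, tightening the tolerance from 3σ to 2σ raises the per-check false alarm probability from roughly 0.3 percent to roughly 4.6 percent, which illustrates the trade described above.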

Another task that proved difficult was mechanization of the fault isolation logic for system sensors such as rate gyros which, during the on-orbit phases, normally operated close to null. Under these conditions, a failure of a unit to the null position was equivalent to a latent failure and proved impossible to detect even with quadruple redundancy. It could subsequently result in the isolation of a functioning device, or even two functioning devices if two undetected null failures occurred.

The first remedy for this anomaly prevented the erroneous isolation but resulted in a significant increase in RCS fuel usage, caused by frequent switching between selected signals which effectively introduced noise into the flight control system. The final solution, which prevented the anomalous performance, was immensely more complex than was the original "simple" approach.
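
The latent null failure can be illustrated with a small sketch. It is not the Shuttle fault isolation logic: the pairwise miscompare count, the threshold, and the readings are all hypothetical values chosen for illustration. Two of four rate gyros are assumed to have failed to a null output while the vehicle was quiescent.

```python
def miscompare_counts(readings, threshold):
    """For each unit, count how many other units it disagrees with."""
    n = len(readings)
    return [
        sum(
            1
            for j in range(n)
            if j != i and abs(readings[i] - readings[j]) > threshold
        )
        for i in range(n)
    ]


THRESHOLD = 0.5  # deg/s, hypothetical miscompare limit

# On orbit the true rate is near null; units 2 and 3 have failed to null.
quiescent = [0.02, -0.01, 0.0, 0.0]

# Later, during a maneuver at about 5 deg/s, the failed units still read null.
maneuver = [5.02, 4.99, 0.0, 0.0]

print(miscompare_counts(quiescent, THRESHOLD))  # [0, 0, 0, 0]
print(miscompare_counts(maneuver, THRESHOLD))   # [2, 2, 2, 2]
```

While the vehicle is quiescent, no comparison trips, so the failures stay hidden. Once the rate builds, the set splits two against two and comparison alone gives no majority to indicate which pair is healthy.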

The redundancy management design process followed initially was to treat each system and function individually, tailoring the process to fit, then proceeding on to the next area. This compartmentalized approach proved inadequate in a number of areas in which the process cut across several subsystems, functions, and redundancy structures. A prime example is the RCS, which contains propellant tanks, pressurization systems, manifolds and associated electrically operated valves, and 44 thrusters used for flight control. The thrusters are divided into four groups, any two of which are sufficient to maintain vehicle control about all axes in all flight conditions. The other components (tanks, manifolds, valves, etc.) are also structured for fault tolerance. Each of the thruster groups and associated manifold valves is managed by one of the four redundant avionics strings. Layered on top of this already complex structure are the three electrical power buses, which distribute power throughout the system; the dual instrumentation system, which contains a number of the sensors that provide insight into certain aspects of system operation; and the displays and controls required for crew monitoring and management. The redundancy management logic must detect and isolate thrusters that are failed off, failed on, and leaking. Depending on the type of failure detected, the system must command appropriate manifold valves to prevent loss of propellant or any other dangerous condition.

Obviously, a compartmentalized approach to the redundancy management design would have been inadequate for this system. Even with the comprehensive approach employed by the task group in an attempt to cover all aspects of system operation, the design has been repeatedly refined and augmented as ground test and flight experience uncovered obscure, unanticipated failure modes.

Last Revised: February 03, 2010
Digital Engineering Institute
Web Grunt: Richard Katz
