NASA Office of Logic Design

A scientific study of the problems of digital engineering for space flight systems,
with a view to their practical solution.


Qualification by Test: An Example with Clock Skew

It is a fundamental design principle that inputs to flip-flops must be guaranteed to satisfy setup and hold times.  While propagation delay is important to this calculation, clock skew cannot be ignored.  If the skew is such that the sink flip-flop is clocked late and the data changes quickly, there will be a hold time violation.  Normally (but not always), use of a manufacturer's low-skew clocking resources makes this a non-issue.  However, many designers continue to use high-skew signals, without proper analysis, as clocks for sequentially adjacent flip-flops clocked on the same edge.  The general form of this circuit is shown in Figure 1, below:

Figure 1.  Logical view of a shift register with a gate driving a net used for clocking.  Low-skew routing
resources are not used and hold time violations may occur if the analysis is not rigorously done.

 

While that looks like an excellent circuit on the schematic, a look at the electrical implementation of that circuit clearly shows this problem.  A simplified example, showing the principles, is presented in Figure 2.  Each resistance and capacitance has different values, not normally seen or controlled by the designer.

Figure 2. Electrical model (simplified) of two sequentially adjacent
flip-flops with the clock signal on a high-skew routing resource.

 

VHDL and Verilog synthesizers are more than happy to generate these (and worse) circuits; this is not a problem related to schematic entry.  For example, the VHDL in Listing 1 synthesized into a horrible circuit, with a buffer tree inserted into the clock path.  Figure 3 shows the results of synthesis.

Listing 1

-- 32-bit serial shift register: D enters at bit 31, Q leaves from bit 0.
Library IEEE;
Use IEEE.Std_Logic_1164.All;

Entity Skew Is
   Port ( Clk : In  Std_Logic;
          D   : In  Std_Logic;
          Q   : Out Std_Logic );
End Skew;

Library IEEE;
Use IEEE.Std_Logic_1164.All;

Architecture Skew of Skew Is
   Signal ShiftReg : Std_Logic_Vector (31 DownTo 0);
Begin

   -- Every flip-flop (ShiftReg and Q) is clocked on the rising edge of Clk.
   P: Process ( Clk )
   Begin
      If Rising_Edge (Clk) Then
         Q <= ShiftReg(0);
         ShiftReg (30 DownTo 0) <= ShiftReg (31 DownTo 1);
         ShiftReg (31) <= D;
      End If;
   End Process P;

End Skew;

Figure 3.  Results of VHDL synthesis.  This circuit will most likely have an
unacceptable amount of clock skew when implemented in an FPGA.

 

Properly analyzing circuits of this class is a non-trivial exercise.  It is only semi-automated, time consuming, and error prone.  Usually a proper analysis will show that the circuit is in the gray area; it may work or it may not.  Thus, one may be tempted to qualify the circuit by test.  Further arguments offered in support of circuits of this class are that delays will "track" and that "the mathematics needs to be thrown out."  Of course, I am not making these rationalizations up!  The latter was heard just yesterday (June 7, 2002) in a key meeting supporting a launch decision, "justifying" the discarding of worst-case analysis results.

A close look at the simplified electrical model shown in Figure 2 clearly shows that there are two paths and proper operation of the circuit is determined by the difference in the path delays.  Since the real setup and hold times of flip-flops are quite small, even minor changes in the relative propagation delays of the two circuit legs can move the circuit from the operating state to a failing one.
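
The dependence on the difference in path delays can be made concrete with a small arithmetic sketch.  The function name and all delay values below are hypothetical, chosen only for illustration; they are not taken from any data sheet or from the failing circuit discussed later.

```python
def hold_slack_ns(t_clk_to_q_min, t_route_min, t_skew_max, t_hold):
    """Hold slack at the sink flip-flop, in ns; negative means a violation.

    New data reaches the sink D input (t_clk_to_q_min + t_route_min) after
    the source clock edge.  The sink is clocked t_skew_max later than the
    source and needs its old data held for a further t_hold.
    """
    return (t_clk_to_q_min + t_route_min) - (t_skew_max + t_hold)


# Hypothetical values (ns): the circuit passes with a small positive slack...
nominal = hold_slack_ns(t_clk_to_q_min=1.0, t_route_min=0.5,
                        t_skew_max=1.2, t_hold=0.1)

# ...and fails when the clock leg alone slows by 0.3 ns.
shifted = hold_slack_ns(t_clk_to_q_min=1.0, t_route_min=0.5,
                        t_skew_max=1.5, t_hold=0.1)

print("nominal passes:", nominal > 0)
print("shifted passes:", shifted > 0)
```

Only the difference between the two legs enters the result, which is why a small relative change in one leg is enough to change the outcome.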

A proper analysis showing that the delays will "track" must be very accurate and take into account all effects.  These include degradation over life, temperature effects, radiation, etc.  For life, showing that the delays will track is difficult, as there is little good data on it.  Data that we do have shows that the tracking assumption is invalid.  Figure 4 presents data from the qualification program of the RH1280.

Figure 4.  Change in propagation delay for RH1280 after 1000 hour life test, 4.5 volts, 125 °C.
Note: The binning circuit is used, a long path, 16 modules + I/O, with tP exceeding 100 ns.

What is often found when circuits with these logic structures are built is a "yield problem."  First, when antifuse-based devices are programmed, there is normally a dropout of a few percent on the programmer.  That is normal and is not a problem.  However, if a part successfully programs -- and that process includes automated verification tests -- then there should be zero dropout in the system.  I have seen cases where programs have closed failure reports since only 1 of n parts failed in the system and attributed this to "normal dropout."  These reports have been reopened and virtually every time the root cause of failure was a design error.  It is fairly common to see outputs stuck at a particular level and the part falsely declared failed.  Unless that is well-proven, this "engineering by arm-waving" should be rejected as this will often result in systems that are ticking time bombs.  See the data in Figure 4, above, for an example of this.

Now, how can one explain the yield issue?  In antifuse-based FPGAs, the resistance of a programmed antifuse is not a well-controlled variable.  After the antifuse (essentially a capacitor) is "cracked" by a relatively high voltage pulse (around 10 to 20 volts, depending on the technology), a number of programming pulses are applied until the resistance drops to within programmed limits.  These limits are programmed into the timing analysis models to support min-max analysis.  Clearly, each individual FPGA will be different from others, even those from the same wafer lot.  As such, one would expect to see a distribution of antifuse resistance.  Figure 5, below, shows exactly that.

Figure 5. A sample distribution of ONO antifuse resistance with a programming
current of 5 mA.  From "Antifuse FPGAs," J. Greene, et al.
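
A rough sketch of how a spread in antifuse resistance turns into a spread in skew from part to part is given below.  It assumes a single-pole RC model for each leg, resistances drawn uniformly between two programmed limits, and equal load capacitance on both legs.  The limits, the capacitance, the uniform distribution, and the function names are all hypothetical and do not reproduce the measured distribution in Figure 5.

```python
import math
import random


def rc_delay_ns(r_ohms, c_pf):
    """50% switching delay of a single-pole RC stage: ln(2) * R * C.

    ohm * pF = 1e-12 s, i.e. 1e-3 ns.
    """
    return math.log(2) * r_ohms * c_pf * 1e-3


def skew_samples(n_parts, r_min, r_max, c_clock_pf, c_data_pf, seed=0):
    """Clock-leg delay minus data-leg delay (ns) for n_parts simulated parts.

    Each part gets independent antifuse resistances on the two legs,
    drawn uniformly from [r_min, r_max] (an assumption, not Figure 5).
    """
    rng = random.Random(seed)
    skews = []
    for _ in range(n_parts):
        r_clock = rng.uniform(r_min, r_max)
        r_data = rng.uniform(r_min, r_max)
        skews.append(rc_delay_ns(r_clock, c_clock_pf)
                     - rc_delay_ns(r_data, c_data_pf))
    return skews


# Hypothetical limits: 200 to 800 ohms, 2 pF on each leg.
skews = skew_samples(1000, 200.0, 800.0, 2.0, 2.0)
print("most negative skew (ns):", min(skews))
print("most positive skew (ns):", max(skews))
```

Under these assumptions the skew takes either sign depending on the individual part, which is consistent with a design that works on most parts and fails on a few.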

 

Circuits of this class are extremely sensitive to small changes of conditions, a fact that has been demonstrated in the laboratory on numerous occasions.  For example, a recent failure attributed to clock skew showed the following:

Now, for those who claim that circuit parameters track, note that the difference in path delays for a change of a few hundredths of a volt is small, yet it represents the difference between success and failure.  As seen, the effects of life on propagation delay do not track; they are not even guaranteed to have the same sign.  There is a similar concern with long-term reliability as a function of radiation, which will not be uniform in space over a die.  Note that Co-60 radiation testing in ground chambers has different distribution properties than the true environment.  For temperature, one must take into account the temperature coefficients of all circuit elements in the particular delay chains: antifuses, capacitance, tPD for transistors, etc.  These components are not normally visible to the designer or analyst.

Voltage margin testing on other asynchronous circuits has shown "singularities," where failures will occur over a small range of voltages, passing both above and below that band.  This makes failure detection by environmental testing difficult.

For the same failing circuit above, Table 1 shows laboratory test results of temperature sensitivities.

Table 1

Temperature (°C)    VCC where failures are observed (V)
      23                        4.82
      25                        4.84
      28                        4.88
      32                        4.92
      46                        5.00
      55                        5.07
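
As a rough illustration of how sensitive the failure point is to temperature, the Table 1 data can be fitted with a straight line.  This is only a sketch: the linear model and the helper name least_squares are assumptions made here, not part of the original laboratory analysis.

```python
# (temperature in deg C, VCC in volts where failures are observed), from Table 1
TABLE_1 = [(23, 4.82), (25, 4.84), (28, 4.88), (32, 4.92), (46, 5.00), (55, 5.07)]


def least_squares(points):
    """Ordinary least-squares line through (x, y) points: (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(TABLE_1)
print(f"failure VCC moves about {slope * 1000:.1f} mV per deg C")
```

A slope of several millivolts per degree means that modest temperature changes move the failure point by amounts comparable to the few hundredths of a volt that separate success from failure.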

 

Conclusion

Design margins demonstrated by test on the ground cannot be used to predict reliability on orbit for this class of circuit.  For other classes of circuits, such as the change of propagation delay between two clock edges of a crystal clock oscillator, margin testing can have some value.

Showing design margin by logic simulation cannot be used to predict reliability on orbit for this class of circuit.  Most logic simulators switch models between runs -- min, typ, and max -- and are incapable of performing min-max analysis.  The simulation algorithms assume that the variable parameters track.  As seen in Figures 4 and 5, for example, showing the effects of life and antifuse resistance, this is not the case.  Real radiation environments are also a concern.  The "tracking" assumption is simply wrong and is no more than "engineering by arm waving."
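
The difference between corner-by-corner simulation and a true min-max analysis can be sketched as follows.  The (min, typ, max) delay values, the hold time, and the function names are hypothetical and were chosen only to illustrate the point; they do not describe any particular device or simulator.

```python
from itertools import product

# Hypothetical (min, typ, max) delays in ns.
DATA_PATH = (1.0, 1.5, 2.2)   # clock-to-Q plus routing to the sink D input
CLOCK_SKEW = (0.6, 0.9, 1.3)  # extra clock delay seen by the sink flip-flop
T_HOLD = 0.1                  # sink hold time


def slack(data_delay, skew):
    """Hold slack in ns; negative means a hold time violation."""
    return data_delay - skew - T_HOLD


def tracking_corner_slacks():
    """One slack per run, with both legs taken from the same corner
    (min with min, typ with typ, max with max), as if delays tracked."""
    return [slack(d, s) for d, s in zip(DATA_PATH, CLOCK_SKEW)]


def min_max_worst_slack():
    """Worst slack when each leg may independently sit anywhere in its range."""
    return min(slack(d, s) for d, s in product(DATA_PATH, CLOCK_SKEW))


print("all tracking corners pass:", all(s > 0 for s in tracking_corner_slacks()))
print("min-max analysis passes:  ", min_max_worst_slack() > 0)
```

With these numbers every same-corner run passes, while the combination of a fast data leg and a slow clock leg fails, and that is the case a simulator that assumes tracking never evaluates.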


Last Revised: July 01, 2003
Digital Engineering Institute
Web Grunt: Richard Katz