# Analytical Derivation of the Reliability Metric for Digital Circuits

A. Bystrov and M. A. Abufalgha

Newcastle University

# 1 Summary of the method

Two traditional approaches to evaluation of digital circuit reliability are Monte Carlo simulations and physical testing of a prototype, both being quite expensive and unsuitable for circuit optimisation in the course of logic synthesis. Therefore, a new method is proposed, which is based on two levels of characterisation: the platform-level stochastic interference model and the circuit-level model for "translation" of the former model into the reliability metric of the digital circuit. The platform-level interference model is fixed for a design library and environmental conditions. For example, it may include a probability density function (PDF) of neutron energy, and a model of the current pulse in the transistor as a function of the neutron energy, transistor size, type, source-drain voltage, temperature, etc. Its purpose is to represent the interference, possibly expressed in non-electrical terms (e.g. particle energy distribution), as electrical effects (e.g. pulses of current having their magnitude, duration and arrival time stochastically described). This is done just once and is universal for every block in a SoC.

The translation model is the core idea of the method. This model converts the stochastic description of the electrical interference, e.g. the current pulse caused by a neutron strike, into the probability of error at the circuit output. This is done by finding the critical values of the interference parameters, e.g. the parameters of the above current pulse, beyond which the interference causes an error, e.g. an incorrect output value written into a flip-flop. The critical values are found by a series of analogue simulation runs on the circuit, not by the Monte Carlo method. Then, knowing the critical values of the interference parameters, it becomes possible to analytically recalculate the stochastic model of the interference into the probability of an output error or of correct operation (reliability).

This method can be combined with the analysis of performance and energy consumption of a circuit, thus contributing to the methodology of energy-modulated computing, whose major problem is provision of reliable operation under randomly modulated, i.e. unreliable, power supply. First results of the proposed method are obtained. They show that a complex tradeoff exists between energy, performance and reliability of digital circuits, and that the traditional dynamic voltage-frequency scaling can be improved by taking the reliability into account.

# 2 Circuit under test

In this paper only a simple form of a combinational circuit is considered: a long chain of inverters. It is intended to mimic a single path through an arbitrary logic circuit used as a part of a synchronous clocked automaton operating under voltage-frequency scaling. The frequency is chosen as the performance metric. It is determined for each value of the supply voltage Vdd by simulating the circuit and measuring the propagation delay, with no margins added. The circuit includes 205 identical inverters implemented with the UMC 90nm foundry design kit. All transistors are 80nm in length (standard for this library); the pull-down transistor is 400nm wide and the pull-up is 800nm wide (these values are similar to those used in a commercial standard-cell library). The transistors have the standard threshold voltage, and the standard supply voltage is Vdd = 1V. Between the inverters there are wires, whose parasitic capacitance we simulate as 2fF capacitors (typical capacitance of a short interconnect wire). In our experiments we estimate the reliability of only four inverters in this long chain, as illustrated in Figure 1, and then show that the values for all of them are very similar, while a minor difference is observed only in the last stage. Therefore, the reliability of all inverters in the path, except the last one, can be taken to be the same.



Figure 1: Circuit under test

# 3 Fault model

A neutron strike is chosen as the cause of faults in our example. The neutron penetrates silicon and may collide with an atom, thus producing secondary charged ions and, eventually, holes and electrons around the sensitive part of a transistor, known as the *error zone* [15]. The holes and electrons injected into the material around a transistor flow towards the PN junctions, recombine and create a current pulse. In the logic gate to which the transistor belongs, this current pulse appears as a voltage pulse at the gate output. This is a very crude overview of a complex physical process studied in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]. The flow of neutrons can be modelled as a stochastic energy distribution by a classical Maxwell-Boltzmann PDF and the rate [15].

After the model for the primary cause of errors has been specified, one has to "translate" it into pulses of voltage on wires. In this work the method described in [16] is used. In this method the effect of a neutron strike is modelled as a dependent current source included in a BSIM4 Spice model of a MOSFET transistor. The result of applying this method is a number of families of waveforms for the voltage at the output of an inverter, for a range of Vdd voltages and a range of values of neutron energy. The particle energy is expressed as a metric of linear energy transfer (LET) [1,11,12,13,14,15]; this is because we are interested not in the neutrons themselves, but rather in the effect of their interaction with the transistor. The pulses of voltage at the gate outputs caused by neutron strikes are called single event transients (SET), because they are temporary logic errors that resolve themselves after a short interval of time. The simulated families of SETs for different LET and Vdd values are shown in Figure 2.



Figure 2: Two families of SETs for different LET and Vdd values

Constructing a fault model is an important stage of the reliability estimation method, because this model provides the primary information which is subsequently converted into the probability of error or of its absence, i.e. the reliability. As one can see, this model is specific to the technology [15] and the stochastic description of the neutron flux. At the same time, it is not aware of the logic circuit constructed with the gates. Therefore, this is the platform-level stochastic interference model: a characterisation stage performed just once for a given technology library and radiation conditions, and not repeated for each particular design utilising the technology library. This stage is expensive, because it requires conducting physical experiments in order to determine various parameters involved in modelling the SET, as in [15].

# 4 Analytical calculation of reliability

This section describes the core of the method which does not require Monte Carlo simulations for gaining statistics on the output errors. Instead, the stochastic fault model is converted into the reliability value through the properties of a circuit.

The first objective of this stage is to find whether or not an SET (e.g. one of those shown in Figure 2) would cause an output error of the whole circuit comprising multiple gates (a long chain of inverters in our example). The second objective is to calculate the probability of error-free operation, i.e. the reliability.

An output error is defined as a *single event upset* (SEU) [1,2,17], which is the effect of an SET when the latter becomes latched in a flip-flop connected to the output of the combinational circuit carrying the SET. A difficulty here is that not all SETs result in an SEU. Some SETs disappear before the clock signal, or appear too late with respect to it. Furthermore, the magnitude of an SET may be below the threshold of the flip-flop sampling the output, its duration may be insufficient, or it may disappear while propagating through the path because individual stages exhibit *inertial delay* behaviour and suppress short pulses.

The first objective is achieved by identifying a vector of parameters of the *interference* (in this experiment it is an SET characterised by two parameters: the LET and the arrival time) and simulating the circuit in order to determine the critical values of this vector, which separate the erroneous from the error-free behaviour at the output. We repeat this for different Vdd and arrival time values in order to see how reliability changes under voltage-frequency scaling (the clock period is adjusted to the propagation delay under each Vdd value).
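The search for critical values can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which search strategy is used, so bisection is chosen here as one plausible option, and it assumes that if a given LET causes an error at some arrival time then every higher LET does too. The function `toy_run` is a made-up stand-in for an analogue simulation run, and all names and numbers in it are hypothetical.

```python
import math

def critical_let(causes_error, let_lo=0.0, let_hi=100.0, tol=0.1):
    """Bisection search for the lowest LET that causes an output error.

    causes_error(let) -> bool stands in for one analogue simulation run.
    Assumes monotonic behaviour: an error at some LET implies an error
    at every higher LET. Returns None if even let_hi causes no error.
    """
    if not causes_error(let_hi):
        return None
    if causes_error(let_lo):
        return let_lo
    while let_hi - let_lo > tol:
        mid = 0.5 * (let_lo + let_hi)
        if causes_error(mid):
            let_hi = mid   # error observed: the boundary is at or below mid
        else:
            let_lo = mid   # no error: the boundary is above mid
    return let_hi

def toy_run(let, t_a):
    """Hypothetical replacement for a Spice run (NOT the real circuit):
    an error occurs only for arrival times in [1.0, 3.0] ns and LET >= 30."""
    return 1.0 <= t_a <= 3.0 and let >= 30.0

def boundary(arrival_times, run=toy_run):
    """Critical LET for each arrival time; math.inf where no error occurs."""
    result = []
    for t_a in arrival_times:
        c = critical_let(lambda let: run(let, t_a))
        result.append(math.inf if c is None else c)
    return result

if __name__ == "__main__":
    for t_a, c in zip([0.5, 2.0, 3.5], boundary([0.5, 2.0, 3.5])):
        print(t_a, c)
```

With a tolerance of 0.1 on a 0 to 100 LET range, each arrival time needs about ten simulation runs plus the two end-point checks, which is the kind of saving over Monte Carlo sampling that the method relies on.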

In Figure 3 the critical values of the interference vector are displayed for Vdd=1V and the faulty stage number 101; the clock period, defined as the propagation delay without any margins, is 4.06ns. It is easy to adjust the results to any timing margins used in a particular design, but this is not included in this paper. For the other stages in the path the diagrams are very similar, just shifted left for the low stage numbers and right for the high numbers.
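As a quick sanity check on the quoted figures, the clock frequency and the average per-stage delay implied by the 4.06ns period and the 205-inverter chain can be worked out directly. The symbols $f_{clk}$ and $t_{stage}$ are introduced here for illustration only, and the per-stage value is a simple average that ignores any difference between individual stages.

$$f_{clk} = \frac{1}{T} = \frac{1}{4.06\,\text{ns}} \approx 246\,\text{MHz}, \qquad t_{stage} \approx \frac{T}{205} = \frac{4.06\,\text{ns}}{205} \approx 19.8\,\text{ps}.$$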

Figure 3: Critical values of the interference vector

The second objective is achieved by using the graph in Figure 3 to calculate the probability  $P_{err}$  of the system being in the error zone. For this we use the PDF function  $f_x$  for LET  $x_{LET}$  and the PDF function  $f_t$  for SET arrival time  $t_a$ ; the former known from the fault model, the latter having uniform distribution due to asynchronous nature of SET events.  $P_{err}$  in (1) is calculated for a single clock cycle.

$$P_{err} = \frac{\iint_{error\,zone} f_x(x_{LET}) \cdot f_t(t_a) \, dx_{LET} \, dt_a}{\int_{t_a=0}^{T} \int_{x_{LET}=0}^{\infty} f_x(x_{LET}) \cdot f_t(t_a) \, dx_{LET} \, dt_a}. \tag{1}$$

The integrals in (1) are computed numerically. Note that the PDF of the arrival time is constant, i.e.  $f_t(t_a) = 1/T \cdot r_{SET}$ , where $T$ is the clock period and  $r_{SET}$  is a constant representing the SET rate. Instead of the infinite integration limit for  $x_{LET}$  we choose 100, as the probability of exceeding this limit is negligible [18,19]. The PDF for LET is defined by the Maxwell–Boltzmann formula (2).

$$f_x = \sqrt{\frac{2}{\pi}} \cdot \frac{x_{LET}^2 e^{-x_{LET}^2/(2a^2)}}{a^3},\tag{2}$$

with the constant $a$ calculated as 25.06.
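A minimal numerical sketch of (1) and (2) is given below, using only the constants quoted in the text ($a = 25.06$ and the upper LET limit of 100). Several things in it are assumptions rather than the authors' procedure: the trapezoidal and midpoint rules, the representation of the error zone as "LET at or above a critical value that depends on arrival time", and the callable `critical_let`, which stands in for the boundary read from Figure 3. Because $f_t$ is constant, any constant factor in it cancels between the numerator and denominator of (1) as written, so the sketch uses $1/T$ and does not model how the SET rate scales the final figures.

```python
import math

A = 25.06        # Maxwell-Boltzmann constant a quoted in the text
LET_MAX = 100.0  # finite upper limit used instead of infinity

def f_x(x, a=A):
    """PDF of LET, equation (2)."""
    return math.sqrt(2.0 / math.pi) * x * x * math.exp(-x * x / (2.0 * a * a)) / a**3

def integrate(f, lo, hi, n=1000):
    """Composite trapezoidal rule; returns 0 for an empty interval."""
    if hi <= lo:
        return 0.0
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

def p_err(critical_let, period, n_t=200):
    """Ratio in equation (1) for the error zone x_LET >= critical_let(t_a).

    critical_let is a hypothetical callable standing in for the boundary
    in Figure 3; it may return math.inf where no LET causes an error.
    """
    f_t = 1.0 / period  # constant arrival-time density; constants cancel in the ratio
    dt = period / n_t
    numerator = 0.0
    for i in range(n_t):  # midpoint rule over the arrival time
        t_a = (i + 0.5) * dt
        lo = min(critical_let(t_a), LET_MAX)
        numerator += integrate(f_x, lo, LET_MAX) * f_t * dt
    denominator = integrate(f_x, 0.0, LET_MAX) * f_t * period
    return numerator / denominator

if __name__ == "__main__":
    T = 4.06  # clock period in ns at Vdd = 1V, from the text

    def toy_boundary(t_a):
        # Made-up boundary, NOT the data of Figure 3
        return 40.0 if t_a < T / 2 else math.inf

    print(p_err(toy_boundary, T))
```

Replacing `toy_boundary` with critical values obtained from simulation would give the per-cycle error ratio for one injection stage and one Vdd value.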

This gives the probability of error when an SET is injected in stage 101 of the path. The same procedure was repeated for the other stages, and the computed figures were the same apart from the last five stages, where the probability of error was gradually reduced towards the end, and the last stage produced a 10%-20% lower error probability (depending on Vdd). For low SET rates it is reasonable to assume that not more than a single SET can take place in the path in any particular clock cycle, which makes formula (1) applicable to the path error, with  $r_{SET}$  becoming the SET rate in the path. The reliability can be calculated as the absence of error, i.e.  $P_{reliability} = 1 - P_{err}$ .

# 5 Results

The above method was applied under a range of Vdd values, and the error probabilities were calculated and plotted in Figure 4(a). The SET rate was chosen as  $r_{SET}=20\,h^{-1}$ , i.e. 20 neutrons per hour hitting one of the inverters in the path, which is abnormally high, as such a rate is usually applied to the whole chip rather than a small circuit. It is interesting that the error probability is reduced if the SET is injected in the last stage. This is an effect of the SET expanding when propagating along the path. This expansion only happens when the SET is long, i.e. the LET causing it is high. There is no path attached to the last stage, hence no expansion, and a lower error probability as a result. This is seen in Figure 4(b), which plots the transient pulse duration for a range of Vdd values and points of SET injection.





Figure 4: Error probability and transient pulse duration vs. Vdd

Note that in these diagrams the probability of error is calculated per single clock cycle, rather than per second of operation. This metric is relevant to the completion of fixed computational tasks. If the metric is changed to the probability of error per unit of time, then the figures for low voltages will look far better; overlooking this difference is a common oversight in low-power design.
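The difference between the two metrics can be illustrated with a short sketch. It assumes that errors in different clock cycles are independent, which the text does not state, and the per-cycle probabilities and the low-voltage clock period below are placeholders chosen only to show the effect, not results from this paper; only the 4.06ns period is taken from the text.

```python
import math

def p_error_over_time(p_cycle, period_s, duration_s=1.0):
    """Probability of at least one error within duration_s.

    Assumes independent clock cycles, each failing with probability
    p_cycle, so the result is 1 - (1 - p_cycle)**(duration_s / period_s).
    log1p/expm1 keep the result accurate for very small p_cycle.
    """
    if p_cycle >= 1.0:
        return 1.0
    n_cycles = duration_s / period_s
    return -math.expm1(n_cycles * math.log1p(-p_cycle))

if __name__ == "__main__":
    # (label, per-cycle error probability, clock period in seconds)
    # The probabilities and the 40 ns period are hypothetical placeholders.
    points = [
        ("nominal Vdd", 1e-12, 4.06e-9),
        ("low Vdd", 5e-12, 40e-9),
    ]
    for label, p_cycle, period in points:
        print(label, p_cycle, p_error_over_time(p_cycle, period))
```

In this made-up example the low-voltage point is worse per cycle but better per second, simply because far fewer cycles are executed in a second at the longer clock period.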

The 3D diagram in Figure 5 depicts a three-way tradeoff between energy, reliability and performance, which is one of the main results of this paper. It shows that in the low-energy corner both the reliability and the performance drop rapidly, which results in a recommendation to avoid this corner. A similar diagram can be generated for any design without lengthy Monte Carlo simulations and used for selecting an operating point.



Figure 5: Energy-reliability-performance tradeoff

# 6 Conclusions

The two main achievements reported in this paper are a new method of analytical derivation of the reliability metric for digital circuits, and a three-way energy-reliability-performance tradeoff demonstrated by this method.

The reliability metric is derived without extremely expensive Monte Carlo simulations or physical experiments, which makes its inclusion into ECAD logic synthesis tools possible. This method is possibly a future enabler for achieving the reliability closure on a system at an early design stage, similar to how the timing closure is addressed.

The method includes two stages. At the first stage the technology library is characterised under a chosen interference model, e.g. a neutron flux with a particular energy distribution, and then "translated" into the electrical domain as an SET model. This is done just once and not repeated for each circuit in the project. At the second stage critical values for the vector of interference parameters are derived for a particular circuit under test by a limited number of simulations. The critical values are the border between the erroneous and error-free operation. Then, the probability of error or absence of it, i.e. the reliability, is calculated.

The explored three-way tradeoff extends the traditional static or dynamic voltage-frequency scaling concepts by adding the reliability metric. It will help to select the operating point for circuits. It is also an enabler for a new generation of power management which controls the reliability dynamically: power or energy reliability management, PRM or ERM.

#### References

- 1. D. A. Black, W. H. Robinson, I. Z. Wilcox, D. B. Limbrick, and J. D. Black, "Modeling of Single Event Transients With Dual Double-Exponential Current Sources: Implications for Logic Cell Characterization," IEEE Transactions on Nuclear Science, vol. 62, pp. 1540-1549, 2015.
- 2. S. Sayil, A. Shah, M. Zaman, and M. Islam, "Soft Error Mitigation using Transmission Gate with varying Gate and Body Bias," IEEE Design & Test, vol. PP, pp. 1-1, 2015.
- 3. V. Ferlet-Cavrois, L. W. Massengill, and P. Gouker, "Single Event Transients in Digital CMOS—A Review," IEEE Transactions on Nuclear Science, vol. 60, pp. 1767-1790, 2013.
- 4. Z. Bin, W. Wei-Shen, and M. Orshansky, "FASER: fast analysis of soft error susceptibility for cell-based designs," in Quality Electronic Design, 2006. ISQED '06. 7th International Symposium on, 2006, pp. 6 pp.-760.
- 5. N. Miskov-Zivanov and D. Marculescu, "MARS-C: modeling and reduction of soft errors in combinational circuits," in 2006 43rd ACM/IEEE Design Automation Conference, 2006, pp. 767-772.
- 6. R. R. Rao, K. Chopra, D. T. Blaauw, and D. M. Sylvester, "Computing the Soft Error Rate of a Combinational Logic Circuit Using Parameterized Descriptors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 468-479, 2007.
- 7. M. Nicolaidis, "Soft Errors in Modern Electronic Systems," 2011.
- 8. B. Narasimham, M. J. Gadlage, B. L. Bhuva, R. D. Schrimpf, L. W. Massengill, W. T. Holman, et al., "Characterization of Neutron- and Alpha-Particle-Induced Transients Leading to Soft Errors in 90-nm CMOS Technology," Device and Materials Reliability, IEEE Transactions on, vol. 9, pp. 325-333, 2009.
- 9. R. Liu, A. Evans, Q. Wu, Y. Li, L. Chen, S. J. Wen, et al., "Analysis of advanced circuits for SET measurement," in Reliability Physics Symposium (IRPS), 2015 IEEE International, 2015, pp. SE.7.1-SE.7.7.
- 10. M. Ebrahimi, A. Evans, M. B. Tahoori, E. Costenaro, D. Alexandrescu, V. Chandra, et al., "Comprehensive Analysis of Sequential and Combinational Soft Errors in an Embedded Processor," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, pp. 1586-1599, 2015.
- 11. D. Munteanu and J. L. Autran, "Modeling and Simulation of Single-Event Effects in Digital Devices and ICs," Nuclear Science, IEEE Transactions on, vol. 55, pp. 1854-1878, 2008.
- 12. S. M. Jahinuzzaman, "Modeling and Mitigation of Soft Errors in Nanoscale SRAMs," 2008.
- 13. L. Artola, M. Gaillardin, G. Hubert, M. Raine, and P. Paillet, "Modeling Single Event Transients in Advanced Devices and ICs," Nuclear Science, IEEE Transactions on, vol. 62, pp. 1528-1539, 2015.
- 14. G. I. Wirth, M. G. Vieira, E. H. Neto, and F. G. L. Kastensmidt, "Single Event Transients in Combinatorial Circuits," in 2005 18th Symposium on Integrated Circuits and Systems Design, 2005, pp. 121-126.
- 15. J.-L. Autran and D. Munteanu, "Soft Errors: From Particles to Circuits," 2015.
- 16. J. S. Kauppila, A. L. Sternberg, M. L. Alles, A. M. Francis, J. Holmes, O. A. Amusan, et al., "A Bias-Dependent Single-Event Compact Model Implemented Into BSIM4 and a 90 nm CMOS Process Design Kit," Nuclear Science, IEEE Transactions on, vol. 56, pp. 3152-3157, 2009.
- 17. M. Slimani and L. Naviner, "A tool for transient fault analysis in combinational circuits," in 2015 IEEE International Conference on Electronics, Circuits, and Systems (ICECS), 2015, pp. 125-128.
- 18. C. Geng, J. Liu, Z.-G. Zhang, K. Xi, S. Gu, M.-D. Hou, et al., "Modeling the applicability of linear energy transfer on single event upset occurrence," Chinese Physics C, vol. 37, p. 066001, 2013.
- 19. R. Doering and Y. Nishi, "Handbook of semiconductors manufacturing technology," 2008.