# Coping with Soft Errors in Asynchronous Burst-Mode Machines

Sobeeh Almukhaizim Computer Engineering Dept. Kuwait University, Kuwait



Feng Shi & Yiorgos Makris Electrical Engineering Dept. Yale University, USA



## Sources of Soft Errors

"Solar Particles" Affect satellites; may also penetrate to Earth



"Galactic Particles" Are high-energy particles that penetrate to Earth's surface, through buildings and walls

- High-energy particles collide with silicon atoms
- Collision generates a voltage pulse at impact site

- Under certain conditions, this pulse may produce a soft error

### Frequency of Soft Errors



 Integrated circuits (synchronous & asynchronous) will require methods to tolerate / mitigate soft errors and ensure reliability

# Soft Error Tolerance & Mitigation in ASYNC

- Previous studies targeted Quasi Delay-Insensitive (QDI) circuits
- SEU-tolerant QDI circuits (W. Jang & A. Martin, ASYNC, 156-165, 2005):
  - Gate-level fine-grain duplication and double-checking



### **Asynchronous Burst-Mode Machines**



ASYNC'08

# Coping with Soft Errors in ABMMs



## TMR-based Soft Error Tolerance in ABMMs



- C-element used as majority voter
- Strikes at state-line C-elements not tolerated

4/11/2008

## **Duplication-based Soft Error Tolerance**

• Observation: 2-input C-elements are sufficient to tolerate one failing module (i.e., only one replica is needed)



Strikes at state-line C-elements still not tolerated
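The masking behavior behind this observation can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions: signals are sampled as discrete 0/1 values, the class `CElement` and the helper `merge_replicas` are hypothetical names introduced here, and the model ignores gate delays and hazards.

```python
class CElement:
    """Behavioral 2-input Muller C-element (illustrative model only).

    The output follows the inputs when they agree and holds its
    previous value when they disagree.
    """

    def __init__(self, initial=0):
        self.out = initial

    def step(self, a, b):
        if a == b:
            self.out = a
        return self.out


def merge_replicas(original, replica, initial=0):
    """Combine the sampled outputs of a module and its duplicate."""
    c = CElement(initial)
    return [c.step(a, b) for a, b in zip(original, replica)]


# Fault-free output of the original module (assumed sample values).
golden = [0, 0, 1, 1, 1, 0, 0]
# Duplicate module hit by transient flips at samples 1 and 4.
struck = [0, 1, 1, 1, 0, 0, 0]

merged = merge_replicas(golden, struck)
print(merged)
```

Because the two inputs disagree while the transient lasts, the C-element simply holds its last agreed value, so a strike on either copy alone does not reach the merged output in this model. A strike on the C-element itself is not covered, which matches the limitation noted above.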

# **Tolerating Errors on State-Line C-Elements**

Proposed Solution: cross-coupled structure of C-elements




### Example

#### **3. Insert C-elements**




## **Experimental Results**

#### Duplication-based Soft Error Tolerance

| Circuit Name  | I/S/O   | Original | Duplicate | C-elements        | Total | Overhead |
|---------------|---------|----------|-----------|-------------------|-------|----------|
| hp-ir         | 3/1/2   | 8        | 8         | 18                | 34    | 325.00%  |
| concur-mixer  | 3/2/3   | 16       | 16        | 33                | 65    | 306.25%  |
| tangram-mixer | 3/1/2   | 10       | 10        | 18                | 38    | 280.00%  |
| rf-control    | 6/3/5   | 37       | 37        | 51                | 125   | 237.84%  |
| while_concur  | 4/2/3   | 24       | 24        | 33                | 81    | 237.50%  |
| barcode       | 13/4/17 | 172      | 172       | 99                | 443   | 157.56%  |
| p2            | 8/4/16  | 192      | 192       | 96                | 480   | 150.00%  |
| p1            | 13/4/14 | 238      | 238       | 90                | 566   | 137.82%  |

Area overhead seems excessive for small circuits: the cost is inflated by the proportionately large number of C-elements relative to logic gates, and by the rather expensive C-element implementation used
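The Overhead column appears to be the added area relative to the original circuit, (Total - Original) / Original. That reading is an inference from the numbers rather than something stated on the slide, but it reproduces every row. A short sketch, where the function name `area_overhead` is hypothetical:

```python
# (circuit, original, duplicate, c_elements) copied from the table above.
ROWS = [
    ("hp-ir", 8, 8, 18),
    ("concur-mixer", 16, 16, 33),
    ("tangram-mixer", 10, 10, 18),
    ("rf-control", 37, 37, 51),
    ("while_concur", 24, 24, 33),
    ("barcode", 172, 172, 99),
    ("p2", 192, 192, 96),
    ("p1", 238, 238, 90),
]


def area_overhead(original, duplicate, c_elements):
    """Return (total, overhead in percent relative to the original)."""
    total = original + duplicate + c_elements
    overhead = 100.0 * (total - original) / original
    return total, overhead


for name, original, duplicate, c_elements in ROWS:
    total, overhead = area_overhead(original, duplicate, c_elements)
    print(f"{name:14s} total={total:4d} overhead={overhead:.2f}%")
```

The C-element term dominates for the small circuits (18 of 34 units for hp-ir) and shrinks to a minor share for the large ones (90 of 566 for p1), which is why the overhead falls as circuit size grows.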

# Coping with Soft Errors in ABMMs



# Soft Error Susceptibility Estimation

- A hazard-aware asynchronous fault simulator is needed (SPIN-SIM: F. Shi and Y. Makris, ITC, 597-606 (2004))
- Fault simulate & construct a soft error susceptibility table (sest)
- Asymmetric soft error susceptibility of gates in different levels (K. Mohanram and N. A. Touba, ITC, 893-901 (2003))
  - Enables judicious selection and replication in a partial duplicate

| SIB<sub>1</sub> | 11000 |       | · · · |       |
|------------------|-------|-------|-------|-------|
| SIB<sub>2</sub> | 01001 | 11001 | · · · | 11001 |
| ⋮                |       |       |       |       |
| SIB<sub>m</sub> | 11001 | 00010 |       | 00000 |

$$susc(G_q) = \frac{\sum_{i=1}^{m} \sum_{j=s+1}^{s+k_q} E(sest[i,j])}{m \cdot k_q}, \text{ where } s = \sum_{l=1}^{q-1} k_l$$
$$SER(ABMM) = \sum_{q=1}^{n} susc(G_q), \text{ where } n \text{ is the number of gates}$$
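A minimal sketch of how these two quantities could be computed from a susceptibility table. Several things here are assumptions made for illustration: the table is a list of m rows (one per row of sest), `k[q-1]` is the number of columns belonging to gate G_q, the error indicator E is passed in as a predicate `is_error`, and SER is taken as the sum of susc over the circuit's gates. The toy table uses plain 0/1 entries rather than the bit vectors shown on the slide.

```python
def susceptibility(sest, k, q, is_error):
    """susc(G_q) for a 1-based gate index q.

    sest     -- list of m rows, each a list of table entries
    k        -- k[q-1] is the number of columns that belong to gate G_q
    is_error -- predicate playing the role of E(.) in the formula
    """
    m = len(sest)
    s = sum(k[: q - 1])          # columns occupied by gates G_1 .. G_{q-1}
    k_q = k[q - 1]
    errors = sum(
        1
        for i in range(m)
        for j in range(s, s + k_q)   # 0-based form of j = s+1 .. s+k_q
        if is_error(sest[i][j])
    )
    return errors / (m * k_q)


def soft_error_rate(sest, k, is_error):
    """Sum of susc(G_q) over all gates listed in k."""
    return sum(susceptibility(sest, k, q, is_error) for q in range(1, len(k) + 1))


# Toy table: 2 rows, gate G_1 owns 2 columns, gate G_2 owns 1 column.
toy_sest = [
    [1, 0, 0],
    [1, 1, 0],
]
toy_k = [2, 1]

susc_g1 = susceptibility(toy_sest, toy_k, 1, bool)
susc_g2 = susceptibility(toy_sest, toy_k, 2, bool)
ser = soft_error_rate(toy_sest, toy_k, bool)
print(susc_g1, susc_g2, ser)
```

In the toy data, G_1 produces an error in three of its four table entries while G_2 produces none, so G_1 would be the first candidate for replication in a partial duplicate.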


## **Duplication of Sensitive Gates**




#### **Duplication of Complete Sensitive Logic Cones**



# **Duplication of Partial Sensitive Logic Cones**



### **Experimental Results**

#### 2-level ABMMs



Achieved tolerance is commensurate with the area overhead

• The partial logic cones mitigation method is consistently better

### **Experimental Results**

#### Multi-level ABMMs (new release by Columbia Univ.)



Multi-level implementation significantly improves the tradeoff between area overhead & achieved soft error tolerance


### Summary

#### Soft error tolerance in ABMMs

- Duplication-based solution that improves upon TMR
- Cross-coupled C-element structure for state-line protection

#### Soft error mitigation in ABMMs

- Enables exploration of the trade-off between the achieved soft error tolerance and the incurred area overhead
- Driven by soft error susceptibility estimation via hazard-aware asynchronous fault simulator (SPIN-SIM)
- Yields three progressively more powerful partial duplication options