#### Fault Tolerance with

## Microarchitecture-Based Introspection

Moinuddin K. Qureshi

Onur Mutlu

Yale N. Patt

Department of Electrical and Computer Engineering The University of Texas at Austin





















Can cause a fully verified, correct design to perform incorrectly.



The rate of transient faults is expected to increase significantly.
Processors will need some form of fault tolerance.

However, different applications have different reliability requirements (e.g. server-apps v/s games). Users who do not require high reliability may not want to pay the overhead.

Fault tolerant mechanisms with low hardware cost are attractive because they allow the designs to be used for a wide variety of applications.



#### **Background on Fault Tolerance**

- Storage structures are easy to protect with parity or ECC.
- Logic structures often need redundant execution:
  - Space redundancy
  - Time redundancy
- Space redundancy has high hardware overhead.
- Time redundancy has low hardware overhead but high performance overhead.

The performance overhead of time redundancy can be reduced if redundant execution is performed during idle periods.

- A long-latency miss typically stalls instruction processing.
- A significant proportion of execution time is spent waiting for memory.



- A long-latency miss typically stalls instruction processing.
- A significant proportion of execution time is spent waiting for memory.





- A long-latency miss typically stalls instruction processing.
- A significant proportion of execution time is spent waiting for memory.



Redundant execution can leverage load values and branch prediction from primary execution.



- A long-latency miss typically stalls instruction processing.
- A significant proportion of execution time is spent waiting for memory.



#### There is enough idle time to re-execute the application!

#### **Motivation**



#### **Motivation**



# Can we design a time redundancy technique that utilizes memory waiting time for redundant execution?

The International Conference on Dependable Systems and Networks - 2005

#### Outline

#### Introduction

- Concept of Introspection
- **L** Implementation of MBI
- **•** Evaluation of MBI
- **Q** Summary









































Microarchitecture-Based Introspection (MBI)



#### Outline

#### Introduction

- Concept of Introspection
- Implementation of MBI
- **•** Evaluation of MBI
- **Q** Summary



#### **Additional Structures**



#### **Additional Structures**



Two modes: Performance and Introspection

Four cases:

- Operation in performance mode
- Switching from performance mode to introspection mode
- Operation in introspection mode
- Switching from introspection mode to performance mode













## **Switching from Performance Mode to Introspection Mode**

- Enter Introspection on:
  - Long latency L2 miss (wait for 30 cycles)
  - Backlog buffer full
  - System call or Interaction with I/O

- The process involves:
  - Flushing the pipeline
  - Switching the mode bit
  - Fetching instructions from the PC of ISPEC-ARF



### **Operation in Introspection Mode**



### **Operation in Introspection Mode**


## **Operation in Introspection Mode**



## **Operation in Introspection Mode**



## **Switching from Introspection Mode to Performance Mode**

- Exit Introspection mode when:
  - L2 miss is serviced
  - Backlog buffer is empty

- The process involves:
  - Flushing the pipeline
  - Switching the mode bit
  - Fetching instructions from the PC of PERF-ARF



#### Outline

#### Introduction

- Concept of Introspection
- Implementation of MBI
- Evaluation of MBI
- **Q** Summary



## Methodology

- Pipeline: 20 stages, 8-wide issue with out-of-order execution
- L2 cache: Unified, 1MB, 15-cycle hit
- Memory: 400 cycles (32 banks)
- Prefetcher: Streaming with up to 32 streams
- Backlog Buffer contains 2k entries
- Benchmarks: SPEC CPU2000



#### **Performance Overhead**

- Forced-Introspection penalty
- Pipeline-Fill penalty



#### **Performance Overhead**

- Forced-Introspection penalty
- Pipeline-Fill penalty

UTEECE



The International Conference on Dependable Systems and Networks - 2005

## **Other Overheads: Storage and Error-Detection Latency**

#### • Storage overhead:

Depends on the Number of Entries in the Backlog Buffer

| Num. Entries | Storage Cost | IPC overhead |
|--------------|--------------|--------------|
| 1k           | 8.5kB        | 16.1%        |
| 2k           | 16.5kB       | 14.5%        |
| 4k           | 32.5kB       | 13.9%        |

- Error Detection Latency
  - Average: 682 cycles
  - Worst case: 36183 cycles (impact depends on the <u>error correction mechanism</u>)

#### Outline

#### Introduction

- Concept of Introspection
- **L** Implementation of MBI
- **•** Evaluation of MBI
- **Q** Summary



#### **Summary**

- Transient fault rate is increasing. Processors will need some form of fault tolerance.
- Fault tolerant mechanisms with low hardware cost are attractive because they allow the designs to be used for a wide variety of applications.
- MBI utilizes the idle time during long latency cache misses for redundant execution. MBI has low hardware overhead and provides coverage for the entire pipeline.
- The time redundancy of MBI incurs an average IPC overhead of 14.5%.



# Questions



## **Extra Slides**





MBI is primarily a fault detection mechanism. Possible recovery mechanisms include fail-safe and checkpointing. Can also use undo mechanism...

Recover register state by copying ISPEC-ARF to PERF-ARF Recover memory state by undoing younger store operations.

















































MBI is primarily a fault detection mechanism. Possible recovery mechanisms include fail-safe and checkpointing. Can also use undo mechanism...



No effect of the error causing instruction or any later instruction is architecturally visible.



MBI is primarily a fault detection mechanism. Possible recovery mechanisms include fail-safe and checkpointing. Can also use undo mechanism...



Needs special handling in case of multiprocessors. Data about error detection latency is <u>here</u>



- **Dual modulo redundancy:** [e.g. IBM-z390, Compaq-Himalaya]
  - High hardware cost



- **Dual modulo redundancy:** [e.g. IBM-z390, Compaq-Himalaya]
  - High hardware cost
- **DIVA** [Austin MICRO'99]
  - Needs a checker processor substantial investment
  - Does not provide full coverage for the pipeline



- **Dual modulo redundancy:** [e.g. IBM-z390, Compaq-Himalaya]
  - High hardware cost
- **DIVA** [Austin MICRO'99]
  - Needs a checker processor substantial investment
  - Does not provide full coverage for the pipeline
- Software based techniques: [EDDI, SWIFT, etc.]
  - May need recompilation (not always possible)
  - Cannot take advantage of run-time information (e.g. cache miss)



- **Dual modulo redundancy:** [e.g. IBM-z390, Compaq-Himalaya]
  - High hardware cost
- **DIVA** [Austin MICRO'99]
  - Needs a checker processor substantial investment
  - Does not provide full coverage for the pipeline
- Software based techniques: [EDDI, SWIFT, etc.]
  - May need recompilation (not always possible)
  - Cannot take advantage of run-time information (e.g. cache miss)
- SMT based techniques: [e.g. RMT ISCA'00]
  - Need a fine-grained multi-threaded machine
  - Reduces throughput
















## **Breakdown of Introspection Episodes**





## **Coverage of Different Types of Introspection**



