With the end of Dennard scaling and Moore’s law, IC chips, especially large-scale ones, face growing reliability challenges, and reliability has become one of the principal figures of merit for VLSI designs. In this context, this book presents a built-in on-chip fault-tolerant computing paradigm that combines fault detection, fault diagnosis, and error recovery for large-scale VLSI designs in a unified manner, so as to minimize resource overhead and performance penalties. Following this paradigm, we propose a holistic solution based on three key components: self-test, self-diagnosis, and self-repair, or “3S” for short. We then explore the use of 3S in general IC designs, general-purpose processors, networks-on-chip (NoCs), and deep learning accelerators, and present prototypes that demonstrate how 3S responds to in-field silicon degradation and recovers from various runtime faults caused by aging, process variations, or radiation particles. Moreover, we demonstrate that 3S not only offers a powerful backbone for various on-chip fault-tolerant designs and implementations, but also has further-reaching implications, such as maintaining graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield. This book is the outcome of extensive fault-tolerant computing research pursued at the State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences over the past decade. The proposed paradigm has been verified in a broad range of scenarios, from small processors in satellite computers to large processors in high-performance computers (HPCs). We hope it will provide an alternative yet effective solution to the growing reliability challenges of large-scale VLSI designs.
Author(s): Xiaowei Li, Guihai Yan, Cheng Liu
Publisher: Springer
Year: 2023
Language: English
Pages: 317
City: Singapore
Preface
Foreword
Contents
Acronyms
1 Introduction
1.1 Typical On-Chip Faults
1.1.1 Process Variation
1.1.2 Manufacturing Defects
1.1.3 Chip Aging
1.1.4 Soft Errors
1.1.5 Intermittent Faults
1.1.6 Emerging Technologies Induced Defects
1.2 Conventional Fault-Tolerant Chip Design Wisdom
1.2.1 Design for Test
1.2.2 Design for Diagnosis
1.2.3 Design for Reliability
1.3 Built-In Fault-Tolerant Computing Paradigm
1.3.1 Self-test
1.3.2 Self-diagnosis
1.3.3 Self-repair
1.3.3.1 Rejuvenation at the Circuit Level
1.3.3.2 Rejuvenation at the Microarchitectural Level
1.3.3.3 Rejuvenation at the Architectural Level
1.3.4 General Benefits
1.3.4.1 Maintaining Graceful Degradation
1.3.4.2 Helping Fix Some Verification Blind Spots
1.3.4.3 Improving Gross Yield
1.4 Summary
References
2 Fault-Tolerant Circuits
2.1 On-Line Fault Detection
2.1.1 Challenges for On-Line Fault Detection
2.1.2 Stability Violation Based Fault Detection
2.1.2.1 Target Fault Types
2.1.2.2 Modeling Faulty Signals
2.1.3 Timing Constraints Exploration
2.1.3.1 Propagation of Stability Violation
2.1.3.2 XOR Protection
2.1.3.3 SEU Detection “Blind Zone”
2.1.3.4 Available Precharge Period
2.1.4 On-Line Fault Detection Architecture
2.1.4.1 Circuit Design
2.1.4.2 Low-Overhead Deployment
2.1.4.3 Clock Variation Consideration
2.1.5 Experiment Result Analysis
2.1.5.1 Evaluating SVFD Unit
2.1.5.2 Case Study: An Application of SVFD
2.1.5.3 Comparison with Other Schemes
2.1.6 Discussion
2.1.6.1 On SVFD Application
2.1.6.2 Variation and Aging Considerations
2.1.6.3 Distinguish Detection Results
2.2 On-Chip Path Delay Measurement
2.2.1 Path Delay Measurement and Fault Tolerance
2.2.1.1 Challenges for Path Delay Measurement
2.2.1.2 Prior Path Delay Measurements
2.2.2 Path Delay Measurement Circuits
2.2.2.1 Basic Structure and Operation
2.2.3 Delay Range Calibration
2.2.4 Path Delay Measurement Architecture
2.2.4.1 Signal Transition Conversion (STC)
2.2.4.2 Delay Measurement
2.2.4.3 Delay Calibration for Import Lines
2.2.5 Experiment Result Analysis
2.2.5.1 Experiment I
2.2.5.2 Experiment II
2.2.5.3 Experiment III
2.2.5.4 Area and Timing Overhead
2.2.5.5 Comparison A
2.2.5.6 Comparison B
2.2.6 Discussion
2.3 Lifetime Fault-Tolerant Circuit Design
2.3.1 Aging Symptoms and Aging Sensors
2.3.2 Lifetime Fault-Tolerant Architecture
2.3.3 Self-adaptive Fault-Tolerant Pipeline
2.3.3.1 Timing Imbalance
2.3.3.2 Self-Adaptive Design Example
2.3.4 Self-adaptive Agent
2.3.4.1 Round-Robin Trial Adaptation (RRTA)
2.3.4.2 Agent Implementation
2.3.4.3 False Alarm Filter
2.3.4.4 Complexity Analysis and Two Critical Optimizations
2.3.4.5 Deploy Agents and Sensors
2.3.5 Architecture Implementation
2.3.5.1 Clock Generation and Overhead Analysis
2.3.5.2 ReviveNet-Supported Clock Gating
2.3.5.3 Implication of Multi-Cycle Paths
2.3.5.4 Impact of ReviveNet Wearout
2.3.6 Model Based Reliability Analysis
2.3.6.1 Reliability Model
2.3.6.2 Implication of TH
2.3.7 Case Study and Discussion
2.3.7.1 Experiment Setups
2.3.7.2 Results and Discussions
2.3.8 Discussion
2.4 Summary
References
3 Fault-Tolerant General-Purpose Processors
3.1 Challenges of Fault-Tolerant Processor Design
3.1.1 Processor Vulnerability Characterizing
3.1.2 Sick Processor Management
3.2 Processor Vulnerability Evaluation
3.2.1 Vulnerability Analysis Methods
3.2.2 Intermittent Fault Oriented Analysis
3.2.2.1 Intermittent Stuck-at Faults
3.2.2.2 Intermittent Open and Short Faults
3.2.2.3 Intermittent Timing Faults
3.2.2.4 Statistical Significance
3.2.3 Experiment Result Analysis
3.2.3.1 Experiment Setups
3.2.3.2 IVF Computation for Different Intermittent Fault Models
3.2.3.3 IVF Computation for Different Microprocessor Configurations and Program Phases
3.2.3.4 IVF Guided Reliable Design
3.2.4 Discussion
3.3 Multi-Core Processor Salvaging
3.3.1 Dynamic Sick Core Ranking
3.3.1.1 Healthy Condition Definition
3.3.1.2 Snippet Definition
3.3.1.3 Snippet Characterization
3.3.1.4 Different Snippets Susceptible to Different Defects
3.3.1.5 Dynamic Healthy Condition Quantification
3.3.1.6 Validation of Healthy Condition (H)
3.3.1.7 Impact of Dynamic Management
3.3.1.8 Handling Failed Cores
3.3.2 Core Ranking Implementation
3.3.2.1 Classification
3.3.2.2 Deciding Design Parameters
3.3.2.3 Choosing Appropriate Hash Functions
3.3.2.4 Handling Sparsity of H
3.3.2.5 Hardware Overhead
3.3.3 Experiment Result Analysis
3.3.3.1 Experimental Setup
3.3.3.2 Workloads
3.3.3.3 Result Analysis
3.3.3.4 Comparing with Defect-Aware Scheme
3.3.3.5 Comparing with Heterogeneity-Aware Scheme
3.3.4 Discussion
3.4 Summary
References
4 Fault-Tolerant Network-On-Chip
4.1 Introduction to NoC Fault Tolerance
4.1.1 Fault-Tolerant NoC Architecture
4.1.2 Fault-Tolerant NoC Routing
4.1.3 Fault-Tolerant NoC Circuits
4.2 NoC Fault Tolerance with Topology Reconfiguration
4.2.1 NoC Topology Reconfiguration
4.2.1.1 Core-Level Redundancy in Homogeneous Manycore Processors
4.2.1.2 Topology Impacts on NoC-Based Manycore Systems
4.2.1.3 Physical Topology and Virtual Topology
4.2.2 NoC Topology Virtualization Formulation
4.2.3 NoC Topology Virtualization Optimization
4.2.3.1 TRP-I: An Instance of Quadratic Assignment Problem
4.2.3.2 TRP-II: An Instance of Vectorial Quadratic Assignment Problem
4.2.3.3 An Adopted Simulated Annealing Algorithm
4.2.3.4 Row Rippling Column Stealing Algorithm (RRCS)
4.2.3.5 RRCS-Guided Simulated Annealing Algorithm
4.2.4 Experiment Result Analysis
4.2.4.1 Experimental Setup
4.2.4.2 Experiment I
4.2.4.3 Experiment II
4.2.4.4 Experiment III
4.2.5 Discussion
4.3 NoC Fault Tolerance with Routing
4.3.1 Challenges of Fault-Tolerant NoC Routing
4.3.2 Preliminaries of Fault-Tolerant Routing
4.3.2.1 2-D Meshes
4.3.2.2 Turn Model
4.3.2.3 Odd-Even Turn Model
4.3.2.4 Fault Model
4.3.3 Defense Zones
4.3.4 ZoneDefense Routing Algorithms
4.3.5 Proof of Fault-Tolerant Routing
4.3.6 Experiment Result Analysis
4.3.6.1 Fault Model Comparison
4.3.6.2 Performance Analysis
4.3.6.3 Overhead Analysis
4.3.7 Discussion
4.4 NoC Fault Tolerance with Data Path Salvaging
4.4.1 Fault-Tolerant Router Architecture
4.4.2 Data Path Salvaging Implementation
4.4.3 Experiment Result Analysis
4.4.3.1 Area Overhead
4.4.3.2 Reliability
4.4.3.3 Performance
4.4.4 Discussion
4.5 Summary
References
5 Fault-Tolerant Deep Learning Processors
5.1 Introduction to Fault-Tolerant Deep Learning
5.1.1 Deep Learning Processor Basis
5.1.1.1 Typical 2D-Array Based Deep Learning Accelerator
5.1.1.2 ReRAM-Based DNN Computing
5.1.1.3 Neural Network Training Basis
5.1.2 Challenges of Fault-Tolerant Deep Learning
5.2 Fault-Tolerant Deep Learning Architecture
5.2.1 Deep Learning Sensitivity to Hardware Faults
5.2.2 Recomputing Based Hybrid Computing Architecture
5.2.3 HyCA Micro-Architecture
5.2.3.1 Fault Detection with HyCA
5.2.4 Experiment Result Analysis
5.2.4.1 Experiment Setup
5.2.4.2 Chip Area Overhead Comparison
5.2.4.3 Reliability Comparison
5.2.4.4 Performance Comparison
5.2.4.5 Redundancy Design Scalability Analysis
5.2.4.6 Fault Detection Analysis
5.2.5 Discussion
5.3 Online Fault Protection for ReRAM-Based Deep Learning
5.3.1 RRamedy Framework Overview
5.3.1.1 Design Goals
5.3.1.2 Target Fault Models
5.3.1.3 Design Requirements
5.3.2 Adversarial Example Testing on the Edge
5.3.3 Fault-Masking Retraining on the Cloud
5.3.4 In-Situ Model Remedy on the Edge
5.3.5 Experiment Result Analysis
5.3.5.1 Experiment Setup
5.3.5.2 Effectiveness of Adversarial Example Testing
5.3.5.3 Effectiveness of Offline Retraining
5.3.5.4 Effectiveness of Online Retraining
5.3.6 Discussion
5.4 Summary
References
6 Conclusion