By Arjun Chaudhuri, Duke University, USA, arjun.chaudhuri@duke.edu | Ching-Yuan Chen, Duke University, USA, chingyuan.chen@duke.edu | Krishnendu Chakrabarty, Arizona State University, USA, krishnendu.chakrabarty@asu.edu
Emerging device technologies such as silicon photonics, nonvolatile memories, and heterogeneous monolithic 3D (M3D) integration are being explored as post-Moore’s law alternatives for achieving high-density integration of many-core AI accelerators. In addition to innovations at the device level, architectural optimizations are also being carried out to achieve high-performance processing of large AI workloads with custom accelerator hardware. Systolic array-based inferencing accelerators achieve higher throughput and improved energy efficiency compared to CPUs and GPUs because of the homogeneous and regular data flow in systolic arrays. However, the performance of such emerging AI accelerators can be adversely affected by faults due to process variations, manufacturing defects, and aging. In this monograph, we analyze the performance of several emerging AI accelerators in the presence of different uncertainties and present low-cost methods to assess the significance of faults and mitigate their effects. We show that across all technologies, the functional criticality of faults can vary significantly based on the fault type, fault location, and the application workload. The fault criticality assessment and mitigation techniques presented in this monograph are necessary for enabling low-cost test, diagnosis, and design of robust AI accelerators.
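The throughput advantage attributed above to the "homogeneous and regular data flow" of systolic arrays can be illustrated with a minimal sketch. The following is an assumption-laden toy model (not code from the monograph) of an output-stationary systolic array computing a matrix product: each processing element (PE) owns one output accumulator and performs at most one multiply-accumulate per cycle, with operands skewed so they arrive from neighboring PEs at the correct cycle.

```python
# Toy model of an output-stationary systolic array computing C = A @ B.
# Each PE (i, j) holds accumulator C[i][j]; inputs are skewed so that
# row i of A is delayed by i cycles and column j of B by j cycles.
# The nested loops over (i, j) model PEs operating in parallel per cycle.

def systolic_matmul(A, B):
    n, m = len(A), len(A[0])
    p = len(B[0])
    C = [[0] * p for _ in range(n)]          # one accumulator per PE
    total_cycles = n + p + m - 2             # pipeline fill + drain
    for t in range(total_cycles):
        for i in range(n):
            for j in range(p):
                k = t - i - j                # operand index reaching PE (i, j) at cycle t
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]   # one MAC per PE per cycle
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Because every PE exchanges data only with its immediate neighbors on a fixed schedule, no global interconnect or cache hierarchy is exercised per MAC, which is the source of the energy-efficiency advantage over CPUs and GPUs noted above.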
The rapid growth in big data from mobile, Internet of things (IoT), and edge devices, and the continued demand for higher computing power, have established deep learning as the cornerstone of most artificial intelligence (AI) applications today. Recent years have seen a push towards deep learning implemented on domain-specific AI accelerators that support custom memory hierarchies, variable precision, and optimized matrix multiplication. Commercial AI accelerators have shown superior energy and footprint efficiency compared to GPUs for a variety of inference tasks.
In this monograph, we discuss the roadblocks that must be understood and analyzed to ensure the functional robustness of emerging AI accelerators. We present state-of-the-art practices for structural and functional testing of these accelerators, as well as methodologies for assessing the functional criticality of hardware faults; by targeting only the functionally critical faults, these methodologies reduce test time.
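The notion of functional criticality can be sketched with a small example. The code below is an illustrative assumption, not the monograph's exact methodology: a stuck-at fault is injected into one bit of an 8-bit quantized weight, and the fault is classified as critical only if the faulty dot product deviates from the fault-free ("golden") result beyond a tolerance. The names `inject_stuck_at`, `is_critical`, and the tolerance value are hypothetical.

```python
# Hypothetical fault-criticality check: inject a stuck-at fault into one
# bit of an 8-bit weight and compare the faulty output to the golden one.

def inject_stuck_at(weight, bit, stuck_to):
    """Force bit `bit` of an 8-bit weight to 0 or 1 (stuck-at fault)."""
    w = weight & 0xFF
    return w | (1 << bit) if stuck_to else w & ~(1 << bit)

def is_critical(weights, inputs, bit, stuck_to, tol=8):
    """Classify the fault as critical if output error exceeds `tol`."""
    golden = sum(w * x for w, x in zip(weights, inputs))
    faulty = sum(inject_stuck_at(w, bit, stuck_to) * x
                 for w, x in zip(weights, inputs))
    return abs(faulty - golden) > tol

weights, inputs = [12, 34, 56], [1, 2, 1]
print(is_critical(weights, inputs, bit=7, stuck_to=1))  # True  (MSB flip)
print(is_critical(weights, inputs, bit=0, stuck_to=0))  # False (LSBs already 0)
```

The example reflects the point made above: the same fault model can be benign or critical depending on the fault location (here, bit position) and the workload (here, the input vector), which is why criticality-aware testing can safely skip benign faults.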
This monograph also highlights recent research efforts to improve the test and reliability of neuromorphic computing systems built using non-volatile memory (NVM) devices such as spin-transfer torque magnetic RAM (STT-MRAM) and resistive RAM (ReRAM). Also discussed are the robustness of silicon-photonic neural networks and the reliability concerns arising from manufacturing defects and process variations in monolithic 3D (M3D)-based near-memory computing systems.