Accelerators for Convolutional Neural Networks

Accelerators for Convolutional Neural Networks is a comprehensive and thorough resource exploring different types of convolutional neural networks and complementary accelerators.

The book provides foundational deep learning knowledge and instructive content for building convolutional neural network (CNN) accelerators, aimed at Internet of Things (IoT) and edge computing practitioners. It elucidates compressive coding for CNNs, presents a two-step lossless input feature map compression method, discusses an arithmetic coding-based lossless weight compression method and the design of an associated decoding method, describes contemporary sparse CNNs that consider sparsity in both weights and activation maps, and discusses hardware/software co-design and co-scheduling techniques that lead to better optimization and utilization of the available hardware resources for CNN acceleration.

The first part of the book provides an overview of CNNs along with the composition and parameters of different contemporary CNN models. Later chapters focus on compressive coding for CNNs and the design of dense CNN accelerators. The book also provides directions for future research and development of CNN accelerators.

Sample topics covered in Accelerators for Convolutional Neural Networks include:

- How to apply arithmetic coding and decoding with range scaling for lossless compression of 5-bit CNN weights, enabling the deployment of CNNs in extremely resource-constrained systems
- State-of-the-art research on dense CNN accelerators, which are mostly based on systolic arrays or parallel multiply-accumulate (MAC) arrays
- The iMAC dense CNN accelerator, which combines image-to-column (im2col) and general matrix multiplication (GEMM) hardware acceleration, as illustrated in the sketch after this list
- A multi-threaded, low-cost, log-based processing element (PE) core, instances of which are stacked in a spatial grid to form the NeuroMAX dense accelerator
- Sparse-PE, a multi-threaded and flexible CNN PE core that exploits sparsity in both weights and activation maps; instances of it can be stacked in a spatial grid to build sparse CNN accelerators

For researchers in AI, computer vision, computer architecture, and embedded systems, along with graduate and senior undergraduate students in related programs of study, Accelerators for Convolutional Neural Networks is an essential resource for understanding the many facets of the subject and its relevant applications.
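
To make the im2col + GEMM lowering used by the iMAC accelerator concrete, here is a minimal NumPy sketch showing how a 2D convolution is rewritten as a single matrix multiplication. This is an illustrative software reference only, not the book's hardware design; the function names (im2col, conv2d_gemm) and the (C, H, W) layout are our own assumptions.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unroll the sliding kh-by-kw windows of input x (C, H, W) into
    columns, so convolution becomes one GEMM. (Illustrative sketch;
    names and layout are assumptions, not the book's API.)
    Returns (C*kh*kw, out_h*out_w) plus the output spatial dims."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            # Each receptive field becomes one column of the matrix.
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, idx] = patch.reshape(-1)
            idx += 1
    return cols, out_h, out_w

def conv2d_gemm(x, weights, stride=1):
    """Convolution via im2col + GEMM.
    x: (C, H, W) input; weights: (M, C, kh, kw) filter bank."""
    m, c, kh, kw = weights.shape
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    w_mat = weights.reshape(m, c * kh * kw)  # one flattened filter per row
    out = w_mat @ cols                       # the single matrix multiply
    return out.reshape(m, out_h, out_w)

# Quick shape check: 3-channel 8x8 input, four 3x3 filters -> (4, 6, 6)
x = np.random.randn(3, 8, 8).astype(np.float32)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
print(conv2d_gemm(x, w).shape)
```

The appeal of this lowering for hardware is that the resulting dense GEMM maps naturally onto systolic arrays or parallel MAC arrays, at the cost of duplicating overlapping input pixels in the unrolled matrix.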

Author(s): Arslan Munir; Joonho Kong; Mahmood Azhar Qureshi
Publisher: Wiley
Year: 2023

Language: English
Pages: 307

Cover
Title Page
Copyright
Contents
About the Authors
Preface
Part I Overview
Chapter 1 Introduction
1.1 History and Applications
1.2 Pitfalls of High‐Accuracy DNNs/CNNs
1.2.1 Compute and Energy Bottleneck
1.2.2 Sparsity Considerations
1.3 Chapter Summary
Chapter 2 Overview of Convolutional Neural Networks
2.1 Deep Neural Network Architecture
2.2 Convolutional Neural Network Architecture
2.2.1 Data Preparation
2.2.2 Building Blocks of CNNs
2.2.2.1 Convolutional Layers
2.2.2.2 Pooling Layers
2.2.2.3 Fully Connected Layers
2.2.3 Parameters of CNNs
2.2.4 Hyperparameters of CNNs
2.2.4.1 Hyperparameters Related to Network Structure
2.2.4.2 Hyperparameters Related to Training
2.2.4.3 Hyperparameter Tuning
2.3 Popular CNN Models
2.3.1 AlexNet
2.3.2 VGGNet
2.3.3 GoogLeNet
2.3.4 SqueezeNet
2.3.5 Binary Neural Networks
2.3.6 EfficientNet
2.4 Popular CNN Datasets
2.4.1 MNIST Dataset
2.4.2 CIFAR
2.4.3 ImageNet
2.5 CNN Processing Hardware
2.5.1 Temporal Architectures
2.5.2 Spatial Architectures
2.5.3 Near‐Memory Processing
2.6 Chapter Summary
Part II Compressive Coding for CNNs
Chapter 3 Contemporary Advances in Compressive Coding for CNNs
3.1 Background of Compressive Coding
3.2 Compressive Coding for CNNs
3.3 Lossy Compression for CNNs
3.4 Lossless Compression for CNNs
3.5 Recent Advancements in Compressive Coding for CNNs
3.6 Chapter Summary
Chapter 4 Lossless Input Feature Map Compression
4.1 Two‐Step Input Feature Map Compression Technique
4.2 Evaluation
4.3 Chapter Summary
Chapter 5 Arithmetic Coding and Decoding for 5‐Bit CNN Weights
5.1 Architecture and Design Overview
5.2 Algorithm Overview
5.2.1 Weight Encoding Algorithm
5.3 Weight Decoding Algorithm
5.4 Encoding and Decoding Examples
5.4.1 Decoding Hardware
5.5 Evaluation Methodology
5.6 Evaluation Results
5.6.1 Compression Ratio and Memory Energy Consumption
5.6.2 Latency Overhead
5.6.3 Latency vs. Resource Usage Trade‐Off
5.6.4 System‐Level Energy Estimation
5.7 Chapter Summary
Part III Dense CNN Accelerators
Chapter 6 Contemporary Dense CNN Accelerators
6.1 Background on Dense CNN Accelerators
6.2 Representation of the CNN Weights and Feature Maps in Dense Format
6.3 Popular Architectures for Dense CNN Accelerators
6.4 Recent Advancements in Dense CNN Accelerators
6.5 Chapter Summary
Chapter 7 iMAC: Image‐to‐Column and General Matrix Multiplication‐Based Dense CNN Accelerator
7.1 Background and Motivation
7.2 Architecture
7.3 Implementation
7.4 Chapter Summary
Chapter 8 NeuroMAX: A Dense CNN Accelerator
8.1 Related Work
8.2 Log Mapping
8.3 Hardware Architecture
8.3.1 Top‐Level
8.3.2 PE Matrix
8.4 Data Flow and Processing
8.4.1 3×3 Convolution
8.4.2 1×1 Convolution
8.4.3 Higher‐Order Convolutions
8.5 Implementation and Results
8.6 Chapter Summary
Part IV Sparse CNN Accelerators
Chapter 9 Contemporary Sparse CNN Accelerators
9.1 Background of Sparsity in CNN Models
9.2 Background of Sparse CNN Accelerators
9.3 Recent Advancements in Sparse CNN Accelerators
9.4 Chapter Summary
Chapter 10 CNN Accelerator for In Situ Decompression and Convolution of Sparse Input Feature Maps
10.1 Overview
10.2 Hardware Design Overview
10.3 Design Optimization Techniques Utilized in the Hardware Accelerator
10.4 FPGA Implementation
10.5 Evaluation Results
10.5.1 Performance and Energy
10.5.2 Comparison with State‐of‐the‐Art Hardware Accelerator Implementations
10.6 Chapter Summary
Chapter 11 Sparse‐PE: A Sparse CNN Accelerator
11.1 Related Work
11.2 Sparse‐PE
11.2.1 Sparse Binary Mask
11.2.2 Selection
11.2.3 Computation
11.2.4 Accumulation
11.2.5 Output Encoding
11.3 Implementation and Results
11.3.1 Cycle‐Accurate Simulator
11.3.1.1 Performance with Varying Sparsity
11.3.1.2 Comparison Against Past Approaches
11.3.2 RTL Implementation
11.4 Chapter Summary
Chapter 12 Phantom: A High‐Performance Computational Core for Sparse CNNs
12.1 Related Work
12.2 Phantom
12.2.1 Sparse Mask Representation
12.2.2 Core Architecture
12.2.3 Lookahead Masking
12.2.4 Top‐Down Selector
12.2.4.1 In‐Order Selection
12.2.4.2 Out‐of‐Order Selection
12.2.5 Thread Mapper
12.2.6 Compute Engine
12.2.7 Output Buffer
12.2.8 Output Encoding
12.3 Phantom‐2D
12.3.1 R×C Compute Matrix
12.3.2 Load Balancing
12.3.3 Regular/Depthwise Convolution
12.3.3.1 Intercore Balancing
12.3.4 Pointwise Convolution
12.3.5 FC Layers
12.3.6 Intracore Balancing
12.4 Experiments and Results
12.4.1 Evaluation Methodology
12.4.1.1 Cycle‐Accurate Simulator
12.4.1.2 Simulated Models
12.4.2 Results
12.4.2.1 TDS Variants Comparison
12.4.2.2 Impact of Load Balancing
12.4.2.3 Sensitivity to Sparsity and Lf
12.4.2.4 Comparison Against Past Approaches
12.4.2.5 RTL Synthesis Results
12.5 Chapter Summary
Part V HW/SW Co‐Design and Co‐Scheduling for CNN Acceleration
Chapter 13 State‐of‐the‐Art in HW/SW Co‐Design and Co‐Scheduling for CNN Acceleration
13.1 HW/SW Co‐Design
13.1.1 Case Study: Cognitive IoT
13.1.2 Recent Advancements in HW/SW Co‐Design
13.2 HW/SW Co‐Scheduling
13.2.1 Recent Advancements in HW/SW Co‐Scheduling
13.3 Chapter Summary
Chapter 14 Hardware/Software Co‐Design for CNN Acceleration
14.1 Background of iMAC Accelerator
14.2 Software Partition for iMAC Accelerator
14.2.1 Channel Partition and Input/Weight Allocation to Hardware Accelerator
14.2.2 Exploiting Parallelism Within Convolution Layer Operations
14.3 Experimental Evaluations
14.4 Chapter Summary
Chapter 15 CPU‐Accelerator Co‐Scheduling for CNN Acceleration
15.1 Background and Preliminaries
15.1.1 Convolutional Neural Networks
15.1.2 Baseline System Architecture
15.2 CNN Acceleration with CPU‐Accelerator Co‐Scheduling
15.2.1 Overview
15.2.2 Linear Regression‐Based Latency Model
15.2.2.1 Accelerator Latency Model
15.2.2.2 CPU Latency Model
15.2.3 Channel Distribution
15.2.4 Prototype Implementation
15.3 Experimental Results
15.3.1 Latency Model Accuracy
15.3.2 Performance
15.3.3 Energy
15.3.4 Case Study: Tiny Darknet CNN Inferences
15.4 Chapter Summary
Chapter 16 Conclusions
References
Index
EULA