A complete intelligent computing system involves many aspects, such as processing chips, system architecture, programming environments, and software, making it a difficult topic to master in a short time. AI Computing Systems: An Application-Driven Perspective adopts the principle of "application-driven, full-stack penetration" and uses the concrete intelligent application of "image style transfer" as a starting point, enabling readers to obtain a full view of the AI computing system.
Key Features:
- Provides an in-depth analysis of the underlying principles behind the knowledge used in intelligent computing systems
- Centers on the application-driven, full-stack-penetration approach, focusing on the knowledge required to implement the driving application at every level of the software and hardware technology stack
- Supporting experimental tutorials, covering the key knowledge points of each chapter, provide practical guidance and formalization tools for developing a simple AI computing system
Readership:
Graduate-level students taking advanced artificial intelligence courses within computer science and computer engineering; AI researchers
Author(s): Yunji Chen, Ling Li, Wei Li, Qi Guo, Zidong Du, Zichen Xu
Edition: 1
Publisher: Morgan Kaufmann
Year: 2023
Language: English
Pages: 600
City: Cambridge, MA
Tags: Artificial Intelligence; Neural Networks; Deep Learning; AI Computing Systems; AI Programming Frameworks
Cover
Contents
Preface for the English version
Preface
Motivation for this book
Value of an AI computing systems course
Content of the AI computing systems course
Writing of this book
Biographies
1 Introduction
1.1 Artificial intelligence
1.1.1 What is artificial intelligence?
1.1.2 The history of AI
1.1.2.1 1956–1960s, the First Wave
1.1.2.2 1975–1991, the Second Wave
1.1.2.3 2006–present, the Third Wave
1.1.3 Mainstreams in AI
1.1.3.1 Behaviorism
1.1.3.2 Symbolism
1.1.3.3 Connectionism
1.2 AI computing systems
1.2.1 What are AI computing systems?
1.2.2 The necessity of AICSs
1.2.3 Trends in AICSs
1.2.3.1 The first generation of AICSs
1.2.3.2 The second generation of AICSs (2010–present)
1.2.3.3 The future of the third generation of AICSs
1.3 A driving example
1.4 Summary
Exercises
2 Fundamentals of neural networks
2.1 From machine learning to neural networks
2.1.1 Basic concepts
2.1.2 Linear regression
2.1.3 Perceptron
2.1.4 Two-layer neural network: multilayer perceptron
2.1.5 Deep neural networks (deep learning)
2.1.6 The history of neural networks
2.2 Neural network training
2.2.1 Forward propagation
2.2.2 Backward propagation
2.3 Neural network design: the principle
2.3.1 Network topology
2.3.2 Activation function
2.3.2.1 Sigmoid function
2.3.2.2 Tanh function
2.3.2.3 ReLU function
2.3.2.4 PReLU/Leaky ReLU function
2.3.2.5 ELU function
2.3.3 Loss function
2.3.3.1 Mean squared error loss function
2.3.3.2 Cross-entropy loss function
2.4 Overfitting and regularization
2.4.1 Overfitting
2.4.2 Regularization
2.4.2.1 Parameter norm penalty
2.4.2.2 Sparsification
2.4.2.3 Bagging
2.4.2.4 Dropout
2.4.2.5 Summary
2.5 Cross-validation
2.6 Summary
Exercises
3 Deep learning
3.1 Convolutional neural networks for image processing
3.1.1 CNN components
3.1.2 Convolutional layer
3.1.2.1 Convolution operation
3.1.2.2 Convolution on multiple input-output feature maps
3.1.2.3 Feature detection in the convolutional layer
3.1.2.4 Padding
3.1.2.5 Stride
3.1.2.6 Summary
3.1.3 Pooling layer
3.1.4 Fully connected layer
3.1.5 Softmax layer
3.1.6 CNN architecture
3.2 CNN-based classification algorithms
3.2.1 AlexNet
3.2.1.1 LRN
3.2.1.2 Dropout
3.2.1.3 Summary
3.2.2 VGG
3.2.2.1 Network architecture
3.2.2.2 Convolution-pooling architecture
3.2.2.3 Summary
3.2.3 Inception
3.2.3.1 Inception-v1
3.2.3.2 BN-Inception
3.2.3.3 Inception-v3
3.2.3.4 Summary
3.2.4 ResNet
3.3 CNN-based object detection algorithms
3.3.1 Evaluation metrics
3.3.1.1 IoU
3.3.1.2 mAP
3.3.2 R-CNN series
3.3.2.1 R-CNN
3.3.2.2 Fast R-CNN
3.3.2.3 Faster R-CNN
3.3.2.4 Summary
3.3.3 YOLO
3.3.3.1 Unified detection
3.3.3.2 Network architecture
3.3.3.3 Summary
3.3.4 SSD
3.3.5 Summary
3.4 Sequence models: recurrent neural networks
3.4.1 RNNs
3.4.1.1 RNN architecture
3.4.1.2 Back-propagation of RNN
3.4.2 LSTM
3.4.3 GRU
3.4.4 Summary
3.5 Generative adversarial networks
3.5.1 GAN modeling
3.5.2 Training in GAN
3.5.2.1 The training process
3.5.2.2 Loss function
3.5.2.3 Problems with GAN
3.5.3 The GAN framework
3.5.3.1 Deep convolutional GAN
3.5.3.2 Conditional GAN
3.6 Driving example
3.6.1 CNN-based image style transfer
3.6.2 Real-time style transfer
3.7 Summary
Exercises
4 Fundamentals of programming frameworks
4.1 Necessities of programming frameworks
4.2 Fundamentals of programming frameworks
4.2.1 Generic programming frameworks
4.2.2 TensorFlow basics
4.3 TensorFlow: model and tutorial
4.3.1 Computational graph
4.3.2 Operations
4.3.3 Tensors
4.3.3.1 Tensor data type
4.3.3.2 Tensor shape
4.3.3.3 Tensor device
4.3.3.4 Tensor operations
4.3.4 Tensor session
4.3.4.1 Session creation
4.3.4.2 Session execution
4.3.4.3 Session close
4.3.4.4 Tensor evaluation
4.3.5 Variable
4.3.5.1 Variable creation
4.3.5.2 Variable initialization
4.3.5.3 Variable updating
4.3.6 Placeholders
4.3.7 Queue
4.4 Deep learning inference in TensorFlow
4.4.1 Load input
4.4.2 Define the basic operations
4.4.3 Create neural network models
4.4.4 Output prediction
4.5 Deep learning training in TensorFlow
4.5.1 Data loading
4.5.1.1 Feeding
4.5.1.2 Prefetching
4.5.1.3 Build input pipeline based on queue API
4.5.1.4 tf.data API
4.5.2 Training models
4.5.2.1 Define loss function
4.5.2.2 Create an optimizer
4.5.2.3 Define the training method
4.5.3 Model checkpoint
4.5.3.1 Save model
4.5.3.2 Restore model
4.5.4 Image style transfer training
4.6 Summary
Exercises
5 Programming framework principles
5.1 TensorFlow design principles
5.1.1 High performance
5.1.2 Easy development
5.1.3 Portability
5.2 TensorFlow computational graph mechanism
5.2.1 Computational graph
5.2.1.1 Automatic differentiation
5.2.1.2 Checkpoint
5.2.1.3 Control flow
5.2.1.4 Execution mode
5.2.2 Local execution of a computational graph
5.2.2.1 Computational graph pruning
5.2.2.2 Computational graph placement
5.2.2.3 Computational graph optimization
5.2.2.4 Computational graph partitioning and device communication
5.2.3 Distributed execution of computational graphs
5.2.3.1 Distributed communication
5.2.3.2 Fault tolerance mechanism
5.3 TensorFlow system implementation
5.3.1 Overall architecture
5.3.2 Computational graph execution module
5.3.2.1 Session execution
5.3.2.2 Executor logic
5.3.3 Device abstraction and management
5.3.4 Network and communication
5.3.4.1 Local communication: LocalRendezvousImpl
5.3.4.2 Remote communication: RemoteRendezvous
5.3.5 Operator definition
5.4 Programming framework comparison
5.4.1 TensorFlow
5.4.2 PyTorch
5.4.3 MXNet
5.4.4 Caffe
5.5 Summary
Exercises
6 Deep learning processors
6.1 Deep learning processors (DLPs)
6.1.1 The purpose of DLPs
6.1.2 The development history of DLPs
6.1.3 The design motivation
6.2 Deep learning algorithm analysis
6.2.1 Computational characteristics
6.2.1.1 Fully connected layer
6.2.1.2 Convolutional layer
6.2.1.3 Pooling layer
6.2.2 Memory access patterns
6.2.2.1 Fully connected layer
6.2.2.2 Convolutional layer
6.2.2.3 Pooling layer
6.3 DLP architecture
6.3.1 Instruction set architecture
6.3.2 Pipeline
6.3.3 Computing unit
6.3.3.1 Vector MAC
6.3.3.2 Extensions to vector MAC
6.3.3.3 VFU and MFU
6.3.4 Memory access unit
6.3.5 Mapping from algorithm to chip
6.3.6 Summary
6.4 *Optimization design
6.4.1 Scalar MAC-based computing unit
6.4.2 Sparsity
6.4.3 Low bit-width
6.5 Performance evaluation
6.5.1 Performance metrics
6.5.2 Benchmarking
6.5.3 Factors affecting performance
6.6 Other accelerators
6.6.1 The GPU architecture
6.6.2 The FPGA architecture
6.6.3 Comparison of DLPs, GPU, and FPGA
6.7 Summary
Exercises
7 Architecture for AI computing systems
7.1 Single-core deep learning processor
7.1.1 Overall architecture
7.1.2 Control module
7.1.2.1 Instruction fetching unit
7.1.2.2 Instruction decoding unit
7.1.3 Arithmetic module
7.1.3.1 VFU
7.1.3.2 MFU
7.1.4 Storage unit
7.1.5 Summary of single-core deep learning processor
7.2 The multicore deep learning processor
7.2.1 The DLP-M architecture
7.2.2 The cluster architecture
7.2.2.1 Broadcast bus
7.2.2.2 CDMA
7.2.2.3 GDMA
7.2.2.4 Multicore synchronization model
7.2.3 Interconnection architecture
7.2.3.1 The topology of multicore interconnections
7.2.3.2 Interconnection implementation
7.2.3.3 Interconnection between DLP-Cs
7.2.4 Summary of multicore deep learning processors
7.3 Summary
Exercises
8 AI programming language for AI computing systems
8.1 Necessity of AI programming language
8.1.1 Semantic gap
8.1.2 Hardware gap
8.1.3 Platform gap
8.1.4 Summary
8.2 Abstraction of AI programming language
8.2.1 Abstract hardware architecture
8.2.2 Typical AI computing system
8.2.3 Control model
8.2.4 Computation model
8.2.4.1 Customized computing unit
8.2.4.2 Parallel computing architecture
8.2.5 Memory model
8.3 Programming models
8.3.1 Heterogeneous programming model
8.3.1.1 Overview
8.3.1.2 Basic process
8.3.1.3 Compiler support
8.3.1.4 Runtime support
8.3.2 General AI programming model
8.3.2.1 Kernel function
8.3.2.2 Compiler support
8.3.2.3 Runtime support
8.4 Fundamentals of AI programming language
8.4.1 Syntax overview
8.4.2 Data type
8.4.2.1 Precision type
8.4.2.2 Semantic type
8.4.3 Macros, constants, and built-in variables
8.4.4 I/O operation
8.4.5 Scalar computation
8.4.6 Tensor computation
8.4.7 Control flow
8.4.7.1 Branch
8.4.7.2 Loop
8.4.7.3 Synchronization
8.4.8 Serial program example
8.4.9 Parallel program example
8.5 Programming interface of AI applications
8.5.1 Kernel function interface
8.5.1.1 Overview
8.5.1.2 API introduction
8.5.2 Runtime interface
8.5.2.1 Device management
8.5.2.2 Queue management
8.5.2.3 Memory management
8.5.3 Usage example
8.5.3.1 Writing kernel function
8.5.3.2 Device initialization
8.5.3.3 Host/device-side data preparation
8.5.3.4 Device-side memory space allocation
8.5.3.5 Copy data to the device
8.5.3.6 Invoking kernel to start the device
8.5.3.7 Obtaining the result
8.5.3.8 Resource release
8.6 Debugging AI applications
8.6.1 Functional debugging method
8.6.1.1 Overview
8.6.1.2 Programming language debugging
8.6.1.3 Programming framework debugging
8.6.2 Function debugging interface
8.6.2.1 Functional debugging interface of programming language
8.6.2.2 Functional debugging interface of programming framework
8.6.3 Function debugging tool
8.6.3.1 Debugger for programming languages
8.6.3.2 Debugger for programming framework
8.6.4 Precision debugging method
8.6.5 Function debugging practice
8.6.5.1 Serial program debugging
8.6.5.2 Parallel program debugging
8.6.5.3 Precision debugging
8.7 Optimizing AI applications
8.7.1 Performance tuning method
8.7.1.1 Overview
8.7.1.2 Use on-chip memory
8.7.1.3 Tensor computation
8.7.1.4 Multicore parallel
8.7.2 Performance tuning interface
8.7.2.1 Notifier interface
8.7.2.2 Hardware performance counter interface
8.7.3 Performance tuning tools
8.7.3.1 Application-level profiling tools
8.7.3.2 System-level monitoring tools
8.7.4 Performance tuning practice
8.7.4.1 Overall process
8.7.4.2 Use on-chip cache
8.7.4.3 Tensorization
8.7.4.4 Algorithm optimization
8.7.4.5 Constant preprocessing
8.7.4.6 Analysis of optimization results
8.8 System development on AI programming language
8.8.1 High-performance library operator development
8.8.1.1 Principle and process
8.8.1.2 Customized operator integration
8.8.1.3 High-performance library operator development example
8.8.2 Programming framework operator development
8.8.2.1 Principle and process
8.8.2.2 TensorFlow integrates customized operators
8.8.3 System development and optimization practice
8.8.3.1 Overall performance analysis
8.8.3.2 Fusion operator development
8.8.3.3 Fusion operator substitution
8.8.3.4 Fusion operator integration
8.8.3.5 Analysis of optimization results
8.9 Exercises
9 Practice: AI computing systems
9.1 Basic practice: image style transfer
9.1.1 Operator implementation based on AI programming language
9.1.1.1 Difference square operator
9.1.1.2 Fractionally strided convolution operator
9.1.1.3 Precision metrics
9.1.1.4 Customized operators integrated into the TensorFlow framework
9.1.2 Implementation of image style transfer
9.1.2.1 Low-bit-width representation of the model
9.1.2.2 Deep learning inference
9.1.3 Image style transfer practice
9.1.3.1 Implementation of image style transfer based on a cloud platform
9.1.3.2 Implementation of image style transfer based on the development board
9.2 Advanced practice: object detection
9.2.1 Operator implementation based on AI programming language
9.2.1.1 Implementation of PostprocessRpnKernel
9.2.1.2 Implementation of SecondStagePostprocessKernel
9.2.1.3 Fusion operator replacement
9.2.1.4 Fusion operator integration
9.2.2 Implementation of object detection
9.3 Extended practices
A Fundamentals of computer architecture
A.1 The instruction set of general-purpose CPUs
A.2 Memory hierarchy in computing systems
A.2.1 Cache
A.2.2 Scratchpad memory
B Experimental environment
B.1 Cloud platform
B.1.1 Login
B.1.2 Changing password
B.1.3 Set up SSH client
B.1.4 Unzipping the file package
B.1.5 Setting environment variables
B.2 Development board
References
Final words
Index