Deep Learning Architectures - A Mathematical Approach

This book describes how neural networks operate from a mathematical point of view. As a result, neural networks can be interpreted both as universal function approximators and as information processors. The book bridges the gap between the ideas and concepts of neural networks, which are nowadays used at an intuitive level, and the precise language of modern mathematics, combining the best practices of the former with the robustness and elegance of the latter. The book can be used in a graduate course in deep learning, with the first few parts accessible to senior undergraduates. In addition, it will be of wide interest to machine learning researchers who seek a theoretical understanding of the subject.
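As a quick illustration of the first interpretation (not taken from the book), the following minimal Python sketch trains a one-hidden-layer sigmoid network by plain gradient descent to approximate sin(x) on [-π, π]; the network width, learning rate, and number of steps are arbitrary illustrative choices.

```python
# A minimal sketch (illustrative only): a one-hidden-layer sigmoid network
# trained by plain gradient descent to approximate f(x) = sin(x) on [-pi, pi].
import numpy as np

rng = np.random.default_rng(0)

# Training data: the target function sampled on a grid.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)   # inputs, shape (N, 1)
y = np.sin(x)                                        # targets, shape (N, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters of the network  y_hat = sigmoid(x W1 + b1) W2 + b2.
width = 20                                # illustrative hidden-layer width
W1 = rng.normal(0.0, 1.0, (1, width))
b1 = np.zeros((1, width))
W2 = rng.normal(0.0, 1.0, (width, 1))
b2 = np.zeros((1, 1))

lr = 0.05                                 # illustrative learning rate
for step in range(20000):
    # Forward pass.
    h = sigmoid(x @ W1 + b1)              # hidden activations, shape (N, width)
    y_hat = h @ W2 + b2                   # network output, shape (N, 1)

    # Mean-square error cost and its gradients (backpropagation by hand).
    err = y_hat - y
    grad_out = 2 * err / len(x)           # dC/dy_hat
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0, keepdims=True)
    grad_h = grad_out @ W2.T
    grad_z = grad_h * h * (1 - h)         # sigmoid'(z) = sigmoid(z)(1 - sigmoid(z))
    grad_W1 = x.T @ grad_z
    grad_b1 = grad_z.sum(axis=0, keepdims=True)

    # Gradient-descent update.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

print("final mean-square error:", float(np.mean(err**2)))
```

With enough hidden units the fitted curve tracks sin(x) closely, which is the practical face of the universal approximation results treated in Part II of the book.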

Author(s): Ovidiu Calin
Series: Springer Series in the Data Sciences
Edition: 1
Publisher: Springer
Year: 2020

Language: English
Pages: 768
Tags: Machine Learning, Neural Networks

Foreword
Overview
Part I
Part II
Part III
Part IV
Part V
Bibliographical Remarks
Chapters Diagram
Notations and Symbols
Calculus
Linear Algebra
Probability Theory
Measure Theory
Information Theory
Differential Geometry
Neural Networks
Contents
Part I Introduction to Neural Networks
1 Introductory Problems
1.1 Water in a Sink
1.2 An Electronic Circuit
1.3 The Eight Rooks Problem
1.4 Biological Neuron
1.5 Linear Regression
1.6 The Cocktail Factory Network
1.7 An Electronic Network
1.8 Summary
1.9 Exercises
2 Activation Functions
2.1 Examples of Activation Functions
2.2 Sigmoidal Functions
2.3 Squashing Functions
2.4 Summary
2.5 Exercises
3 Cost Functions
3.1 Input, Output, and Target
3.2 The Supremum Error Function
3.3 The $L^2$-Error Function
3.4 Mean Square Error Function
3.5 Cross-entropy
3.6 Kullback-Leibler Divergence
3.7 Jensen-Shannon Divergence
3.8 Maximum Mean Discrepancy
3.9 Other Cost Functions
3.10 Sample Estimation of Cost Functions
3.11 Cost Functions and Regularization
3.12 Training and Test Errors
3.13 Geometric Significance
3.14 Summary
3.15 Exercises
4 Finding Minima Algorithms
4.1 General Properties of Minima
4.1.1 Functions of a real variable
4.1.2 Functions of several real variables
4.2 Gradient Descent Algorithm
4.2.1 Level sets
4.2.2 Directional derivative
4.2.3 Method of Steepest Descent
4.2.4 Line Search Method
4.3 Kinematic Interpretation
4.4 Momentum Method
4.4.1 Kinematic Interpretation
4.4.2 Convergence conditions
4.5 AdaGrad
4.6 RMSProp
4.7 Adam
4.8 AdaMax
4.9 Simulated Annealing Method
4.9.1 Kinematic Approach for SA
4.9.2 Thermodynamic Interpretation for SA
4.10 Increasing Resolution Method
4.11 Hessian Method
4.12 Newton's Method
4.13 Stochastic Search
4.13.1 Deterministic variant
4.13.2 Stochastic variant
4.14 Neighborhood Search
4.14.1 Left and Right Search
4.14.2 Circular Search
4.14.3 Stochastic Spherical Search
4.14.4 From Local to Global
4.15 Continuous Learning
4.16 Summary
4.17 Exercises
5 Abstract Neurons
5.1 Definition and Properties
5.2 Perceptron Model
5.3 The Sigmoid Neuron
5.4 Logistic Regression
5.4.1 Default probability of a company
5.4.2 Binary Classifier
5.4.3 Learning with the square difference cost function
5.5 Linear Neuron
5.6 Adaline
5.7 Madaline
5.8 Continuum Input Neuron
5.9 Summary
5.10 Exercises
6 Neural Networks
6.1 An Example of Neural Network
6.1.1 Total variation and regularization
6.1.2 Backpropagation
6.2 General Neural Networks
6.2.1 Forward pass through the network
6.2.2 Going backwards through the network
6.2.3 Backpropagation of deltas
6.2.4 Concluding relations
6.2.5 Matrix form
6.2.6 Gradient descent algorithm
6.2.7 Vanishing gradient problem
6.2.8 Batch training
6.2.9 Definition of FNN
6.3 Weights Initialization
6.4 Strong and Weak Priors
6.5 Summary
6.6 Exercises
Part II Analytic Theory
7 Approximation Theorems
7.1 Dini's Theorem
7.2 Arzela-Ascoli's Theorem
7.3 Application to Neural Networks
7.4 Stone-Weierstrass' Theorem
7.5 Application to Neural Networks
7.6 Wiener's Tauberian Theorems
7.6.1 Learning signals in $L^1(\mathbb{R})$
7.6.2 Learning signals in $L^2(\mathbb{R})$
7.7 Contraction Principle
7.8 Application to Recurrent Neural Nets
7.9 Resonance Networks
7.10 Summary
7.11 Exercises
8 Learning with One-dimensional Inputs
8.1 Preliminary Results
8.2 One-Hidden Layer Perceptron Network
8.3 One-Hidden Layer Sigmoid Network
8.4 Learning with ReLU Functions
8.5 Learning with Softplus
8.6 Discrete Data
8.7 Summary
8.8 Exercises
9 Universal Approximators
9.1 Introductory Examples
9.2 General Setup
9.3 Single Hidden Layer Networks
9.3.1 Learning continuous functions $f \in C(I_n)$
9.3.2 Learning square integrable functions $f \in L^2(I_n)$
9.3.3 Learning integrable functions $f \in L^1(I_n)$
9.3.4 Learning measurable functions $f \in \mathcal{M}(\mathbb{R}^n)$
9.4 Error Bounds Estimation
9.5 Learning $q$-integrable functions, $f \in L^q(\mathbb{R}^n)$
9.6 Learning Solutions of ODEs
9.7 Summary
9.8 Exercises
10 Exact Learning
10.1 Learning Finite Support Functions
10.2 Learning with ReLU
10.2.1 Representation of maxima
10.2.2 Width versus Depth
10.3 Kolmogorov-Arnold-Sprecher Theorem
10.4 Irie and Miyake's Integral Formula
10.5 Exact Learning Not Always Possible
10.6 Continuum Number of Neurons
10.7 Approximation by Degenerate Kernels
10.8 Examples of Degenerate Kernels
10.9 Deep Networks
10.10 Summary
10.11 Exercises
Part III Information Processing
11 Information Representation
11.1 Information Content
11.2 Learnable Targets
11.3 Lost Information
11.4 Recoverable Information
11.5 Information Representation Examples
11.5.1 Information for indicator functions
11.5.2 Classical perceptron
11.5.3 Linear neuron
11.5.4 Sigmoid neuron
11.5.5 Arctangent neuron
11.5.6 Neural network of classical perceptrons
11.5.7 Triangle as a union of rectangles
11.5.8 Network of sigmoid neurons
11.5.9 More remarks
11.6 Compressible Layers
11.7 Layers Compressibility
11.8 Functional Independence
11.9 Summary
11.10 Exercises
12 Information Capacity Assessment
12.1 Entropy and Properties
12.1.1 Entropy of a random variable
12.1.2 Entropy under a change of coordinates
12.2 Entropy Flow
12.3 The Input Layer Entropy
12.4 The Linear Neuron
12.5 The Autoencoder
12.6 Conditional Entropy
12.7 The Mutual Information
12.8 Applications to Deep Neural Networks
12.8.1 Entropy Flow
12.8.2 Noisy layers
12.8.3 Independent layers
12.8.4 Compressionless layers
12.8.5 The number of features
12.8.6 Total Compression
12.9 Network Capacity
12.9.1 Types of capacity
12.9.2 The input distribution
12.9.3 The output distribution
12.9.4 The input-output tensor
12.9.5 The input-output matrix
12.9.6 The existence of network capacity
12.9.7 The Lagrange multiplier method
12.9.8 Finding the Capacity
12.9.9 Perceptron Capacity
12.10 The Information Bottleneck
12.10.1 An exact formal solution
12.10.2 The information plane
12.11 Information Processing with MNIST
12.11.1 A two-layer FNN
12.11.2 A three-layer FNN
12.11.3 The role of convolutional nets
12.12 Summary
12.13 Exercises
Part IV Geometric Theory
13 Output Manifolds
13.1 Introduction to Manifolds
13.1.1 Intrinsic and Extrinsic
13.1.2 Tangent space
13.1.3 Riemannian metric
13.1.4 Geodesics
13.1.5 Levi-Civita Connection
13.1.6 Submanifolds
13.1.7 Second Fundamental Form
13.1.8 Mean Curvature Vector Field
13.2 Relation to Neural Networks
13.3 The Parameters Space
13.4 Optimal Parameter Values
13.5 The Metric Structure
13.6 Regularization
13.6.1 Going for a smaller dimension
13.6.2 Norm regularization
13.6.3 Choosing the flattest manifold
13.6.4 Model averaging
13.6.5 Dropout
13.7 Summary
13.8 Exercises
14 Neuromanifolds
14.1 Statistical Manifolds
14.2 Fisher Information
14.3 Neuromanifold of a Neural Net
14.4 Fisher Metric for One Neuron
14.5 The Fisher Matrix and Its Inverse
14.6 The Fisher Metric Structure of a Neural Net
14.7 The Natural Gradient
14.8 The Natural Gradient Learning Algorithm
14.9 Log-likelihood and the Metric
14.10 Relation to the Kullback-Leibler Divergence
14.11 Simulated Annealing Method
14.12 Summary
14.13 Exercises
Part V Other Architectures
15 Pooling
15.1 Approximation of Continuous Functions
15.2 Translation Invariance
15.3 Information Approach
15.4 Pooling and Classification
15.5 Summary
15.6 Exercises
16 Convolutional Networks
16.1 Discrete One-dimensional Signals
16.2 Continuous One-dimensional Signals
16.3 Discrete Two-dimensional Signals
16.4 Convolutional Layers with 1-D Input
16.5 Convolutional Layers with 2-D Input
16.6 Geometry of CNNs
16.7 Equivariance and Invariance
16.7.1 Groups
16.7.2 Actions of groups on sets
16.7.3 Extension of actions to functions
16.7.4 Definition of equivariance
16.7.5 Convolution and equivariance
16.7.6 Definition of invariance
16.8 Summary
16.9 Exercises
17 Recurrent Neural Networks
17.1 States Systems
17.2 RNNs
17.3 Information in RNNs
17.4 Loss Functions
17.5 Backpropagation Through Time
17.6 The Gradient Problems
17.7 LSTM Cells
17.8 Deep RNNs
17.9 Summary
17.10 Exercises
18 Classification
18.1 Equivalence Relations
18.2 Entropy of a Partition
18.3 Decision Functions
18.4 One-hot-vector Decision Maps
18.5 Linear Separability
18.6 Convex Separability
18.7 Contraction Toward a Center
18.8 Learning Decision Maps
18.8.1 Linear Decision Maps
18.8.2 Nonlinear Decision Maps
18.9 Decision Boundaries
18.10 Summary
18.11 Exercises
19 Generative Models
19.1 The Need of Generative Models
19.2 Density Estimation
19.3 Adversarial Games
19.4 Generative Adversarial Networks
19.5 Generative Moment Matching Networks
19.6 Summary
19.7 Exercises
20 Stochastic Networks
20.1 Stochastic Neurons
20.2 Boltzmann Distribution
20.3 Boltzmann Machines
20.4 Boltzmann Learning
20.5 Computing the Boltzmann Distribution
20.6 Entropy of Boltzmann Distribution
20.7 Fisher Information
20.8 Applications of Boltzmann Machines
20.9 Summary
20.10 Exercises
Hints and Solutions
Appendix
Appendix A Set Theory
Appendix B Tensors
Appendix C Measure Theory
Appendix C.1 Information and $\mathfrak{S}$-algebras
Appendix C.2 Measurable Functions
Appendix C.3 Measures
Appendix C.4 Integration in Measure
Appendix C.5 Image Measures
Appendix C.6 Indefinite Integrals
Appendix C.7 Radon-Nikodym Theorem
Appendix C.8 Egorov and Luzin's Theorems
Appendix C.9 Signed Measures
Appendix D Probability Theory
Appendix D.1 General definitions
Appendix D.2 Examples
Appendix D.3 Expectation
Appendix D.4 Variance
Appendix D.5 Information generated by random variables
Appendix D.5.1 Filtrations
Appendix D.5.2 Conditional expectations
Appendix D.6 Types of Convergence
Appendix D.6.1 Convergence in probability
Appendix D.6.2 Almost sure convergence
Appendix D.6.3 $L^p$-convergence
Appendix D.6.4 Weak convergence
Appendix D.7 Log-Likelihood Function
Appendix D.8 Brownian Motion
Appendix E Functional Analysis
Appendix E.1 Banach spaces
Appendix E.2 Linear Operators
Appendix E.3 Hahn-Banach Theorem
Appendix E.4 Hilbert Spaces
Appendix E.5 Representation Theorems
Appendix E.6 Fixed Point Theorem
Appendix F Real Analysis
Appendix F.1 Inverse Function Theorem
Appendix F.2 Differentiation in generalized sense
Appendix F.3 Convergence of sequences of functions
Appendix G Linear Algebra
Appendix G.1 Eigenvalues, Norm, and Inverse Matrix
Appendix G.2 Moore-Penrose Pseudoinverse
Bibliography
Index