Artificial Intelligence and Causal Inference

Artificial Intelligence and Causal Inference addresses the recent development of the relationship between artificial intelligence (AI) and causal inference. Despite significant progress in AI, a great challenge we still face in AI development is understanding the mechanisms underlying intelligence, including reasoning, planning, and imagination. Understanding, transfer, and generalization are major principles that give rise to intelligence, and a key component of understanding is causal inference. Causal inference takes intervention, domain shift learning, temporal structure, and counterfactual thinking as major concepts for understanding causation and reasoning. Unfortunately, these essential components of causality are often overlooked by machine learning, which leads to some of the failures of deep learning. The intersection of AI and causal inference involves (1) using AI techniques as major tools for causal analysis and (2) applying causal concepts and causal analysis methods to solving AI problems. The purpose of this book is to fill the gap between AI and modern causal analysis and to further facilitate the AI revolution. This book is ideal for graduate students and researchers in AI, data science, causal inference, statistics, genomics, bioinformatics, and precision medicine.

Key Features:

  • Covers three types of neural networks, formulates deep learning as an optimal control problem, and uses Pontryagin’s Maximum Principle for network training.
  • Applies deep learning to nonlinear mediation and instrumental variable causal analysis.
  • Formulates the construction of causal networks as a continuous optimization problem (see the sketch after this list).
  • Uses transformers and attention to encode and decode graphs, and reinforcement learning (RL) to infer large causal networks.
  • Uses VAEs, GANs, neural differential equations, recurrent neural networks (RNNs), and RL to estimate counterfactual outcomes.
  • Presents AI-based methods for estimating individualized treatment effects in the presence of network interference.
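
As a concrete illustration of the third feature, the short sketch below shows the kind of continuous acyclicity characterization used when causal network construction is posed as a continuous optimization problem (cf. Section 5.2.4.2): a weighted adjacency matrix W encodes a directed acyclic graph exactly when h(W) = tr(exp(W ∘ W)) - d = 0, so structure learning can minimize a fitting score subject to h(W) = 0. This is a minimal sketch of that idea, not code from the book; the function and variable names are ours.

    # Minimal sketch (not from the book): the acyclicity penalty used in
    # continuous-optimization formulations of causal structure learning,
    # h(W) = tr(exp(W ∘ W)) - d, which equals 0 exactly when W encodes a DAG.
    import numpy as np
    from scipy.linalg import expm

    def acyclicity(W):
        """Return h(W) = tr(exp(W * W)) - d for a weighted adjacency matrix W."""
        d = W.shape[0]
        return np.trace(expm(W * W)) - d  # W * W is the elementwise (Hadamard) square

    # A 3-node chain (a DAG) gives h ~ 0; a 2-node cycle gives h > 0.
    W_dag = np.array([[0.0, 1.2, 0.0],
                      [0.0, 0.0, -0.7],
                      [0.0, 0.0, 0.0]])
    W_cyc = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
    print(acyclicity(W_dag))   # approximately 0.0
    print(acyclicity(W_cyc))   # approximately 1.09, positive because of the cycle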

Author(s): Momiao Xiong
Series: Chapman & Hall/CRC Machine Learning & Pattern Recognition
Publisher: CRC Press
Year: 2022

Language: English
Pages: 424
City: Boca Raton

Cover
Half Title
Series Page
Title Page
Copyright Page
Contents
Preface
CHAPTER 1: Deep Neural Networks
1.1. THREE TYPES OF NEURAL NETWORKS
1.1.1. Multilayer Feedforward Neural Networks
1.1.1.1. Architecture of Feedforward Neural Networks
1.1.1.2. Loss Function and Training Algorithms
1.1.2. Convolutional Neural Network
1.1.2.1. Convolution
1.1.2.2. Nonlinearity (ReLU)
1.1.2.3. Pooling
1.1.2.4. Fully Connected Layers
1.1.3. Recurrent Neural Networks
1.1.3.1. Simple RNN
1.1.3.2. Gated Recurrent Units
1.1.3.3. Long Short-Term Memory (LSTM)
1.1.3.4. Applications of RNN to Modeling and Forecasting of Dynamic Systems
1.1.3.5. Recurrent State Space Models with Autonomous Adjusted Intervention Variable
1.2. DYNAMIC APPROACH TO DEEP LEARNING
1.2.1. Differential Equations for Neural Networks
1.2.2. Ordinary Differential Equations for ResNets
1.2.3. Ordinary Differential Equations for Reversible Neural Networks
1.2.3.1. Stability of Dynamic Systems
1.2.3.2. Second Method of Lyapunov
1.2.3.3. Lyapunov Exponent
1.2.3.4. Reversible ResNet
1.2.3.5. Residual Generative Adversarial Networks
1.2.3.6. Normalizing Flows
1.3. OPTIMAL CONTROL FOR DEEP LEARNING
1.3.1. Mathematical Formulation of Optimal Control
1.3.2. Pontryagin’s Maximum Principle
1.3.3. Optimal Control Approach to Parameter Estimation
1.3.4. Learning Nonlinear State Space Models
1.3.4.1. Joint Estimation of Parameters and Controls
1.3.4.2. Multiple Samples and Parameter Estimation
1.3.4.3. Optimal Control Problem
SOFTWARE PACKAGE
APPENDIX 1A: BRIEF INTRODUCTION OF TENSOR CALCULUS
1A1. Tensor Algebra
1A2. Tensor Calculus
APPENDIX 1B: CALCULATE GRADIENT OF CROSS ENTROPY LOSS FUNCTION
APPENDIX 1C: OPTIMAL CONTROL AND PONTRYAGIN’S MAXIMUM PRINCIPLE
1C1. Optimal Control
1C2. Pontryagin’s Maximum Principle
1C3. Calculus of Variation
1C4. Proof of Pontryagin’s Maximum Principle
EXERCISES
CHAPTER 2: Gaussian Processes and Learning Dynamic for Wide Neural Networks
2.1. INTRODUCTION
2.2. LINEAR MODELS FOR LEARNING IN NEURAL NETWORKS
2.2.1. Notation and Mathematical Formulation of the Dynamics of the Parameter Estimation Process
2.2.2. Linearized Neural Networks
2.3. GAUSSIAN PROCESSES
2.3.1. Motivation
2.3.2. Gaussian Process Models
2.3.3. Gaussian Processes for Regression
2.3.3.1. Prediction with Noise-Free Observations
2.3.3.2. Prediction with Noisy Observations
2.4. WIDE NEURAL NETWORK AS A GAUSSIAN PROCESS
2.4.1. Gaussian Process for Single-Layer Neural Networks
2.4.2. Gaussian Process for Multilayer Neural Networks
APPENDIX 2A: RECURSIVE FORMULA FOR NTK CALCULATION
APPENDIX 2B: ANALYTIC FORMULA FOR PARAMETER ESTIMATION IN THE LINEARIZED NEURAL NETWORKS
EXERCISES
CHAPTER 3: Deep Generative Models
3.1. VARIATIONAL INFERENCE
3.1.1. Introduction
3.1.2. Variational Inference as Optimization
3.1.3. Variational Bound and Variational Objective
3.1.4. Mean-Field Variational Inference
3.1.4.1. A General Framework
3.1.4.2. Bayesian Mixture of Gaussians
3.1.4.3. Mean-Field Variational Inference with Exponential Family
3.1.5. Stochastic Variational Inference
3.1.5.1. Natural Gradient Descent
3.1.5.2. Revisit Variational Distribution for Exponential Family
3.2. VARIATIONAL AUTOENCODER
3.2.1. Autoencoder
3.2.2. Deep Latent Variable Models and Intractability of Likelihood Function
3.2.3. Approximate Techniques and Recognition Model
3.2.4. Framework of VAE
3.2.5. Optimization of the ELBO and Stochastic Gradient Method
3.2.6. Reparameterization Trick
3.2.7. Gradient of Expectation and Gradient of ELBO
3.2.8. Bernoulli Generative Model
3.2.9. Factorized Gaussian Encoder
3.2.10. Full Gaussian Encoder
3.2.11. Algorithms for Computing ELBO
3.2.12. Improve the Lower Bound
3.2.12.1. Importance Weighted Autoencoder
3.2.12.2. Connection between ELBO and KL Distance
3.3. OTHER TYPES OF VARIATIONAL AUTOENCODER
3.3.1. Convolutional Variational Autoencoder
3.3.1.1. Encoder
3.3.1.2. Bottleneck
3.3.1.3. Decoder
3.3.2. Graph Convolutional Variational Autoencoder
3.3.2.1. Notation and Basic Concepts for Graph Autoencoder
3.3.2.2. Spectral-Based Convolutional Graph Neural Networks
3.3.2.3. Graph Convolutional Encoder
3.3.2.4. Graph Convolutional Decoder
3.3.2.5. Loss Function
3.3.2.6. A Typical Approach to Variational Graph Autoencoders
3.3.2.7. Directed Graph Variational Autoencoder
3.3.2.8. Graph VAE for Clustering
SOFTWARE PACKAGE
APPENDIX 3A
APPENDIX 3B: DERIVATION OF ALGORITHMS FOR VARIATIONAL GRAPH AUTOENCODERS
3B1. Evidence Lower Bound
3B2. The Reparameterization Trick
3B3. Stochastic Gradient Variational Bayes (SGVB) Estimator
3B4. Neural Network Implementation
APPENDIX 3C: MATRIX NORMAL DISTRIBUTION
3C1. Notations and Definitions
3C2. Properties of Matrix Normal Distribution
EXERCISES
CHAPTER 4: Generative Adversarial Networks
4.1. INTRODUCTION
4.2. GENERATIVE ADVERSARIAL NETWORKS
4.2.1. Framework and Architecture of GAN
4.2.2. Loss Function
4.2.3. Optimal Solutions
4.2.4. Algorithm
4.2.5. Wasserstein GAN
4.2.5.1. Different Distances
4.2.5.2. The Kantorovich-Rubinstein Duality
4.2.5.3. Wasserstein GAN
4.3. TYPES OF GAN MODELS
4.3.1. Conditional GAN
4.3.1.1. Classical CGAN
4.3.1.2. Robust CGAN
4.3.2. Adversarial Autoencoder and Bidirectional GAN
4.3.2.1. Adversarial Autoencoder (AAE)
4.3.2.2. Bidirectional GAN
4.3.2.3. Anomaly Detection by BiGAN
4.3.3. Graph Representation in GAN
4.3.3.1. Adversarially Regularized Graph Autoencoder
4.3.3.2. Cycle-Consistent Adversarial Networks
4.3.3.3. Conditional Variational Autoencoder and Conditional Generative Adversarial Networks
4.3.3.4. Integrated Conditional Graph Variational Adversarial Networks
4.3.4. Deep Convolutional Generative Adversarial Network
4.3.4.1. Architecture of DCGAN
4.3.4.2. Generator
4.3.4.3. Discriminator Network
4.3.5. Multi-Agent GAN
4.4. GENERATIVE IMPLICIT NETWORKS FOR CAUSAL INFERENCE WITH MEASURED AND UNMEASURED CONFOUNDERS
4.4.1. Generative Implicit Models
4.4.2. Loss Function
4.4.2.1. Bernoulli Loss
4.4.2.2. Loss Function for the Generative Implicit Models
4.4.3. Divergence Minimization
4.4.4. Lower Bound of the f-Divergence
4.4.4.1. Tighten Lower Bound of the f-Divergence
4.4.5. Representation for the Variational Function
4.4.6. Single-Step Gradient Method for Variational Divergence Minimization (VDM)
4.4.7. Random Vector Functional Link Network for Pearson χ2 Divergence
SOFTWARE PACKAGE
EXERCISES
CHAPTER 5: Deep Learning for Causal Inference
5.1. FUNCTIONAL ADDITIVE MODELS FOR CAUSAL INFERENCE
5.1.1. Correlation, Causation, and Do-Calculus
5.1.2. The Rules of Do-Calculus
5.1.3. Structural Equation Models and Additive Noise Models for Two Variables or Two Sets of Variables
5.1.4. VAE and ANMs for Causal Analysis
5.1.4.1. Evidence Lower Bound (ELBO) for ANM
5.1.4.2. Computation of the ELBO
5.1.5. Classifier Two-Sample Test for Causation
5.1.5.1. Procedures of the VCTEST (Figure 5.5)
5.2. LEARNING STRUCTURAL CAUSAL MODELS WITH GRAPH NEURAL NETWORKS
5.2.1. A General Framework for Formulation of Causal Inference into Continuous Optimization
5.2.1.1. Score Function and New Acyclic Constraint
5.2.2. Parameter Estimation and Optimization
5.2.2.1. Transform the Equality-Constrained Optimization Problem into an Unconstrained Optimization Problem
5.2.2.2. Compact Representation for the Hessian Approximation Ek and Limited-Memory-BFGS
5.2.3. VAE for Learning Structural Models and DAG among Observed Variables
5.2.3.1. Linear Structural Equation Model and Graph Neural Network Model
5.2.3.2. ELBO for Learning the Generative Model
5.2.3.3. Computation of ELBO
5.2.3.4. Optimization Formulation for Learning DAG
5.2.4. Loss Function and Acyclicity Constraint
5.2.4.1. OLS Loss Function
5.2.4.2. A New Characterization of Acyclicity
5.3. LATENT CAUSAL STRUCTURE
5.3.1. Latent Space and Latent Representation
5.3.2. Mapping Observed Variables to the Latent Space
5.3.2.1. Mask Layer
5.3.2.2. Encoder and Decoder for Latent Causal Graph
5.3.3. ELBO for the Log-Likelihood log pθ (Y | X)
5.3.4. Computation of ELBO
5.3.4.1. Encoder
5.3.4.2. Decoder
5.3.4.3. Learning Latent Causal Graph
5.3.5. Optimization for Learning the Latent DAG
5.4. CAUSAL MEDIATION ANALYSIS
5.4.1. Basics of Mediation Analysis
5.4.1.1. Univariate Mediation Model
5.4.1.2. Multivariate Mediation Analysis
5.4.1.3. Cascade Unobserved Mediator Model
5.4.1.4. Unobserved Multivariate Mediation Model
5.4.2. VAE for Cascade Unobserved Mediator Model
5.4.2.1. ELBO for Cascade Mediator Model
5.4.2.2. Encoder and Decoder
5.4.2.3. Test Statistics
5.5. CONFOUNDING
5.5.1. Deep Latent Variable Models for Causal Inference under Unobserved Confounders
5.5.2. Treatment Effect Formulation for Causal Inference with Unobserved Confounder
5.5.2.1. Decoder
5.5.2.2. Encoder
5.5.3. ELBO
5.6. INSTRUMENTAL VARIABLE MODELS
5.6.1. Simple Linear IV Regression and Mendelian Randomization
5.6.1.1. Two-Stage Least Square Method
5.6.1.2. Assumptions of IV
5.6.2. IV and Deep Latent Variable Models
5.6.2.1. Decoder
5.6.2.2. Encoder
5.6.2.3. ELBO
SOFTWARE PACKAGE
APPENDIX 5A: DERIVE EVIDENCE LOWER BOUND (ELBO) FOR ANM
APPENDIX 5B: APPROXIMATION OF EVIDENCE LOWER BOUND (ELBO) FOR ANM
APPENDIX 5C: COMPUTATION OF KL DISTANCE
APPENDIX 5D: BFGS AND LIMITED BFGS UPDATING ALGORITHM
APPENDIX 5E: NONSMOOTH OPTIMIZATION ANALYSIS
APPENDIX 5F: COMPUTATION OF ELBO FOR LEARNING SEMS
5F1. ELBO for SEMs
5F2. The Reparameterization Trick
5F3. Stochastic Gradient Variational Bayes (SGVB) Estimator
5F4. Neural Network Implementation
EXERCISES
CHAPTER 6: Causal Inference in Time Series
6.1. INTRODUCTION
6.2. FOUR CONCEPTS OF CAUSALITY FOR MULTIPLE TIME SERIES
6.2.1. Granger Causality
6.2.2. Sims Causality
6.2.3. Intervention Causality
6.2.4. Structural Causality
6.3. STATISTICAL METHODS FOR GRANGER CAUSALITY INFERENCE IN TIME SERIES
6.3.1. Bivariate Granger Causality Test
6.3.1.1. Bivariate Linear Granger Causality Test
6.3.1.2. Bivariate Nonlinear Causality Test
6.3.2. Multivariate Granger Causality Test
6.3.2.1. Multivariate Linear Granger Causality Test
6.3.3. Nonstationary Time Series Granger Causal Analysis
6.3.3.1. Background
6.3.3.2. Multivariate Nonlinear Causality Test for Nonstationary Time Series
6.3.4. Granger Causal Networks
6.3.4.1. Introduction
6.3.4.2. Architecture of Granger Causal Networks
6.3.4.3. Component-Wise Multilayer Perceptron (cMLP) for Inferring Granger Causal Networks
6.3.4.4. Component-Wise Recurrent Neural Networks (cRNNs) for Inferring Granger Causal Networks
6.3.4.5. Statistical Recurrent Units for Inferring Granger Causal Networks
6.4. NONLINEAR STRUCTURAL EQUATION MODELS FOR CAUSAL INFERENCE ON MULTIVARIATE TIME SERIES
SOFTWARE PACKAGE
APPENDIX 6A: TEST STATISTIC TNNG ASYMPTOTICALLY FOLLOWS A NORMAL DISTRIBUTION
APPENDIX 6B: HSIC-BASED TESTS FOR INDEPENDENCE BETWEEN TWO STATIONARY MULTIVARIATE TIME SERIES
6B1. Reproducing Kernel Hilbert Space
6B2. Tensor Product
6B3. Cross-Covariance Operator
6B4. The Hilbert-Schmidt Independence Criterion
EXERCISES
CHAPTER 7: Deep Learning for Counterfactual Inference and Treatment Effect Estimation
7.1. INTRODUCTION
7.1.1. Potential Outcome Framework and Counterfactual Causal Inference
7.1.2. Assumptions and Average Treatment Effect
7.1.3. Traditional Methods without Unobserved Confounders
7.1.3.1. Regression Adjustment
7.1.3.2. Propensity Score Methods
7.1.3.3. Doubly Robust Estimation (DRE) and G-Methods
7.1.3.4. Targeted Maximum Likelihood Estimator (TMLE)
7.2. COMBINE DEEP LEARNING WITH CLASSICAL TREATMENT EFFECT ESTIMATION METHODS
7.2.1. Adaptive Learning for Treatment Effect Estimation
7.2.1.1. Problem Formulation
7.2.2. Architecture of Neural Networks
7.2.3. Targeted Regularization
7.3. COUNTERFACTUAL VARIATIONAL AUTOENCODER
7.3.1. Introduction
7.3.2. Variational Autoencoders
7.3.2.1. CVAE
7.3.2.2. iVAE
7.3.3. Architecture of CFVAE
7.3.4. ELBO
7.3.4.1. Encoder
7.3.4.2. Decoder
7.3.4.3. Computation of the KL Distance
7.3.4.4. Calculation of ELBO
7.4. VARIATIONAL AUTOENCODER FOR SURVIVAL ANALYSIS
7.4.1. Introduction
7.4.2. Notations and Problem Formulation
7.4.3. Classical Survival Analysis Theory
7.4.4. Potential Outcome (Survival Time) and Censoring Time Distributions
7.4.5. VAE Causal Survival Analysis
7.4.5.1. Deep Latent Model
7.4.5.2. ELBO
7.4.5.3. Encoder
7.4.5.4. Decoder
7.4.5.5. Computation of the KL Distance
7.4.5.6. Calculation of ELBO
7.4.5.7. Prediction
7.4.6. VAE-Cox Model for Survival Analysis
7.4.6.1. Cox Model
7.4.6.2. Likelihood Estimation for the Cox Model
7.4.6.3. A Censored-Data Likelihood
7.4.6.4. Object Function for VAE-Cox Model
7.5. TIME SERIES CAUSAL SURVIVAL
7.5.1. Introduction
7.5.2. Multi-State Survival Models
7.5.2.1. Notations and Basic Concepts
7.5.3. Multi-State Survival Models
7.5.3.1. Transition Probabilities, the Kolmogorov Forward Equations and Likelihood Function
7.5.3.2. Likelihood Function with Interval Censoring
7.5.3.3. Neural Ordinary Differential Equations (NODE) for Multi-State Survival Models
7.6. NEURAL ORDINARY DIFFERENTIAL EQUATION APPROACH TO TREATMENT EFFECT ESTIMATION AND INTERVENTION ANALYSIS
7.6.1. Introduction
7.6.2. Latent NODE for Irregularly-Sampled Time Series
7.6.3. Augmented Counterfactual ODE for Effect Estimation of Time Series Interventions with Confounders
7.6.3.1. Potential Outcome Framework for Estimation of Effect of Time Series Interventions
7.6.3.2. Augmented Counterfactual Ordinary Differential Equations
7.7. GENERATIVE ADVERSARIAL NETWORKS FOR COUNTERFACTUAL AND TREATMENT EFFECT ESTIMATION
7.7.1. A General GAN Model for Estimation of ITE with Discrete Outcome and Any Type of Treatment
7.7.1.1. Potential Outcome Framework
7.7.1.2. Conditional GAN as a General Framework for Estimation of ITE
7.7.2. Adversarial Variational Autoencoder-Generative Adversarial Network (AVAE-GAN) for Estimation in the Presence of Unmeasured Confounders
7.7.2.1. Architecture of AVAE-GAN
7.7.2.2. VAE with Disentangled Latent Factors
SOFTWARE PACKAGE
APPENDIX 7A: DERIVE EVIDENCE LOWER BOUND
APPENDIX 7B: DERIVATION OF KOLMOGOROV FORWARD EQUATIONS
APPENDIX 7C: INVERSE RELATIONSHIP OF THE KOLMOGOROV BACKWARD EQUATION
APPENDIX 7D: INTRODUCTION TO PONTRYAGIN’S MAXIMUM PRINCIPLE
APPENDIX 7E: ALGORITHM FOR ITE BLOCK OPTIMIZATION
APPENDIX 7F: ALGORITHMS FOR IMPLEMENTING STOCHASTIC GRADIENT DESCENT
EXERCISES
CHAPTER 8: Reinforcement Learning and Causal Inference
8.1. INTRODUCTION
8.2. BASIC REINFORCEMENT LEARNING THEORY
8.2.1. Formalization of the Problem
8.2.1.1. Markov Decision Process and Notation
8.2.1.2. State-Value Function and Policy
8.2.1.3. Optimal Value Functions and Policies
8.2.1.4. Bellman Optimality Equation
8.2.2. Dynamic Programming
8.2.2.1. Policy Evaluation
8.2.2.2. Value Function and Policy Improvement
8.2.2.3. Policy Iteration
8.2.2.4. Monte Carlo Policy Evaluation
8.2.2.5. Temporal-Difference Learning
8.2.2.6. Comparisons: Dynamic Programming, Monte Carlo Methods, and Temporal Difference Methods
8.3. APPROXIMATE FUNCTION AND APPROXIMATE DYNAMIC PROGRAMMING
8.3.1. Introduction
8.3.2. Linear Function Approximation
8.3.3. Neural Network Approximation
8.3.4. Value-Based Methods
8.3.4.1. Q-Learning
8.3.4.2. Deep Q-Network
8.4. POLICY GRADIENT METHODS
8.4.1. Introduction
8.4.2. Policy Approximation
8.4.3. REINFORCE: Monte Carlo Policy Gradient
8.4.4. REINFORCE with Baseline
8.4.5. Actor–Critic Methods
8.4.6. n–Step Temporal Difference (TD)
8.4.6.1. n–Step Prediction
8.4.7. TD(λ) Methods
8.4.8. Sarsa and Sarsa (λ)
8.4.9. Watkins’s Q(λ)
8.4.10. Actor-Critic and Eligibility Trace
8.5. CAUSAL INFERENCE AND REINFORCEMENT LEARNING
8.5.1. Deconfounding Reinforcement Learning
8.5.1.1. Adjust for Measured Confounders
8.5.1.2. Proxy Variable Approximation to Unobserved Confounding
8.5.1.3. Deep Latent Model for Identifying the Proxy Variables of Confounders
8.5.1.4. Reward and Causal Effect Estimation
8.5.1.5. Variational Autoencoder for Reinforcement Learning
8.5.1.6. Encoder
8.5.1.7. Decoder and ELBO
8.5.1.8. Deconfounding Causal Effect Estimation and Actor-Critic Methods
8.5.2. Counterfactuals and Reinforcement Learning
8.5.2.1. Structural Causal Model for Counterfactual Inference
8.5.2.2. Bidirectional Conditional GAN (BiCoGAN) for Estimation of Causal Mechanism
8.5.2.3. Dueling Double-Deep Q-Networks and Augmented Counterfactual Data for Reinforcement Learning
8.6. REINFORCEMENT LEARNING FOR INFERRING CAUSAL NETWORKS
8.6.1. Introduction
8.6.2. Mathematical Formulation of Inferring Causal Networks Using Bidirectional Conditional GAN
8.6.3. Framework of Reinforcement Learning for Combinatorial Optimization
8.6.4. Graph Encoder and Decoder
8.6.4.1. Mathematical Formulation of Graph Embedding
8.6.4.2. Node Embedding
8.6.4.3. Shallow Embedding Approaches
8.6.4.4. Attention and Transformer for Combinatorial Optimization and Construction of Directed Acyclic Graph
SOFTWARE PACKAGE
APPENDIX 8A: BIDIRECTIONAL RNN FOR ENCODING
APPENDIX 8B: CALCULATION OF KL DIVERGENCE
EXERCISES
REFERENCES
INDEX