Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics

Electronic Health Record (EHR)/Electronic Medical Record (EMR) data are increasingly used for research. However, analyzing this type of data involves many unique complications arising from how the data are collected and processed and from the types of questions that can be answered. Drawing on the authors' many years of practical experience, this book covers many important topics in using EHR/EMR data for research, including data extraction, cleaning, processing, analysis, inference, and prediction. The book carefully evaluates and compares standard statistical models and approaches with machine learning and deep learning methods, and reports unbiased comparison results for these methods in predicting clinical outcomes from EHR data.

Key Features:

  • Draws on the hands-on experience of contributors to multidisciplinary EHR research projects, which include methods and approaches from statistics, computing, informatics, data science, and clinical/epidemiological domains.
  • Documents detailed experience with EHR data extraction, cleaning, and preparation.
  • Provides a broad view of statistical approaches and machine learning prediction models for dealing with the challenges and limitations of EHR data.
  • Considers the complete cycle of EHR data analysis.

EHR/EMR data analysis requires close collaboration among statisticians, informaticians, data scientists, and clinical/epidemiological investigators. This book reflects that multidisciplinary perspective.

Author(s): Hulin Wu, Jose Miguel Yamal, Ashraf Yaseen, Vahed Maroufy
Series: Chapman & Hall/CRC Healthcare Informatics Series
Publisher: CRC Press
Year: 2020

Language: English
Pages: 327
City: Boca Raton

Cover
Half Title
Series Page
Title Page
Copyright Page
Contents
Preface
About the Editors
Contributors
1. Introduction: Use of EHR Data for Scientific Discoveries--Challenges and Opportunities
1.1. Real-World Data and Real-World Evidence: Big Data in Practice
1.2. Use of EMR/EHR Database for Research and Scientific Discoveries: Procedure and Life Cycle
1.2.1. Initiate a Project
1.2.2. Data Queries and Data Extraction
1.2.3. Data Cleaning
1.2.4. Data Pre-Processing or Processing
1.2.5. Data Preparation
1.2.6. Data Analysis, Modeling and Prediction
1.2.7. Result Validation
1.2.8. Result Interpretation
1.2.9. Publication and Dissemination
1.3. Challenges and Opportunities
References
2. EHR Project Management
2.1. Introduction
2.1.1. What is Project Management?
2.1.2. Why We Need Project Management?
2.1.3. Project Management Goals and Principles
2.2. Project and Sub-Project in EHR Research
2.3. Data, Code and Product Management
2.3.1. Data Loss Prevention
2.3.2. Naming Conventions
2.3.3. Version Control
2.3.4. Coding Convention
Object-Oriented or Non-Object-Oriented Programming
2.3.5. Document Management: Data Analysis Report, Papers and Read-Me Documents
2.4. Team/People Management
2.4.1. How to Form a Team: What Expertise is Needed for EHR Projects?
2.4.2. How to Efficiently Manage a Multidisciplinary Team?
2.4.3. Task Management
2.5. Management Methods and Software Tools
2.6. An Example of a Data Management Framework
2.6.1. Folder Management
Naming
Structure
Main Folders
CBD_HS
Public_Folder
Admin
Useful_Info
Group Folders
Project Folders
Sub_Project Folders
2.6.2. File Management
Naming
Structure
File Submission
2.6.3. User Management
User Groups
2.6.4. Data Management Framework
2.7. Discussion and Summary
2.8. Appendix--File Submission Form
Note
References
3. EHR Databases and Data Management: Data Query and Extraction
3.1. Introduction
3.2. EHR/EMR Database Availability and Access
3.3. EHR/EMR Database Design and Structure: Database Queries
3.3.1. Database Construction
3.3.2. Traditional Relational Database System
3.3.3. Distributed Database System
3.4. Data Extraction
3.4.1. Define Inclusion/Exclusion Criteria for Data Extraction
3.4.2. Phenotyping: Cohort Identification
3.5. Data Extraction Report
3.6. Illustration Example: Subarachnoid Hemorrhage (SAH) Project
3.6.1. EHR Database Design and Construction
3.6.2. SAH Cohort Identification and Data Extraction
3.6.3. Data Extraction Report
3.6.4. Potential Data Extraction Pitfalls and Errors with Solutions
References
4. EHR Data Cleaning
4.1. Introduction
4.2. Review of Current Data Cleaning Methods and Tools
4.2.1. Data Wranglers
4.2.2. Data Cleaning Tools for Specific EHR Datasets
4.2.3. Data Quality Assessment
4.3. Common EHR Data Errors and Fixing Methods
4.3.1. List of Common Errors in an EHR Database
4.3.2. Demographics Table
Multiple Race and Gender
Multiple Patient Keys for the Same Encounter ID
Multiple Calculated Birth Date
4.3.3. Lab Table
Developing Conversion Map
Conversion Map ID
Convert To
Conversion Equation
The Lower Limit and Upper Limit
Lab Date and Time
User Input Form and Report Generator
Output
4.3.4. Clinical Event Table
Variable Combining
Information Recovery
A Case Study
Overlap of Different Tables
Correction of Misinformation
4.3.5. Diagnosis and Medication Table
4.3.6. Procedure Table
Introduction to the Procedure Code Data
Procedure Table Data Cleaning
4.4. Discussion
Acknowledgments
Notes
References
5. EHR Data Pre-Processing and Preparation
5.1. Introduction
5.1.1. Definition of Data Pre-Processing/Processing
5.1.2. Definition of Data Preparation
5.2. Data Pre-Processing
5.2.1. Tidy Data Principles
Variable Encoding
5.2.2. Feature Extraction: Derived Variables
5.2.3. Dimension Reduction
Variable Grouping or Clustering
Principal Component Analysis (PCA)
Embedding and Deep Learning
5.2.4. Missing Data Imputation
5.3. Data Preparation
5.3.1. Define the Endpoint or Outcome
5.3.2. Process Medical Record Timestamps
5.3.3. Define the Encounter Time Interval
5.3.4. Encounter Combination
5.3.5. Define Comparison Groups
5.3.6. Cohort Refining
5.3.7. Leakage Detection
5.3.8. Data Preparation for Different Analysis Purposes
5.4. Data Processing/Preparation Errors and Pitfalls with Solutions
5.5. Data Pre-Processing and Preparation Report
5.6. Summary
References
6. Missing Data Issues in EHR
6.1. Introduction and Overview
6.2. Missing Data Mechanisms
6.3. Methods for Incomplete EHR Data
6.3.1. Naïve Method
6.3.2. Imputation Using Statistical Models
6.3.3. Machine Learning and Deep Learning Models
6.3.4. Choice of Best Method for EHR Data
6.4. Case Study
6.4.1. Missing Condition in EHR Data
6.4.2. Missing Imputation in EHR Datasets
6.4.3. Evaluating the Performance of Imputation Methods and Thresholds
6.5. Discussion and Conclusion
References
7. Causal Inference and Analysis for EHR Data
7.1. Introduction
7.1.1. Why Causal Inference
7.1.2. Overview of Causal Inference Methods: Rubin Causal Model (RCM)
7.1.3. Basic Framework in Causality: Potential Outcome Framework
Average and Individual Treatment Effects
7.2. Propensity Scoring
7.2.1. Brief Introduction
7.2.2. Propensity Scoring for Binary Treatments
7.2.3. Propensity Scoring for Multiple Treatments
7.2.4. Propensity Scoring for Ordinal Treatments
7.2.5. Propensity Score Estimation for Complex Data Sets
7.2.6. Illustration Example: Subarachnoid Hemorrhage (SAH) Project
7.3. Mediation Analysis
7.3.1. Introduction to Mediation Analysis
7.3.2. The Product Method
7.3.3. The Difference Method
7.3.4. Other Considerations
7.4. Instrumental Variables Networks for Treatment Effect Estimation in the Presence of Unmeasured Confounders
7.4.1. Instrumental Variables Frameworks
7.4.2. Two-Stage Least Square Methods with Linear Models
Simple Linear Models
Covariance Analysis
Generalized Least Square Estimator
Two-Stage Least Square Method
Nonlinear Models for Two-Stage Least Squares Approach
7.5. Learning Treatment Effect by Generative Adversarial Networks
7.5.1. Introduction
7.5.2. CGANs as a General Framework for Estimation of Individualized Treatment Effects
The Architecture of CGANs for Generating Potential Outcomes
CGANs for Estimating ITEs
CGANs for Estimating ITEs in Survival Analysis
7.5.3. Wasserstein GANs for Estimation of Individualized Treatment Effects
7.5.4. MisCGANs for Estimation of Individualized Treatment Effects
The General Process for Incompletely Observed Data
MisGAN for Counterfactual Imputation
7.5.5. Optimal Treatment Selection
Sparse Techniques for Biomarker Identification
Biomarker Identification for Optimal Treatment Selection
7.6. Deconfounder in Estimation of Treatment Effects
7.6.1. Introduction
7.6.2. Causal Models with Latent Confounders
7.6.3. Adversarial Learning Confounders
7.6.4. Loss Function and Optimization for Estimating ITEs in the Presence of Confounders
7.7. Targeted Maximum Likelihood Estimation
7.8. Supplementary Note A
7.8.1. Wasserstein GAN
A1 Different Distances
A1.1 Maximum Likelihood Estimation
A1.2 Total Variation (TV) Distance
A1.3 The Kullback-Leibler (KL) Divergence
A1.4 The Jensen-Shannon (JS) Divergence
A1.5 Earth Mover (EM) or Wasserstein Distance
A2 Wasserstein GAN
A3 Algorithm (WGAN)
References
8. EHR Data Exploration, Analysis and Predictions: Statistical Models and Methods
8.1. Introduction
8.1.1. Statistical Challenges for EHR Data
8.1.2. Overview of Existing Methods
8.2. Data Exploration and Visualization
8.3. Statistical Models for EHR Data
8.3.1. Contingency Tables
8.3.2. Chi-Square Test
8.3.3. Hypergeometric Test
8.4. GLM
8.5. Survival Model
8.6. Mixed-Effect Models
8.7. Time Series Analysis
8.7.1. AR, MA and ARMA Model
8.7.2. Gaussian Process
8.8. Variable Selection Methods
8.8.1. Stepwise Variable Selection
8.8.2. Purposeful Variable Selection
8.8.3. SIS
8.8.4. Penalty-Based Methods
8.9. Divide-and-Conquer Method
8.10. Validation
8.11. Results and Examples
8.12. Discussions and Conclusions
References
9. Neural Network and Deep Learning Methods for EHR Data
9.1. Introduction
9.2. Deep Learning Methods for EHR Data
9.3. Deep Learning Software Tools and Implementation
9.4. Application Examples
Case Study 1: Application of MLP for Mortality Prediction
Case Study 2: Application of RNN for Heart Failure Prediction for Hypertension Patients
Experimental Setting
RNN Prediction Results
9.5. Discussion
References
10. EHR Data Analytics and Predictions: Machine Learning Methods
10.1. Machine Learning Overview
10.2. Machine Learning Methods
Random Forest
Extremely Randomized Tree
Gradient Boosting
XGBoost
Support Vector Machine (SVM)
10.3. Machine Learning Software Tools
H2O
Caret
TPOT
Auto-sklearn
10.4. Application Example: SAH Project
Prediction Scenarios
Predictors and Outcome Data
Model Training
Result of Outcome Prediction
Evaluation of Model Performance
10.5. Conclusion and Recommendation
References
11. Use of EHR Data for Research: Future
11.1. Future EHR Research
11.2. Post-Research Practice
11.3. Summary
References
Index