Statistical Genomics

This volume provides a collection of protocols from researchers in the statistical genomics field. Chapters focus on integrating genomics with other “omics” data, such as transcriptomics, epigenomics, proteomics, metabolomics, and metagenomics. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls.

Cutting-edge and thorough, Statistical Genomics covers these diverse and timely topics to give researchers insight into the future directions and priorities of pan-omics and the precision medicine era.

Author(s): Brooke Fridley, Xuefeng Wang
Series: Methods in Molecular Biology, 2629
Publisher: Humana Press
Year: 2023

Language: English
Pages: 376
City: New York

Preface
Contents
Contributors
Chapter 1: Multi-omics Data Deconvolution and Integration: New Methods, Insights, and Translational Implications
1 Introduction
2 Emerging Methods for Multi-omics Assays: From Deconvolution to Integration
3 Concluding Remarks
References
Chapter 2: Statistical and Machine Learning Methods for Discovering Prognostic Biomarkers for Survival Outcomes
1 Introduction
2 Methods
2.1 Examination of Survival Outcomes
2.2 Univariable and Multivariable Cox Regression Analysis Pipeline
2.3 Prognostic Biomarker Screening Based on the Lasso and Elastic Net Methods
2.4 Efficient Gradient Boosting Methods
2.5 Deep Learning Methods
3 Conclusions
References
Chapter 3: Cell-Type Deconvolution of Bulk DNA Methylation Data with EpiSCORE
1 Introduction
2 Materials
2.1 Software and Hardware Required
2.2 Installation of EpiSCORE R Package
2.3 Single-Cell RNA-Seq Datasets Used in Case Study
2.4 DNAm Datasets Used in Case Study
3 Methods
3.1 Overview of EpiSCORE
3.2 Construction of a Tissue-Specific scRNA-Seq Reference Matrix
3.3 Imputation of DNAm Reference
3.4 Estimation of Cell-Type Fractions and Identification of Cell-Type-Specific Differential DNAm
3.5 Case Study
3.6 Summary
4 Notes
References
Chapter 4: Profiling Cellular Ecosystems at Single-Cell Resolution and at Scale with EcoTyper
1 Introduction
2 Materials
3 Methods
3.1 Discovery of Cell States and Ecotypes from Bulk Data
3.1.1 Cell-Type Fraction Estimation
3.1.2 Cell-Type Expression Purification
3.1.3 Cell State Discovery
3.1.4 Determining the Number of Cell States
3.1.5 Cell State Quality Control
3.1.6 Ecotype (Cellular Community) Discovery
3.1.7 Configuring the EcoTyper Run
3.1.8 Input Section of Configuration File
Discovery Dataset Name
Expression Matrix
Cell-Type Fractions
Expression Type
Annotation File (Optional)
Annotation File Column to Scale by (Optional)
Annotation File Column(s) to Plot
CIBERSORTx Username and Token
3.1.9 Output Section of Configuration File
3.1.10 Pipeline Settings
Pipeline Steps to Skip
Number of Threads
Number of NMF Restarts
Maximum Number of States per Cell Type
Cophenetic Coefficient Cutoff
CIBERSORTx Fractions/HiRes Singularity Path
3.1.11 Run Bulk Discovery Mode
3.1.12 Overview of Output Files
3.1.13 Example Downstream Analyses
3.2 Discovery of Cell States and Ecotypes from scRNA-Seq Data
3.2.1 Extract Cell-Type-Specific Genes from scRNA-seq Input Matrix (Optional)
3.2.2 Cell State Discovery
Round One
Number of Cell States per Cell Type
Round Two
3.2.3 Ecotype Discovery
3.2.4 Configuring the EcoTyper Run
3.2.5 Input Section of Configuration File
Expression Matrix
Annotation File
3.2.6 Output Section of Configuration File
3.2.7 Pipeline Settings
Filter Non-Cell-Type-Specific Genes
Jaccard Matrix p-Value Cutoff
3.2.8 Run Single-Cell Discovery Mode
3.2.9 Overview of Output Files
3.2.10 Example Downstream Analysis
3.3 Recovery of Cell States and Ecotypes from Bulk Samples
3.3.1 Run Recovery from Bulk Data
3.3.2 Overview of Output Files
3.4 Recovery of Cell States and Ecotypes from Single-Cell RNA-seq Data
3.4.1 Run Recovery from scRNA-seq Data
3.4.2 Overview of Output Files
3.5 Recovery of Cell States and Ecotypes from Spatial Transcriptomics Data
3.5.1 Overview of Input Files
3.5.2 Configuration File
Input Visium Directory
Recovery Cell-Type Fractions
Background Cell Type
3.5.3 Run Recovery from Spatial Transcriptomics Data
3.5.4 Output Files
3.5.5 Downstream Analysis
3.6 Conclusion
4 Notes
References
Chapter 5: Statistical Methods for Integrative Clustering of Multi-omics Data
1 Introduction
2 Model-Based Methods of Integrative Clustering
2.1 The iCluster Method
2.2 The iClusterPlus Method
2.3 The iClusterBayes Method
3 Nonparametric Methods of Integrative Clustering
3.1 Similarity Network Fusion (SNF)
3.2 Perturbation Clustering (PINS)
3.3 Neighborhood-Based Multi-omics Clustering (NEMO)
4 Integrative NMF Clustering (intNMF)
5 Preparation of Multi-omics Data for Integrative Clustering Analysis
6 Integrative Clustering Analyses of Uveal Melanoma
7 Integrative Clustering Analyses of Lower-Grade Glioma (LGG)
8 Conclusion
References
Chapter 6: Analysis of Single-Cell RNA-seq Data
1 Introduction
2 Preliminaries
2.1 Code Setup
2.2 Data Description
3 Quality Control and Normalization
3.1 Read-Level Quality Control and Bioinformatic Processing
3.2 Cell-Level Quality Control
3.3 Normalization
3.4 Data Integration and Batch Effect Removal
4 Dimension Reduction
4.1 Feature Selection
4.2 Dimensionality Reduction
5 Cell-Type Annotation
6 Downstream Statistical Analysis
6.1 Differential Expression Analysis
6.2 Trajectory Inference
7 Summary
References
Chapter 7: A Primer on Preprocessing, Visualization, Clustering, and Phenotyping of Barcode-Based Spatial Transcriptomics Data
1 Introduction
2 Pre-processing and Quality Control
3 Visualization
4 Clustering
5 Phenotyping of Sampling Units
6 Final Remarks
References
Chapter 8: Statistical Analysis of Multiplex Immunofluorescence and Immunohistochemistry Imaging Data
1 Introduction
2 Ovarian and Lung Cancer Datasets
3 Methods
3.1 Image Transformation, Normalization, and Batch Correction
3.1.1 Image Normalization
3.1.2 Batch Correction
3.1.3 Future Research and Software Development
3.2 Cell Phenotyping
3.2.1 Marker Gating
3.2.2 Unsupervised Clustering Algorithms
3.3 Analysis of Cellular Composition and Marker Expression
3.3.1 Modeling Counts or Proportions with Overdispersion and Zero Inflation
3.3.2 Analysis of Functional Markers
3.4 Spatial Analysis
3.4.1 Basic Tools for Data Exploration and Visualization
3.4.2 Spatial Statistical Learning Models
3.4.3 Spatial Metrics Based on Point Processes
3.4.4 Point Process Adaptations for mIF
3.5 Software
4 Conclusions
References
Chapter 9: Statistical Analysis in ChIP-seq-Related Applications
1 Introduction
2 Methods
2.1 Quality Control
2.2 Peak Detection
2.3 Signal Reproducibility
2.4 Normalization Strategies
2.5 Bias and Batch Correction
2.6 Differential Binding Analysis
2.7 Peak Annotation
2.8 Multimodal Integration with Other Genomic Information
3 Summary
References
Chapter 10: Bioinformatic and Statistical Analysis of Microbiome Data
1 Introduction
2 Datasets Used to Illustrate the Methods
3 Bioinformatic and Statistical Methods for Microbiome Data Analysis
3.1 Overview of Bioinformatic Pipeline for Raw Sequencing Data Analysis
3.2 Bioinformatic Analysis of Marker-Gene Sequencing Data
3.2.1 Sequencing Error Control and Variant Call
3.2.2 Taxonomic Classification
3.2.3 Phylogenetic Tree Construction
3.3 Bioinformatic Analysis of Metagenome Shotgun Sequencing Data
3.3.1 Quality Control and Decontamination
3.3.2 Reference-Based Taxonomy Identification
3.3.3 Reference-Based Functional Classification
3.3.4 De Novo Metagenomic Assembly Analysis
3.4 Statistical Analysis of Microbiome Data
3.4.1 Structure of Microbiome Data
3.4.2 Property of Microbiome Data
3.4.3 Quality Control of Microbiome Data
3.4.4 Normalization of Microbiome Data
3.4.5 Exploratory Analysis of Microbiome Data
3.4.6 Alpha Diversity
3.4.7 Beta Diversity
3.4.8 Microbiome-Wide Association Analysis
3.4.9 Community-Level Association Analysis Based on Alpha Diversity
3.4.10 Community-Level Association Analysis Based on Beta Diversity
3.4.11 Biodiversity-Free Test of Microbiome Community Association
3.4.12 Univariate Feature-Wise Associated Analysis Methods
3.4.13 Visualization of Univariate Association Analysis
3.4.14 Machine Learning Methods for Microbial Biomarker Discovery
4 Conclusions
References
Chapter 11: Statistical and Computational Methods for Microbial Strain Analysis
1 Introduction
2 Resolving Strain Mixtures from Metagenomic Shotgun Sequencing Data
2.1 Strain Mixtures and Observed Frequency Matrix
2.2 Resolving Strain Mixtures Formulated as an Optimization Problem
3 Technical Considerations for Resolving Microbial Strain Mixtures
3.1 Reference-Based Versus Assembly-Based Approaches
3.2 Assembly-Graph-Based Approaches
3.3 Tracking Strains Across Multiple Samples
3.4 Selecting the Optimal Number of Strains in the Mixture
4 Computation Considerations
5 Discussion and Future Directions
References
Chapter 12: Statistics and Machine Learning in Mass Spectrometry-Based Metabolomics Analysis
1 Introduction
2 ADNI Dataset
3 Missing Values Imputation Approaches
3.1 Simple Imputation
3.2 kNN-Based Methods
3.3 Regression-Based Methods
3.4 Distribution-Based Sampling Methods
3.5 Imputation with Random Forest
4 Normalization Methods and Tools
4.1 Data-Driven Normalizations
4.2 Internal Standard-Based Normalizations
4.3 Quality Control Samples-Based Normalizations
4.3.1 Batch-Wise LOESS Normalization
4.3.2 SERRF Normalization
5 Statistical Models for Metabolomics and Clinical Covariates
5.1 Metabolic Biomarkers
5.2 Clustering of Metabolomes
5.3 Modules of Metabolites
5.4 Integrative Analysis of Metabolomics and Transcriptomics
6 Conclusion
References
Chapter 13: Statistical and Computational Methods for Proteogenomic Data Analysis
1 Introduction
1.1 Why Study Protein?
1.2 Design of Mass Spectrometry-Based Proteomic Experiments
2 Dataset
2.1 Large Data Repository
2.2 Dataset for Method Demonstration
3 Preprocessing and Quality Control
3.1 Global Normalization
3.2 Outlier Detection and Removal
3.3 Batch Effect Identification and Correction
3.4 Missing Data Imputation
4 Integrative Proteogenomic Analysis
4.1 Multi-Omic Association Analysis
4.2 Joint Network Construction
4.3 Clustering
4.4 Cell-Type Deconvolution
References
Chapter 14: Pharmacogenomic and Statistical Analysis
1 Introduction
2 Data Generation for Pharmacogenomic Discovery Analyses
2.1 Population-Based Study Designs
2.2 A Population-Based Study of Warfarin Sensitivity
2.3 Cell-Line-Based Studies
2.4 Cell-Line-Based Studies of Chemotherapeutic Toxicity and Response
2.5 Genotyping, Sequencing, and Quality Control (QC)
3 Statistical Analysis
3.1 Association Analysis for Common Variants
3.2 A Common Missense Variant Associates to Leukopenia Adverse Events in Thiopurine Treatment
3.3 Phasing, Imputation, and Haplotype Methods
3.4 Haplotype Effects in the VKORC1 Gene Associate to Warfarin Dose
3.5 Mixed Models in Pharmacogenomic Studies
3.6 Examples of Heritability Estimation for Pharmacogenomic Traits
3.7 Rare Variant Analysis for Pharmacogenomic Traits
3.8 Protein-Based Analyses of Rare Variants Influencing Lipid Traits
3.9 Gene Set and Enrichment Analyses
4 Translation of Pharmacogenomic Findings and Future Approaches
References
Chapter 15: Statistical Methods for Disease Risk Prediction with Genotype Data
1 Introduction
2 Quantifying Prediction Performance
2.1 Measurement of Prediction Performance for Continuous Phenotype
2.2 Measurement of Prediction Performance for Binary Phenotype
3 Prediction by the Polygenic Risk Score
3.1 Basic Form of the PRS
3.2 Variations of the PRS
3.2.1 LD-Pruning and P-Value Thresholding
3.2.2 LDpred
3.2.3 Influential Factors to Prediction Accuracy of the PRS
3.2.4 Real Data Example: Predicting Schizophrenia Risk by the PRS
3.2.5 Trans-Ancestry Application of the PRS
3.3 Prediction for Related Phenotypes by the PRS
3.4 Clinical Utility of the PRS
4 Prediction by the Linear Mixed Model (LMM)
4.1 The Linear Mixed Model Basics and Variations
4.2 Prediction in Non-European Populations by the LMM
4.3 Implementation Issues of the LMM and Solutions
5 Prediction by the Penalized Regression
6 Population Stratification Control in Disease Prediction
6.1 Detecting Population Stratification in Genotype Data
6.2 Controlling Population Stratification by the Principal Component Analysis (PCA)
7 Summary and Outlook
References
Chapter 16: Statistical Methods Inspired by Challenges in Pediatric Cancer Multi-omics
1 Introduction
2 T-ALL in Pediatric and Young Adults
3 Notation
4 Multiple-Testing Adjustments
5 Genomic Random Intervals
5.1 The Null Model of the GRIN Method
5.2 GRIN Constellation Tests
5.3 GRIN Analysis of T-ALL Data Set
6 Association of Lesions with Expression
6.1 Association of Lesions with Expression (ALEX)
6.2 ALEX Analysis of T-ALL Example
7 PROMISE
7.1 Background
7.2 Definition of PROMISE
7.3 Significance Determination of PROMISE
7.4 PROMISE Analysis of Gene Expression in T-ALL
8 Discussion
References
Index