Biological Pattern Discovery With R: Machine Learning Approaches

This book sets out research directions for new or junior researchers who plan to use machine learning approaches for biological pattern discovery. It draws on the author's experience from several research projects carried out in collaboration with biologists worldwide. Each chapter addresses an individual biological pattern discovery problem. For each subject, the applicable research methodologies and machine learning algorithms are introduced and compared. Each chapter was also written to help readers move smoothly from theoretical knowledge to practical implementation, so the R programming environment is used for each subject. The author hopes that the book will inspire new or junior researchers' interest in biological pattern discovery using machine learning algorithms.

Author(s): Zheng Rong Yang
Publisher: World Scientific Publishing
Year: 2021

Language: English
Pages: 462
City: Singapore

Contents
Preface
1 Introduction
1.1 The responsive gene discovery problem
1.2 The peptide function discovery problem
1.3 The molecular interaction discovery problem
1.4 The spectral molecular discovery problem
1.5 The whole-genome pattern discovery problem
1.6 The global optimisation pattern discovery problem
1.7 The chapters
2 Responsive Gene Discovery
2.1 A biological question — essential gene discovery
2.2 Density estimation
2.2.1 The histogram approach
2.2.2 The parametric approach
2.2.3 The non-parametric approach
2.2.3.1 The kernel method
2.2.3.2 The K-nearest neighbour approach
2.2.4 The semi-parametric approach
2.2.4.1 The Gaussian mixture
2.2.4.2 Gamma mixture
2.2.5 The multivariate density estimation
2.3 Cluster analysis
2.3.1 The hierarchical cluster analysis algorithm
2.3.2 The K-means cluster analysis algorithm
2.3.3 The fuzzy C-means cluster analysis algorithm
2.3.4 The mixture model cluster analysis algorithm
2.3.5 The other clustering algorithms
2.4 The gene essentiality pattern discovery problem
2.4.1 The data
2.4.2 The properties of the transposon statistics
2.4.3 Gene essentiality pattern discovery using univariate models
2.4.4 Gene essentiality pattern discovery using multivariate models
2.4.4.1 The multi-statistics multivariate model
2.4.4.2 The multi-replicate multivariate model
Summary
3 Protease Cleavage Pattern Discovery
3.1 A biology question — protease cleavage
3.2 The linear discriminant analysis algorithm
3.2.1 The definition and working principle of LDA
3.2.2 The projection direction optimisation
3.2.3 The formulation of LDA
3.2.4 Making decision using the Bayes rule for a LDA model
3.2.5 The R function for LDA
3.3 The other analytic discriminant analysis algorithms
3.3.1 The quadratic discriminant analysis algorithm
3.3.2 The Naïve Bayes algorithm
3.3.3 The logistic regression algorithm
3.3.4 The Bayesian linear discriminant analysis
3.4 Evaluation and generalisation of a supervised machine learning model
3.4.1 Confusion matrix
3.4.2 Receiver operating characteristic analysis
3.4.3 Generalisation
3.5 Example
3.6 Nonlinear algorithms
3.6.1 Multi-layer perceptron
3.6.1.1 The structure of MLP
3.6.1.2 The learning mechanism of MLP
3.6.1.3 From SLP (LDA) to MLP
3.6.1.4 The R packages for MLP
3.6.2 Radial basis function neural network
3.6.3 The bio-basis function neural network algorithm
3.6.3.1 The bio-basis function neural network algorithm
3.6.3.2 The Bayesian BBFNN algorithm
3.6.3.3 The orthogonal kernel machine
3.6.4 The support vector machine algorithm
3.6.5 The relevance vector machine algorithm
3.6.6 Deep neural network
3.6.7 Inductive learning
3.6.7.1 The working principle of inductive learning
3.6.7.2 The purity measurements
3.6.7.3 The classification and regression tree algorithm
3.6.7.4 The C50 algorithm
3.6.7.5 Seeds classification
3.6.7.6 Factor Xa protease cleavage data classification
3.6.8 The random forest algorithm
Summary
4 Genetic-Epigenetic Interplay Discovery
4.1 A biological question — the genetic-epigenetic interplay pattern discovery problem
4.2 Regression analysis
4.3 The ordinary linear regression analysis algorithm
4.3.1 The least squared error approach
4.3.2 Assess the fitness of a regression model
4.3.3 The significance analysis of regression coefficients
4.3.4 The regression model confidence bands
4.3.5 R function for ordinary linear regression analysis
4.4 The generalised additive model algorithm
4.5 The Bayesian linear regression algorithm
4.6 The constrained regression analysis algorithms
4.6.1 The ridge linear regression algorithm
4.6.2 The Lasso linear regression algorithm
4.6.3 The elastic net linear regression algorithm
4.7 Ranking variables using the vip package
4.9 Epigenetic-genetic interplay pattern discovery
4.9.1 Methylation site to gene — the M2E models
4.9.2 Gene to methylation site association — E2M models
Summary
5 Spectral Pattern Discovery
5.1 A biology question
5.2 Introduction of baseline estimation approaches
5.3 The Whittaker-Henderson algorithm
5.4 The spline smoother
5.5 The adaptive iterative reweighted penalised least square smoother
5.6 The asymmetric least square smoother
5.7 The Bayesian Whittaker-Henderson algorithm
5.7.1 The working principle of BWH
5.7.2 The smoothing of the extracted peak spectrum
5.7.3 The generation of the merged and unique peaks
5.7.4 The fitness of a BWH model
5.7.5 Aligning peaks for replicated spectra
5.8 Analyse the milk spectra data
5.9 Analyse the bacterial and macrophage data
Summary
6 Gene Expression Pattern Discovery
6.1 Differentially expressed genes
6.1.1 The biological significance
6.1.2 The statistical significance
6.1.3 The Type I and Type II errors
6.2 Microarray gene expression analysis
6.2.1 The limma package
6.2.2 The visualisation of the discovered DEGs using the MA plot
6.2.3 The visualisation of the discovered DEGs using the volcano plot
6.2.4 How to discover DEGs using the limma package
6.3 DEG discovery for RNA-seq sequencing count data
6.3.1 Discover DEGs for sequencing count data using DESeq2
6.3.2 Discover DEGs for sequencing count data using edgeR
6.4 Discover differentially expressed genes when outliers are present
6.4.1 Example of heterogeneous gene expression
6.4.2 COPA
6.4.3 OS
6.4.4 ORT
6.4.5 MOST
6.4.6 LSOSS
6.4.7 DOG
6.4.8 Discover DEGs when outlier genes are present — simulated data
6.4.9 Discover heterogeneous DEGs for a cancer data set
6.5 Gene expression bimodality pattern discovery
6.5.1 The likelihood ratio test approach
6.5.2 The bimodality index test approach
6.5.3 The gap maximisation test approach
6.5.4 Simulated data analysis
6.5.5 Letrozole data analysis
6.6 Dual-scale Gaussian model for small replicate data DEG discovery
6.6.1 The dual-scale Gaussian model
6.6.1.1 The working principle of DSG
6.6.1.2 DSG for simulated data DEG discovery
6.6.2 A real data set study
Summary
7 Whole Genome Pattern Discovery
7.1 The SARS-CoV-2 pandemic
7.2 Sequence alignment
7.2.1 The issues of sequence alignment
7.2.1.1 The three evolution events
7.2.1.2 The alignment gap
7.2.1.3 The alignment strategy
7.2.1.4 The alignment statistic
7.2.2 The Sellers algorithm
7.2.2.1 The forward propagation stage
7.2.2.2 The backward propagation stage
7.2.3 The Needleman-Wunsch algorithm
7.2.3.1 The initialisation stage
7.2.3.2 The forward propagation stage
7.2.3.3 The backward propagation stage
7.2.3.4 The R library for the Needleman-Wunsch algorithm
7.2.4 The Smith-Waterman algorithm
7.2.4.1 The alignment metric and moving directions
7.2.4.2 The initialisation
7.2.4.3 The forward propagation
7.2.4.4 The backward propagation stage
7.2.4.5 The R library for the Smith-Waterman algorithm
7.3 Alignment-based multiple sequence comparison
7.4 Alignment-free multiple sequence comparison
7.4.1 The k-mers approach
7.4.2 The alignment-based approach versus the alignment-free approach for sequence comparison
7.4.2.1 The speed comparison
7.4.2.2 The accuracy comparison
7.4.2.3 The pattern discovery power
7.5 K-mer machine
7.6 Whole genome pattern discovery for SARS-CoV-2
7.6.1 Genomics distribution of sequences
7.6.2 Discrimination between countries based on genomics pattern
7.6.3 Genomics pattern evolving with time
Summary
8 Optimised Peptide Pattern Discovery
8.1 A biological question — protease cleavage pattern discovery
8.2 Introduction
8.3 Genetic programming
8.3.1 The genetic algorithm
8.3.2 The genetic programming algorithm
8.3.2.1 The reverse Polish notation
8.3.2.2 The GP breeding rules
8.3.2.3 Mutation
8.3.2.4 The dual-chromosome crossover
8.3.2.5 Single-chromosome crossover
8.3.2.6 The training of a GP model
8.4 Factor Xa protease residue interplay
Summary
9 Advanced Subjects
9.1 Neural networks and deep learning
9.2 Optimisation with evolutionary computation
9.3 Quantum computing for biological pattern analysis
9.4 Next-generation sequencing data quality
9.5 SARS-CoV-2 protease cleavage pattern discovery
References
Index