Machine Learning in Bioinformatics of Protein Sequences guides readers through the rapidly advancing world of cutting-edge machine learning applications in protein bioinformatics. Edited by bioinformatics expert Dr Lukasz Kurgan, with contributions from a dozen accomplished researchers, this book provides a holistic view of structural bioinformatics by covering a broad spectrum of algorithms, databases, and software resources for the efficient and accurate prediction and characterization of functional and structural aspects of proteins. It spotlights key advances, including deep neural networks and natural language processing-based sequence embedding, and covers a wide range of predictions: tertiary structure, secondary structure, residue contacts, intrinsic disorder, protein-, peptide-, and nucleic acid-binding sites, hotspots, post-translational modification sites, and protein function. This volume is loaded with practical information that identifies and describes leading predictive tools, useful databases, webservers, and modern software platforms for the development of novel predictive tools.
Author(s): Łukasz Kurgan
Publisher: World Scientific Publishing
Year: 2022
Language: English
Pages: 377
City: Singapore
Contents
Preface
Acknowledgments
About the Editor
Part I Machine Learning Algorithms
Chapter 1 Deep Learning Techniques for De novo Protein Structure Prediction
1. Introduction
2. Architectures of deep neural networks
2.1 Convolutional neural networks
2.2 Recurrent neural networks
2.3 Attention-based neural networks
3. Self-supervised protein sequence representation
3.1 Single-sequence-based protein sequence representation
3.2 MSA-based protein sequence representation
4. Secondary structure prediction
4.1 Neural networks used for local structure prediction
4.2 State-of-the-art SSP approaches benefit from larger data, deeper networks, and better evolutionary features
5. Contact map prediction
5.1 Neural networks used for contact map prediction
5.2 Novel strategies used in state-of-the-art CMP approaches
6. End-to-end tertiary structure prediction
7. Conclusions
References
Part II Inputs for Machine Learning Models
Chapter 2 Application of Sequence Embedding in Protein Sequence-Based Predictions
1. Introduction
2. A brief overview of language models and embeddings in Natural Language Processing
3. Protein databases facilitating language modeling
4. Adapting language models for protein sequences
4.1 ProtVec (word2vec)
4.2 UDSMProt (AWD-LSTM)
4.3 UniRep (mLSTM)
4.4 SeqVec (ELMo)
4.5 ESM-1b (Transformer)
4.6 ProtTrans (BERT)
5. Conclusions
6. Acknowledgement
References
Chapter 3 Applications of Natural Language Processing Techniques in Protein Structure and Function Prediction
1. Introduction
2. Methods for protein sequence analysis
3. Computational prediction of protein structures
3.1 Protein fold recognition
3.2 Intrinsically disorder regions/proteins identification
4. Computational prediction of protein functions
4.1 Prediction of functions of intrinsically disorder regions
4.2 Protein-nucleic acids binding prediction
4.2.1 Nucleic acid binding protein prediction
4.2.2 Nucleic acid binding residue prediction
5. Biological language models (BLM)
6. Summary and recommendations
7. Acknowledgments
References
Chapter 4 NLP-based Encoding Techniques for Prediction of Post-translational Modification Sites and Protein Functions
1. Introduction
2. NLP-based encoding techniques for protein sequence
2.1 Tokenization
2.2 Local/sparse representation of protein sequences
2.3 Distributed representation of protein sequences
2.3.1 Word embedding in proteins
2.3.2 Context-independent word embedding for protein sequence
2.3.3 Contextual word embedding for protein sequence
2.4 Variety of databases for generating pre-trained language models
3. Methods using NLP-based encoding for PTM prediction: local-level task
3.1 Local/sparse representation-based methods for PTM prediction
3.2 PTM site prediction approaches using distributed representation
3.2.1 Context-independent supervised word embedding-based PTM site-prediction approaches (aka supervised embedding layer)
3.2.2 Context-independent pre-trained word embedding-based approaches for PTM prediction
3.2.3 Methods using both context independent supervised word embedding + context independent pre-trained model
3.2.4 Methods using pre-trained contextual embedding language model (BERT-based)
4. Methods using NLP-based encoding for GO-based protein function prediction: Global-level task
4.1 Local/sparse representation-based methods GO-based protein function prediction
4.2 Distributed representation-based methods GO-based protein function prediction
4.2.1 Context-independent supervised word embedding-based GO prediction approaches (aka supervised embedding layer)
4.2.2 Context-independent pre-trained word embedding-based approaches for the GO prediction
4.2.3 Contextual word embedding-based protein GO prediction
4.2.4 Other approaches
5. Conclusion and discussion
References
Chapter 5 Feature-Engineering from Protein Sequences to Predict Interaction Sites Using Machine Learning
1. Introduction
2. Data labeling for the prediction of interaction sites
3. Featurization of protein sequences
4. Direct protein sequence features
4.1 One-hot/sparse encoding
4.2 Amino acid indices
4.3 Physicochemical property-based encoding
4.4 Global protein sequence features
4.5 Window size
5. Derived sequence features
6. Predicted features
7. Summary and conclusions
References
Part III Predictors of Protein Structure and Function
Chapter 6 Machine Learning Methods for Predicting Protein Contacts
1. Introduction
2. Residue contact definitions
2.1 Contact maps
3. Machine learning algorithms in contact prediction methods
3.1 Hidden Markov Models
3.2 Support Vector Machines
3.3 Random Forest Algorithms
3.4 Naïve Bayes Classifiers
3.5 Neural Networks
3.5.1 Deep neural networks
3.5.1.1 Residual convolutional neural networks
3.5.1.2 Recurrent neural network
3.5.1.3 End-to-end learning models
4. Conclusions
5. Acknowledgments
References
Chapter 7 Machine Learning for Protein Inter-Residue Interaction Prediction
1. Introduction
2. Computational methods for protein inter-residue interaction prediction
2.1 Definition of geometry terms for inter-residue interactions
2.2 Unsupervised methods for contact map prediction
2.3 Supervised methods for contact map prediction
3. Application of protein inter-residue interaction prediction
4. Discussion
5. Acknowledgment
References
Chapter 8 Machine Learning for Intrinsic Disorder Prediction
1. Introduction
2. Overview of disorder predictors
3. Disorder prediction using machine learning
4. Selected machine learning-based disorder predictors
4.1 PrDOS
4.2 MFDp
4.3 DISOPRED3
4.4 AUCpred
4.5 SPOT-Disorder2
4.6 flDPnn
5. Related resources
6. Summary
7. Funding
References
Chapter 9 Sequence-Based Predictions of Residues that Bind Proteins and Peptides
1. Introduction
2. Commonly used biological databases
2.1 Protein Data Bank (PDB)
2.2 BioLiP
3. Biological characteristics and sequence-based representation of proteins
3.1 Protein sequence-derived information
3.1.1 One-hot encoding
3.1.2 Pre-trained embedding
3.2 Evolutionary information
3.3 Predicted structural features of proteins
3.4 Amino acid physicochemical characteristics
3.5 Other features relevant to the prediction of PPI and PPepI sites
4. Performance evaluation
4.1 Experimental data preparation
4.2 Validation scheme
4.2.1 Hold-out validation
4.2.2 K-fold CV
4.2.3 Leave-one-out CV
4.3 Evaluation metrics
5. Computational methods for protein-protein interaction (PPI) site prediction
5.1 PSIVER
5.2 LORIS
5.3 DLPred
5.4 SCRIBER
5.5 DELPHI
6. Computational methods for protein-peptide interaction (PPepI) site prediction
6.1 SPRINT
6.2 PepBind
6.3 Visual
6.4 MTDsite
7. Summary
8. Acknowledgment
References
Chapter 10 Machine Learning Methods for Predicting Protein-Nucleic Acids Interactions
1. Introduction
2. Prediction of the protein-nucleic acid binding residues from sequence
2.1 Overview of the sequence-based predictors
2.2 Architectures of the sequence-based predictors
3. Summary
References
Chapter 11 Identification of Cancer Hotspot Residues and Driver Mutations Using Machine Learning
1. Introduction
2. Experimental and computational studies on cancer mutations
3. Databases of the cancer-causing mutations
4. Identification of hotspot residues
5. Methods for predicting disease-causing mutations
6. Machine learning techniques for predicting cancer-causing mutations
7. Large-scale annotation of cancer-causing mutations
8. Conclusions
9. Acknowledgements
References
Part IV Practical Resources
Chapter 12 Designing Effective Predictors of Protein Post-Translational Modifications Using iLearnPlus
1. Introduction
2. Brief review of computational PTM site prediction
3. Design of novel predictive methods using iLearnPlus
3.1 iLearnPlus
3.2 Data collection and preprocessing
3.3 Model construction and performance evaluation
3.4 Comparison with other ML algorithms
4. Summary
References
Chapter 13 Databases of Protein Structure and Function Predictions at the Amino Acid Level
1. Introduction
2. Databases of the AA-level predictions
2.1 MobiDB
2.2 D2P2
2.3 DescribePROT
2.4 Example results
3. Conclusions, impact and limitations
4. Funding
References
Index