Machine Learning for Protein Subcellular Localization Prediction

This book comprehensively covers protein subcellular localization prediction, from single-label to multi-label prediction, and includes prediction strategies for virus, plant, and eukaryote species. It introduces three machine learning tools to improve classification refinement, feature extraction, and dimensionality reduction.

Author(s): Shibiao Wan, Man-Wai Mak
Publisher: De Gruyter
Year: 2015

Language: English
Pages: 209
City: Berlin

Preface
Contents
List of Abbreviations
1 Introduction
1.1 Proteins and their subcellular locations
1.2 Why computationally predict protein subcellular localization?
1.2.1 Significance of the subcellular localization of proteins
1.2.2 Conventional wet-lab techniques
1.2.3 Computational prediction of protein subcellular localization
1.3 Organization of this book
2 Overview of subcellular localization prediction
2.1 Sequence-based methods
2.1.1 Composition-based methods
2.1.2 Sorting signal-based methods
2.1.3 Homology-based methods
2.2 Knowledge-based methods
2.2.1 GO-term extraction
2.2.2 GO-vector construction
2.3 Limitations of existing methods
2.3.1 Limitations of sequence-based methods
2.3.2 Limitations of knowledge-based methods
3 Legitimacy of using gene ontology information
3.1 Direct table lookup?
3.1.1 Table lookup procedure for single-label prediction
3.1.2 Table-lookup procedure for multi-label prediction
3.1.3 Problems of table lookup
3.2 Using only cellular component GO terms?
3.3 Equivalent to homologous transfer?
3.4 More reasons for using GO information
4 Single-location protein subcellular localization
4.1 Extracting GO from the Gene Ontology Annotation Database
4.1.1 Gene Ontology Annotation Database
4.1.2 Retrieval of GO terms
4.1.3 Construction of GO vectors
4.1.4 Multiclass SVM classification
4.2 FusionSVM: Fusion of gene ontology and homology-based features
4.2.1 InterProGOSVM: Extracting GO from InterProScan
4.2.2 PairProSVM: A homology-based method
4.2.3 Fusion of InterProGOSVM and PairProSVM
4.3 Summary
5 From single- to multi-location
5.1 Significance of multi-location proteins
5.2 Multi-label classification
5.2.1 Algorithm-adaptation methods
5.2.2 Problem transformation methods
5.2.3 Multi-label classification in bioinformatics
5.3 mGOASVM: A predictor for both single- and multi-location proteins
5.3.1 Feature extraction
5.3.2 Multi-label multiclass SVM classification
5.4 AD-SVM: An adaptive decision multi-label predictor
5.4.1 Multi-label SVM scoring
5.4.2 Adaptive decision for SVM (AD-SVM)
5.4.3 Analysis of AD-SVM
5.5 mPLR-Loc: A multi-label predictor based on penalized logistic regression
5.5.1 Single-label penalized logistic regression
5.5.2 Multi-label penalized logistic regression
5.5.3 Adaptive decision for LR (mPLR-Loc)
5.6 Summary
6 Mining deeper on GO for protein subcellular localization
6.1 Related work
6.2 SS-Loc: Using semantic similarity over GO
6.2.1 Semantic similarity measures
6.2.2 SS vector construction
6.3 HybridGO-Loc: Hybridizing GO frequency and semantic similarity features
6.3.1 Hybridization of two GO features
6.3.2 Multi-label multiclass SVM classification
6.4 Summary
7 Ensemble random projection for large-scale predictions
7.1 Random projection
7.2 RP-SVM: A multi-label classifier with ensemble random projection
7.2.1 Ensemble multi-label classifier
7.2.2 Multi-label classification
7.3 R3P-Loc: A compact predictor based on ridge regression and ensemble random projection
7.3.1 Limitation of using current databases
7.3.2 Creating compact databases
7.3.3 Single-label ridge regression
7.3.4 Multi-label ridge regression
7.4 Summary
8 Experimental setup
8.1 Prediction of single-label proteins
8.1.1 Datasets construction
8.1.2 Performance metrics
8.2 Prediction of multi-label proteins
8.2.1 Dataset construction
8.2.2 Datasets analysis
8.2.3 Performance metrics
8.3 Statistical evaluation methods
8.4 Summary
9 Results and analysis
9.1 Performance of GOASVM
9.1.1 Comparing GO vector construction methods
9.1.2 Performance of successive-search strategy
9.1.3 Comparing with methods based on other features
9.1.4 Comparing with state-of-the-art GO methods
9.1.5 GOASVM using old GOA databases
9.2 Performance of FusionSVM
9.2.1 Comparing GO vector construction and normalization methods
9.2.2 Performance of PairProSVM
9.2.3 Performance of FusionSVM
9.2.4 Effect of the fusion weights on the performance of FusionSVM
9.3 Performance of mGOASVM
9.3.1 Kernel selection and optimization
9.3.2 Term-frequency for mGOASVM
9.3.3 Multi-label properties for mGOASVM
9.3.4 Further analysis of mGOASVM
9.3.5 Comparing prediction results of novel proteins
9.4 Performance of AD-SVM
9.5 Performance of mPLR-Loc
9.5.1 Effect of adaptive decisions on mPLR-Loc
9.5.2 Effect of regularization on mPLR-Loc
9.6 Performance of HybridGO-Loc
9.6.1 Comparing different features
9.7 Performance of RP-SVM
9.7.1 Performance of ensemble random projection
9.7.2 Comparison with other dimension-reduction methods
9.7.3 Performance of single random-projection
9.7.4 Effect of dimensions and ensemble size
9.8 Performance of R3P-Loc
9.8.1 Performance on the compact databases
9.8.2 Effect of dimensions and ensemble size
9.8.3 Performance of ensemble random projection
9.9 Comprehensive comparison of proposed predictors
9.9.1 Comparison of benchmark datasets
9.9.2 Comparison of novel datasets
9.10 Summary
10 Properties of the proposed predictors
10.1 Noise data in the GOA Database
10.2 Analysis of single-label predictors
10.2.1 GOASVM vs FusionSVM
10.2.2 Can GOASVM be combined with PairProSVM?
10.3 Advantages of mGOASVM
10.3.1 GO-vector construction
10.3.2 GO subspace selection
10.3.3 Capability of handling multi-label problems
10.4 Analysis for HybridGO-Loc
10.4.1 Semantic similarity measures
10.4.2 GO-frequency features vs SS features
10.4.3 Bias analysis
10.5 Analysis for RP-SVM
10.5.1 Legitimacy of using RP
10.5.2 Ensemble random projection for robust performance
10.6 Comparing the proposed multi-label predictors
10.7 Summary
11 Conclusions and future directions
11.1 Conclusions
11.2 Future directions
A Webservers for protein subcellular localization
A.1 GOASVM webserver
A.2 mGOASVM webserver
A.3 HybridGO-Loc webserver
A.4 mPLR-Loc webserver
B Support vector machines
B.1 Binary SVM classification
B.2 One-vs-rest SVM classification
C Proof of no bias in LOOCV
D Derivatives for penalized logistic regression
Bibliography
Index