Abstract: This book focuses on proteins and their structures. It describes scalable solutions for protein structure similarity searching, carried out at the main representation levels, and for the prediction of 3D protein structures. Emphasis is placed on techniques that accelerate similarity searches and protein structure modeling. The book is divided into four parts. The first part provides background on proteins and their representation levels, including a formal model of a 3D protein structure used in computational processes, and a brief overview of the technologies used in the solutions presented in the book. The second part discusses the Cloud services used to develop scalable and reliable cloud applications for 3D protein structure similarity searching and protein structure prediction. The third part shows how scalable Big Data computational frameworks, such as Hadoop and Spark, are used for massive 3D protein structure alignments and for identifying intrinsically disordered regions in protein structures. The fourth part focuses on finding 3D protein structure similarities with GPU acceleration, and on using multithreading and relational databases for efficient approximate searching on protein secondary structures. The book introduces advanced techniques and computational architectures that benefit from recent achievements in computing and parallelism. Recent developments in computer science allow algorithms previously considered too time-consuming to be used efficiently in bioinformatics and the life sciences. Given its depth of coverage, the book will be of interest to researchers and software developers working in structural bioinformatics and biomedical databases.
Content: Intro
Foreword
Preface
Scope of the Book
Chapter Overview
Summary
Acknowledgements
Contents
Acronyms
Part I Background
1 Formal Model of 3D Protein Structures for Functional Genomics, Comparative Bioinformatics, and Molecular Modeling
1.1 Introduction
1.2 General Definition of Protein Spatial Structure
1.3 A Reference to Representation Levels
1.3.1 Primary Structure
1.3.2 Secondary Structure
1.3.3 Tertiary Structure
1.3.4 Quaternary Structure
1.4 Relative Coordinates of Protein Structures
1.5 Energy Properties of Protein Structures
1.6 Summary
References
2 Technological Roadmap
2.1 Cloud Computing
2.1.1 Cloud Service Models
2.1.2 Cloud Deployment Models
2.2 Big Data Challenge
2.2.1 The 5V Model of Big Data
2.2.2 Hadoop Platform
2.3 Multi-threading and Multi-threaded Applications
2.4 Graphics Processing Units and the CUDA
2.4.1 Graphics Processing Units
2.4.2 CUDA Architecture and Threads
2.5 Relational Databases and SQL
2.5.1 Relational Database Management Systems
2.5.2 SQL For Manipulating Relational Data
2.6 Scalability
2.7 Summary
References
Part II Cloud Services for Scalable Computations
3 Azure Cloud Services
3.1 Microsoft Azure
3.2 Virtual Machines, Series, and Sizes
3.3 Cloud Services in Action
3.4 Summary
References
4 Scaling 3D Protein Structure Similarity Searching with Azure Cloud Services
4.1 Introduction
4.1.1 Why We Need Cloud Computing in Protein Structure Similarity Searching
4.1.2 Algorithms for Protein Structure Similarity Searching
4.1.3 Other Cloud-Based Solutions for Bioinformatics
4.2 Cloud4PSi for 3D Protein Structure Alignment
4.2.1 Use Case: Interaction with the Cloud4PSi
4.2.2 Architecture and Processing Model of the Cloud4PSi
4.2.3 Scaling Cloud4PSi
4.3 Scalability of the Cloud4PSi
4.3.1 Horizontal Scalability
4.3.2 Vertical Scalability
4.3.3 Influence of the Package Size
4.3.4 Scaling Up or Scaling Out?
4.4 Discussion
4.5 Summary
References
5 Cloud Services for Efficient Ab Initio Predictions of 3D Protein Structures
5.1 Introduction
5.1.1 Computational Approaches for 3D Protein Structure Prediction
5.1.2 Cloud and Grid Computing in Protein Structure Determination
5.2 Cloud4PSP for 3D Protein Structure Prediction
5.2.1 Prediction Method
5.2.2 Cloud4PSP Architecture
5.2.3 Cloud4PSP Processing Model
5.2.4 Extending Cloud4PSP
5.2.5 Scaling the Cloud4PSP
5.3 Performance of the Cloud4PSP
5.3.1 Vertical Scalability
5.3.2 Horizontal Scalability
5.3.3 Influence of the Task Size
5.3.4 Scale Up, Scale Out, or Combine?
5.4 Discussion
5.5 Summary
5.6 Availability
References
Part III Big Data Analytics in Protein Bioinformatics
6 Foundations of the Hadoop Ecosystem
6.1 Big Data
6.2 Hadoop
6.2.1 Hadoop Distributed File System
6.2.2 MapReduce Processing Model
6.2.3 MapReduce 1.0 (MRv1)
6.2.4 MapReduce 2.0 (MRv2)
6.3 Apache Spark
6.4 Hadoop Ecosystem
6.5 Summary
References