Big Data Analytics in Chemoinformatics and Bioinformatics: With Applications to Computer-Aided Drug Design, Cancer Biology, Emerging Pathogens and Computational Toxicology

Big Data Analytics in Chemoinformatics and Bioinformatics: With Applications to Computer-Aided Drug Design, Cancer Biology, Emerging Pathogens and Computational Toxicology provides an up-to-date presentation of big data analytics methods and their applications in diverse fields. The proper management of big data for decision-making in scientific and social issues is of paramount importance. This book gives researchers the tools they need to solve big data problems in these fields. It begins with a section on general topics that all readers will find useful and continues with specific sections covering a range of interdisciplinary applications.

Here, an international team of leading experts review their respective fields and present their latest research findings, with case studies used throughout to analyze and present key information.

Author(s): Subhash C. Basak, Marjan Vračko
Publisher: Elsevier
Year: 2022

Language: English
Pages: 483
City: Amsterdam

Big Data Analytics in Chemoinformatics and Bioinformatics
Preface
List of contributors
Copyright
Contents
1 Chemoinformatics and bioinformatics by discrete mathematics and numbers: an adventure from small data to the realm of eme...
1.1 Introduction
1.2 Chemobioinformatics—a confluence of disciplines?
1.2.1 Physical property: colligative versus constitutive
1.2.2 Early biochemical observations on the relationship between chemical structure and bioactivity of molecules
1.2.3 Linear free energy relationship: the multiparameter Hansch approach to quantitative structure–activity relationship
1.2.4 Chemical graph theory and quantum chemistry as the source of chemodescriptors
1.2.4.1 Topological indices—graph theoretic definitions and calculation methods
1.2.4.2 What do the topological indices represent about molecular structure?
1.3 Bioinformatics: quantitative informatics in the age of big biology
1.4 Major pillars of model building
1.5 Discussion
1.6 Conclusion
Acknowledgment
References
2 Robustness concerns in high-dimensional data analyses and potential solutions
2.1 Introduction
2.2 Sparse estimation in high-dimensional regression models
2.2.1 Starting of the era: the least absolute shrinkage and selection operator
2.2.2 Likelihood-based extensions of the LASSO
2.2.3 Search for a better penalty function
2.3 Robustness concerns for the penalized likelihood methods
2.4 Penalized M-estimation for robust high-dimensional analyses
2.5 Robust minimum divergence methods for high-dimensional regressions
2.5.1 The minimum penalized density power divergence estimator
2.5.2 Asymptotic properties of the MDPDE under high-dimensional GLMs
2.6 A real-life application: identifying important descriptors of amines for explaining their mutagenic activity
2.7 Concluding remarks
Appendix: A list of useful R-packages for high-dimensional data analysis
Acknowledgments
References
3 Fairness, explainability, privacy, and robustness for trustworthy algorithmic decision-making
3.1 Introduction
3.2 Fairness in machine learning
3.2.1 Fairness metrics and definitions
3.2.2 Bias mitigation in machine learning models
3.2.2.1 Preprocessing
3.2.2.2 In-processing
3.2.2.3 Postprocessing
3.2.3 Implementation
3.3 Explainable artificial intelligence
3.3.1 Formal objectives of explainable artificial intelligence
3.3.1.1 Why explain?
3.3.1.2 Terminologies
3.3.2 Taxonomy of methods
3.3.2.1 In-model versus post-model explanations
3.3.2.2 Global and local explanations
3.3.2.3 Causal explainability
3.3.3 Do explanations serve their purpose?
3.3.3.1 From explanation to understanding
3.3.3.2 Implementations and tools
3.4 Notions of algorithmic privacy
3.4.1 Preliminaries of differential privacy
3.4.2 Privacy-preserving methodology
3.4.2.1 Local sensitivity and other mechanisms
3.4.2.2 Algorithms with differential privacy guarantees
3.4.3 Generalizations, variants, and applications
3.4.3.1 Pufferfish
3.4.3.2 Other variations
3.4.3.3 Implementations
3.5 Robustness
3.5.1 Adversarial attacks
3.5.2 Defense mechanisms
3.5.2.1 Adversarial (re)training
3.5.2.2 Use of regularization
3.5.2.3 Certified defenses
3.5.3 Implementations
3.6 Discussion
References
4 How to integrate the “small and big” data into a complex adverse outcome pathway?
4.1 Introduction
4.2 State and review
4.3 Binding affinity to androgen nuclear receptor evaluated with respect to carcinogenic potency data
4.4 Conclusion and future directions
References
5 Big data and deep learning: extracting and revising chemical knowledge from data
5.1 Introduction
5.2 Basic methods in neural networks and deep learning
5.2.1 Neural networks
5.2.2 Neural network learning
5.2.3 Deep learning and multilayer neural networks
5.2.3.1 Convolutional neural network
5.2.3.2 Recurrent neural network
5.2.3.3 Graph convolutional neural networks
5.2.4 Attention mechanism
5.3 Neural networks for quantitative structure–activity relationship: input, output, and parameters
5.3.1 Input
5.3.2 Chemical graphs and their representation
5.3.2.1 SMILES as input
5.3.2.2 Images of two-dimensional structures as input
5.3.2.3 Chemical graphs as input
5.3.3 Output
5.3.4 Performance parameters
5.4 Deep learning models for mutagenicity prediction
5.4.1 Structure–activity relationship and quantitative structure–activity relationship models for Ames test
5.4.2 Deep learning models for Ames test
5.4.2.1 Learning from SMILES
5.4.2.2 Learning from images
5.4.2.3 Integrating features from SMILES and images
5.4.2.4 Learning from chemical graphs
5.5 Interpreting deep neural network models
5.5.1 Extracting substructures
5.5.2 Comparison of substrings with SARpy SAs
5.5.3 Comparison of substructures with Toxtree
5.6 Discussion and conclusions
5.6.1 A future for deep learning models
References
6 Retrosynthetic space modeled by big data descriptors
6.1 Introduction
6.2 Computer-assisted organic synthesis
6.2.1 Retrosynthetic space explored by molecular descriptors using big data sets
6.2.2 The exploration of chemical retrosynthetic space using retrosynthetic feasibility functions
6.3 Quantitative structure–activity relationship model
6.4 Dimensionality reduction using retrosynthetic analysis
6.5 Discussion
References
7 Approaching history of chemistry through big data on chemical reactions and compounds
7.1 Introduction
7.2 Computational history of chemistry
7.2.1 Data and tools
7.3 The expanding chemical space, a case study for computational history of chemistry
7.4 Conclusions
Acknowledgments
References
8 Combinatorial and quantum techniques for large data sets: hypercubes and halocarbons
8.1 Introduction
8.2 Combinatorial techniques for isomer enumerations to generate large datasets
8.2.1 Combinatorial techniques for large data structures
8.2.2 Möbius inversion
8.2.3 Combinatorial results
8.3 Quantum chemical techniques for large data sets
8.3.1 Computational techniques for halocarbons
8.3.2 Results and discussions of quantum computations and toxicity of halocarbons
8.4 Hypercubes and large datasets
8.5 Conclusion
References
9 Development of quantitative structure–activity relationship models based on electrophilicity index: a conceptual DFT-base...
9.1 Introduction
9.2 Theoretical background
9.3 Computational details
9.4 Methodology
9.5 Results and discussion
9.5.1 Tetrahymena pyriformis
9.5.2 Trypanosoma brucei
9.6 Conclusion
Acknowledgments
Conflict of interest
References
10 Pharmacophore-based virtual screening of large compound databases can aid “big data” problems in drug discovery
10.1 Introduction
10.2 Background of data analytics, machine learning, intelligent augmentation methods and applications in drug discovery
10.2.1 Applications of data analytics in drug discovery
10.2.2 Machine learning in drug discovery
10.2.3 Application of other computational approaches in drug discovery
10.2.4 Predictive drug discovery using molecular modeling
10.3 Pharmacophore modeling
10.3.1 Case studies
10.4 Concluding remarks
References
11 A new robust classifier to detect hot-spots and null-spots in protein–protein interface: validation of binding pocket an...
11.1 Introduction
11.2 Training and testing of the classifier
11.2.1 Variable selection using recursive feature elimination
11.2.2 Random forest performed best using both published and combined datasets
11.3 Technical details to develop novel protein–protein interaction hotspot prediction program
11.3.1 Training data
11.3.2 Building and validating a novel classifier by evaluating state-of-the-art feature selection and machine learning alg...
11.4 A case study
11.4.1 Identification of a druggable protein–protein interaction site between mutant p53 and its stabilizing chaperone DNAJ...
11.4.2 Building the homology model of DNAJA1 and optimizing the mutp53 (R175H) structure
11.4.3 Protein–protein docking
11.4.4 Small molecules inhibitors identification through drug-like library screening against the DNAJA1-mutp53R175H intera...
11.5 Discussion
Author contribution
Acknowledgment
Conflicts of interest
References
12 Mining big data in drug discovery—triaging and decision trees
12.1 Introduction
12.2 Big data in drug discovery
12.3 Triaging
12.4 Decision trees
12.5 Recursive partitioning
12.6 PhyloGenetic-like trees
12.7 Multidomain classification
12.8 Fuzzy trees and clustering
Acknowledgments
References
13 Use of proteomics data and proteomics-based biodescriptors in the estimation of bioactivity/toxicity of chemicals and na...
13.1 Introduction
13.2 Proteomics technologies and their toxicological applications
13.2.1 Two-dimensional gel electrophoresis
13.2.1.1 Information theoretic approach for the quantification of proteomics maps
13.2.1.2 Chemometric approach for the calculation of spectrum-like mathematical proteomics descriptors
13.2.2 Mass spectrometry-based proteomics technology and their applications in mathematical nanotoxicoproteomics
13.3 Discussion
Acknowledgment
References
14 Mapping interaction between big spaces; active space from protein structure and available chemical space
14.1 Introduction
14.2 Background
14.2.1 Navigating protein fold space
14.2.2 From amino acid string to dynamic structural fold
14.2.3 Elements for classification of protein
14.2.4 Available methods for classifying proteins
14.3 Protein topology for exploring structure space
14.3.1 Modularity in protein structure space
14.3.2 Data-driven approach to extract topological module
14.4 Scaffolds curve the functional and catalytic sites
14.4.1 Signature of catalytic site in protein structures
14.4.2 Protein function-based selection of topological space
14.4.3 Protein dynamics and transient sites
14.4.4 Learning methods for the prediction of proteins and functional sites
14.5 Protein interactive sites and designing of inhibitor
14.5.1 Interaction space exploration for energetically favorable binding features identification
14.5.2 Protein dynamics guided binding features selection
14.5.3 Protein flexibility and exploration of ligand recognition site
14.5.4 Artificial intelligence to understand the interactions of protein and chemical
14.6 Intrinsically unstructured regions and protein function
14.7 Conclusions
Acknowledgments
References
15 Artificial intelligence, big data and machine learning approaches in genome-wide SNP-based prediction for precision medi...
15.1 Introduction
15.2 Role of artificial intelligence and machine learning in medicine
15.3 Genome-wide SNP prediction
15.4 Artificial intelligence, precision medicine and drug discovery
15.5 Applications of artificial intelligence in disease prediction and analysis oncology
15.6 Cardiology
15.7 Neurology
15.8 Conclusion
Abbreviations
References
16 Applications of alignment-free sequence descriptors in the characterization of sequences in the age of big data: a case ...
16.1 Introduction
16.2 Section 1—bioinformatics today: problems now
16.2.1 What is bioinformatics and genomics?
16.2.2 Annotations
16.2.3 Evolution of sequencing methods
16.2.4 Alignment-free sequence descriptors
16.2.5 Metagenomics
16.2.6 Software development: scenario and challenges
16.2.7 Data formats
16.2.8 Storage and exchange
16.3 Section 2—bioinformatics today and tomorrow: sustainable solutions
16.3.1 The need for big data
16.3.1.1 Volume
16.3.1.2 Variety
16.3.2 Software and development
16.3.2.1 Support for huge volume
16.3.2.2 Optimal efficiency in storage
16.3.2.3 Good data recovery solution
16.3.2.4 Horizontal scaling
16.3.2.5 Cost effective
16.3.2.6 Ease of access and understanding
16.3.2.6.1 Why “Hadoop”?
16.3.2.6.2 What is Hadoop?
16.3.2.7 Overview of Hadoop distributed file system
16.3.2.8 Overview of MapReduce
16.3.2.9 Some problems with MapReduce
16.3.2.10 Apache Pig
16.3.2.11 Data formats
16.3.2.12 May I have some structured query language please?
16.3.2.13 Storage and exchange
16.3.2.14 Visualization
16.4 Summary
References
17 Scalable quantitative structure–activity relationship systems for predictive toxicology
17.1 Background
17.2 Scalability in quantitative structure–activity relationship modeling
17.2.1 Consequences of inability to scale
17.2.2 Expandability of the training dataset
17.2.3 Efficiency of data curation
17.2.4 Ability to handle stereochemistry
17.2.5 Ability to use proprietary training data
17.2.6 Ability to handle missing data
17.2.7 Ability to modify the descriptor set
17.2.8 Scaling expert rule-based systems
17.2.9 Scalability of adverse outcome pathway-based quantitative structure–activity relationship systems
17.2.10 Scalability of the supporting resources
17.2.11 Scalability of quantitative structure–activity relationships validation protocols
17.2.12 Scalability after deployment
17.2.13 Ability to use computer hardware resources effectively
17.3 Summary
References
18 From big data to complex network: a navigation through the maze of drug–target interaction
18.1 Introduction
18.2 Databases
18.2.1 Chemical databases
18.2.1.1 DrugBank
18.2.1.2 PubChem
18.2.1.3 ChEMBL
18.2.1.4 ChemSpider
18.2.2 Databases for targets
18.2.2.1 UniProt
18.2.2.2 Protein Data Bank
18.2.2.3 STRING
18.2.2.4 BindingDB
18.2.3 Databases for traditional Chinese medicine
18.2.3.1 Traditional Chinese medicine Database@Taiwan
18.2.3.2 Traditional Chinese medicine systems pharmacology
18.2.3.3 Traditional Chinese medicine integrated database
18.3 Prediction, construction, and analysis of drug–target network
18.3.1 Algorithms to predict drug–target interaction network
18.3.1.1 Machine learning-based methods
18.3.1.2 Similarity-based methods
18.3.2 Tools for network construction
18.3.2.1 Cytoscape
18.3.2.2 Pajek
18.3.2.3 Gephi
18.3.2.4 NetworkX
18.3.3 Network topological analysis
18.3.3.1 Degree distribution
18.3.3.2 Path and distance
18.3.3.3 Module and motifs
18.4 Conclusion and perspectives
Acknowledgments
References
19 Dissecting big RNA-Seq cancer data using machine learning to find disease-associated genes and the causal mechanism
19.1 Introduction
19.2 Bird’s eye view of the analysis of cancer RNA-Seq data using machine learning
19.3 Materials and methods
19.3.1 Preprocessing of the data
19.3.2 Feature selection
19.3.3 Classification learning
19.3.4 Extraction of disease-associated genes
19.3.5 Validation
19.4 Hand-in-hand walk with RNA-Seq data
19.4.1 Dataset selection
19.4.2 Data preprocessing
19.4.3 Feature selection
19.4.4 Classification model
19.4.5 Identification of the genes involved in disease progression
19.4.6 Significance of the identified deeply associated genes
19.5 Conclusion
References
Index