All About Bioinformatics: From Beginner to Expert

All About Bioinformatics: From Beginner to Expert provides readers with an overview of the fundamentals of and advances in the field of bioinformatics, as well as some future directions. Each chapter is didactically organized and includes an introduction, applications, tools, and future directions to cover its topic thoroughly. The book covers both traditional topics, such as biological databases, algorithms, genetic variations, statistical methods, and structural bioinformatics, and contemporary advanced topics, such as high-throughput technologies, drug informatics, systems and network biology, and machine learning. It is a valuable resource for researchers and graduate students who want to learn more about bioinformatics and apply it in their research.

A machine learning algorithm is a statistical computation method used in software to detect hidden patterns that are not obvious in a dataset and to make statistical predictions about similar new data. Machine learning techniques attempt to find a pattern in a particular dataset; the learned patterns are then used to identify similar patterns in a new dataset. Machine learning processes are closely related to statistical modeling and data collection: they examine the data for trends and learn the patterns or parameters accordingly. Most people encounter machine learning through social media connection suggestions, online shopping product recommendations, and email spam filters.

Recent advances in the life and computer sciences have permitted the use of computational techniques to address critical problems. Python is a free, general-purpose programming language that is simple to learn and to apply to a wide range of computational tasks. It can be used to handle a variety of difficulties that research labs confront on a daily basis.
Using computers and an appropriate programming language, it is possible to efficiently execute tasks such as data manipulation, biological data retrieval and parsing, automation, and simulation of biological problems. The goal of the chapter on Python is to give a high-level look at the language, showing how it works and what it can do. The main data structures and flow control statements are shown. Following these fundamental concepts, topics such as file access, functions, and modules are addressed in greater depth. The chapter introduces the reader to Anaconda, a software distribution that includes libraries, packages, and editors (such as Jupyter notebooks) essential for Python development. The datatypes, the operators regularly used with each datatype, and a range of helpful built-in data structures, such as strings, lists, tuples, and dictionaries, are introduced. Iteration is described using while and for loops. Finally, objects and modules are introduced, and their use in data processing is described. At the end, there are references to more advanced Python topics.

- Presents a holistic learning experience, from an introduction to bioinformatics to recent advancements in the field
- Discusses bioinformatics as a practice rather than in theory, focusing on application-oriented topics such as high-throughput technologies, systems and network biology, and workflow management systems
- Encompasses chapters on statistics and machine learning to assist readers in deciphering trends and patterns in biological data
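
As a rough illustration of the Python building blocks the description lists (strings, lists, tuples, dictionaries, and while and for loops), here is a minimal sketch. It is not taken from the book. The toy DNA sequence and the helper names gc_content, count_bases, and first_stop_codon are hypothetical, chosen only for this example.

```python
# Illustrative sketch only: the sequence and all function names are
# assumptions made for this example, not code from the book.

def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA string."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")  # string method
    return gc / len(sequence)


def count_bases(sequence):
    """Count each base using a dictionary and a for loop."""
    counts = {}
    for base in sequence.upper():
        counts[base] = counts.get(base, 0) + 1
    return counts


def first_stop_codon(sequence):
    """Scan codon by codon with a while loop.

    Returns an (index, codon) tuple for the first in-frame stop codon,
    or None if there is none.
    """
    stop_codons = ("TAA", "TAG", "TGA")  # tuple: a fixed collection
    i = 0
    while i + 3 <= len(sequence):
        codon = sequence[i:i + 3]  # string slicing
        if codon in stop_codons:
            return (i, codon)
        i += 3
    return None


dna = "ATGGCCTGATAA"  # hypothetical toy sequence

# List of codons built by slicing the string in steps of three.
codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]

print(codons)
print(count_bases(dna))
print(round(gc_content(dna), 3))
print(first_stop_codon(dna))
```

Only the standard library is used, so the sketch should run in any Python 3 interpreter, including a Jupyter notebook from the Anaconda distribution mentioned above.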

Author(s): Yasha Hasija
Publisher: Academic Press
Year: 2023

Language: English
Pages: 313
City: London

Front Cover
All About Bioinformatics
All About Bioinformatics: From Beginner to Expert
Copyright
Contents
1 - What is bioinformatics?
1.1 Introduction
1.2 History
1.3 Biological databases
1.4 Algorithms in computational biology
1.5 Genetic variation and bioinformatics
1.6 Structural bioinformatics
1.7 High-throughput technology
1.8 Drug informatics
1.9 System and network biology
1.10 Machine learning in bioinformatics
1.11 Bioinformatics workflow management systems
1.12 Application of bioinformatics
References
2 - Introduction to biological databases
2.1 Introduction
2.1.1 Characteristics of biological data
2.2 Types of databases
2.2.1 Primary database
2.2.2 Secondary database
2.2.3 Composite database
2.3 Models of databases
2.3.1 Flat file
2.3.2 Hierarchical model
2.3.3 Network model
2.3.4 Entity relationship model
2.3.5 Relational database model
2.3.6 Other models
2.4 Primary nucleic acid databases
2.4.1 EMBL
2.4.2 GenBank
2.4.3 DDBJ
2.5 Primary protein databases
2.5.1 PDB
2.5.2 SWISS-PROT
2.6 Secondary protein databases
2.6.1 CATH
2.6.2 SCOP
2.6.3 PROSITE
2.7 Composite sequence databases
2.7.1 Meta-databases
2.8 Genomics and proteomics databases
2.8.1 The search engines for literature
2.9 Miscellaneous databases
2.9.1 Humans
2.9.2 Animals
2.9.3 Fungi
2.9.4 Microorganisms
2.9.5 Plant and crop genomic database
2.9.6 Organelle database
2.9.7 Pathway databases
References
3 - Statistical methods in bioinformatics
3.1 Introduction
3.2 Statistics at the interface of bioinformatics
3.3 Measures of central tendency
3.3.1 Mean
3.3.2 Median
3.3.2.1 Median for the grouped data
3.3.2.2 Median for the ungrouped data
3.3.3 Mode
3.3.4 Percentiles, quartiles and interquartile range
3.4 Skewness and kurtosis
3.5 Variability and its measures
3.5.1 Variance
3.5.2 Standard deviation
3.5.3 Standard error
3.5.4 Coefficient of variation
3.5.4.1 Comprehending the source of variability for analysis
3.6 Different types of distributions and their significance
3.6.1 Probability distributions
3.6.2 Continuous probability function
3.6.2.1 Normal distribution
3.6.2.2 Continuous uniform distribution
3.6.2.3 Log-normal distribution
3.6.2.4 Exponential distribution
3.6.3 Discrete probability function
3.6.3.1 Binomial distribution
3.6.3.2 Bernoulli's distribution
3.6.3.3 Poisson distribution
3.6.4 Normal distribution and normal curve
3.6.5 Normal curve
3.6.6 Asymmetrical distribution
3.7 Sampling
3.8 Probability
3.8.1 Laws of probability
3.8.1.1 Addition law of probability
3.8.1.2 Multiplication law of probability
3.8.1.3 Binomial law of probability distribution
3.8.1.4 Probability (chances) from shape of normal distribution or normal curve
3.8.1.5 Probability of calculated values from tables
3.9 Comparing the means of two or more data variables or groups
3.9.1 Independent samples t-test
3.9.2 One sample t-test
3.9.3 Paired samples t-test
3.9.4 ANOVA
3.9.5 The Chi-square tests
3.9.6 Test of independence
3.9.7 Test of goodness of fit
3.9.8 Correlation and regression
3.9.9 A look into correlation and regression
3.10 Platforms employed for statistical analysis
3.10.1 Downstream analysis and visualization
3.11 Gene ontology & pathway analysis
3.11.1 Singular enrichment analysis (SEA)
3.11.2 Gene set enrichment analysis (GSEA)
3.11.3 Modular enrichment analysis (MEA)
3.11.4 Correlation networks
3.12 Future prospects and conclusion
References
4 - Algorithms in computational biology
4.1 Sequence alignment
4.1.1 Local alignment
4.1.2 Global alignment
4.1.3 Gap penalty
4.1.3.1 Gaps and gap penalties
4.2 Pair-wise alignment
4.3 Dot-matrix method
4.4 Dynamic programming
4.4.1 Needleman-Wunsch
4.4.1.1 Step 1: Initialization table “T”
4.4.1.2 Step 2: Filling the matrix
4.4.1.3 Step 3: Traceback
4.4.2 Smith Waterman algorithm
4.4.2.1 Limitations
4.5 Scoring matrices
4.5.1 Scoring matrices for amino acids
4.5.2 PAM (point accepted mutation)
4.5.2.1 PAM score
4.5.2.2 Example
4.5.3 BLOcks SUbstitution matrix (BLOSUM)
4.5.3.1 BLOSUM62
4.5.3.2 BLOSUM score
4.6 Word methods
4.7 Multiple sequence alignment
4.7.1 Progressive alignment
4.7.2 Iterative method
4.7.3 MSA filtering
4.7.4 Filtering techniques' fundamental principles
4.7.5 Programs and methods for multiple sequence alignment
4.7.5.1 Clustal family
4.7.5.2 DIALIGN
4.7.5.3 Tree-based consistency objective function for alignment evaluation (T-coffee)
4.7.5.4 FAlign
4.7.6 Representation and structural inference
4.8 Phylogenetics
4.8.1 Molecular phylogenetics
4.8.2 Phylogenetic trees
4.8.3 Properties
4.8.3.1 Phylogenetic trees and networks
4.8.4 Building methods
4.8.5 Distance matrix method
4.8.5.1 UPGMA
4.8.5.2 Neighbor-joining
4.8.5.3 Fitch Margoliash method
4.8.5.4 Maximum parsimony
4.8.5.5 Maximum likelihood
4.8.6 Bayesian inference
References
5 - Genetic variations
5.1 Introduction
5.2 Types of variations
5.3 Effects of genetic variation
5.4 Biological database
5.4.1 Database of human genetic variation
5.4.2 Predicting the clinical significance of human genetic variation
5.5 Phenotype-genotype association
5.6 Pharmacogenomics
5.6.1 Drug receptors
5.6.2 Drug uptake
5.6.3 Drug breakdown
5.7 Pharmacogenomics and targeted drug development
5.7.1 Personalized medicine
5.7.2 Personalized medicine drivers
5.7.2.1 Human genome sequencing has been completed
5.7.2.2 Molecular characterization of disease
5.7.2.3 Search for biomarkers of drug response
5.7.3 Future aspects of pharmacogenomics in personalized medicine
5.8 Computational biology methods for decision support in personalized medicine
5.8.1 Pharmacogenomics information
5.8.1.1 Pharmacogenomics knowledgebase (PharmGKB)
5.8.1.2 DrugBank
5.8.1.3 CPIC
References
6 - Structural bioinformatics
6.1 Introduction
6.2 Viewing protein structures
6.3 Alignment of protein structures
6.4 Structural prediction
6.4.1 Use of sequence patterns for protein structure prediction
6.4.2 Prediction of protein secondary structure from the amino acid sequence
6.4.3 Chou Fasman method
6.4.4 GOR method
6.4.5 Prediction of three-dimensional protein structure
6.4.6 Evaluating the success of structure predictions
References
7 - High throughput technology
7.1 Omics theory
7.2 High-throughput technologies
7.3 Genomics
7.3.1 What is DNA?
7.3.2 DNA microarray
7.3.2.1 Application of DNA microarray
7.3.3 DNA sequencing
7.3.3.1 Clone-by-clone process
7.3.3.2 Whole-genome shotgun process
7.3.3.3 Assembly of sequencing reads
7.3.4 Whole exome sequencing (WES)
7.3.5 Single cell DNA-SEQ (sc-DNA-seq)
7.4 Epigenomics
7.4.1 ChIP-seq
7.4.2 Whole-genome shotgun bisulfite sequencing (WGSBS)
7.5 Transcriptomics
7.5.1 RNA-seq
7.6 Proteomics
7.6.1 Reverse phase protein microarrays (RPPA)
7.7 Metabolomics
7.7.1 Different methods for studying metabolomics
References
8 - Drug informatics
8.1 Introduction
8.2 Computational drug designing and discovery
8.3 Structure based drug designing
8.3.1 Homology modeling
8.3.2 Molecular docking
8.3.2.1 Sampling algorithm
8.3.2.2 Scoring functions
8.3.3 Molecular simulation
8.4 Ligand-based drug designing
8.4.1 Pharmacophore modeling
8.5 ADMET
8.5.1 Absorption
8.5.2 Distribution
8.5.3 Metabolism
8.5.4 Excretion
8.5.5 Toxicity
8.6 Drug repurposing
References
9 - A machine learning approach to bioinformatics
9.1 Introduction to machine learning
9.2 Types of machine learning systems
9.2.1 Supervised learning
9.2.2 The below are the most commonly used supervised algorithms
9.2.2.1 Linear regression
9.2.3 Logistic regression
9.2.4 K-nearest neighbor
9.2.5 Decision trees
9.2.6 Support vector machines
9.2.6.1 Kernel trick
9.2.7 Neural networks
9.2.8 Neural networks architecture
9.2.9 Convolutional neural network
9.2.10 Unsupervised learning
9.2.11 K-means clustering
9.2.12 Reinforcement learning
9.3 Evaluation of machine learning models
9.3.1 Accuracy
9.3.2 Receiver Operating Characteristic (ROC) curve
9.3.3 Cross-validation
9.3.4 Testing and validating
9.4 Optimization of models
9.4.1 Parameter searching
9.4.2 Ensemble methods
9.5 Main challenges of machine learning
9.5.1 Insufficient quantity of training data
9.5.2 Non-representative training data
9.5.3 Quality of data
9.5.4 Irrelevant features
9.5.5 Overfitting or underfitting on training data
References
10 - Systems and network biology
10.1 Introduction
10.2 Network theory
10.3 Graph theory
10.4 Features of biological networks
10.4.1 The various types of network edges
10.4.2 Network measures
10.4.3 Network models
10.5 Types of biological networks
10.5.1 Cell signaling networks
10.5.2 Gene/transcription regulation networks
10.5.3 Genetic interaction networks
10.5.4 Metabolic networks
10.5.5 Protein–protein interaction networks
10.6 Sources of data for biological networks
10.7 Gene ontology for network analysis
10.8 Analysis of biological networks and interactomes
10.9 Interaction network construction using a gene list
10.10 Data analysis tools
10.10.1 The InnateDB
10.10.2 Visualization and download of networks
10.10.3 Enrichr
10.10.3.1 PANTHER
10.10.3.2 GSEA
10.10.3.3 DAVID
10.10.3.4 Babelomics 5
10.11 Network visualization tools
10.11.1 Cytoscape
10.11.2 NAViGaTOR
10.11.3 VisANT
10.11.4 CellDesigner
10.11.5 Pathway Studio
10.11.6 Gephi
10.12 Important properties to be inferred from networks
10.12.1 Hubs
10.12.2 Bottlenecks
10.12.3 Modules
10.12.4 Bioinformatics tools to detect modules, bottlenecks and hubs
References
11 - Bioinformatics workflow management systems
11.1 Introduction to workflow management systems
11.2 Galaxy
11.3 GenePattern
11.4 KNIME: The Konstanz information miner
11.5 LINCS tools
11.5.1 The program's overall goal
11.5.2 Test performed under LINCS
11.6 Anduril bioinformatics and image analysis
11.6.1 Anduril image analysis: ANIMA
11.7 Nextflow
References
12 - Data handling using Python
12.1 Introduction
12.2 Datatypes and operators
12.2.1 Datatypes
12.2.2 Operators
12.3 Variables
12.4 Strings
12.4.1 String indexing
12.4.2 Operations on strings
12.4.3 Methods in strings
12.5 Python lists and tuples
12.5.1 Accessing values in list
12.5.2 Methods with lists
12.5.3 Tuples
12.6 Dictionary in Python
12.7 Conditional statements
12.7.1 Logical operators
12.7.2 If and else statements
12.8 Loops in Python
12.8.1 While loop
12.8.2 “For” loop
12.8.3 Breaking a loop
12.9 File handling in Python
12.9.1 Specify file mode
12.10 Importing functions
12.10.1 Running a t-test in Python
12.10.2 Make a simple scatterplot in matplotlib
12.10.3 Running a simple linear regression in Python
12.11 Data handling
References
Index
Back Cover