In order to study living organisms, scientists not only study them at an overall macroscopic scale but also on a more detailed microscopic scale. This observation, pushed to its limits, consists of investigating the very center of each cell, where we find the molecules that determine the way it functions: DNA (deoxyribonucleic acid) and RNA (ribonucleic acid).
In an organism, DNA carries the genetic information, which is called the genome. It is represented as four-letter sequences using the letters A, C, G and T; based on these sequences, computer methods described in this book can answer fundamental questions in bioinformatics.
This book explores how to quickly find sequences of a few hundred nucleotides within a genome that may be made up of several billion, how to compare those sequences and how to reconstruct the complete sequence of a genome. It also discusses the problems of identifying bacteria in a given environment and predicting the structure of RNA based on its sequence.
Author(s): Annie Chateau, Mikael Salson
Series: Computer Science: Bioinformatics
Publisher: Wiley-ISTE
Year: 2022
Language: English
Pages: 267
City: London
Cover
Half-Title Page
Title Page
Copyright Page
Contents
Preface
Author Biographies
Chapter 1. Methodological Concepts: Algorithmic Solutions of Bioinformatics Problems
1.1. Data, models, problem formalism in bioinformatics
1.1.1. Data
1.1.2. Genome modeling
1.1.3. Problems in bioinformatics
1.2. Mathematical preliminaries
1.2.1. Propositional logic preliminaries
1.2.2. Preliminaries on sets
1.3. Vocabulary in text algorithmics
1.4. Graph theory
1.4.1. Subgraphs
1.4.2. Path in a graph
1.4.3. Matching
1.4.4. Planarity
1.4.5. Tree decomposition
1.5. Algorithmic problems
1.5.1. Definition
1.5.2. Graph problem
1.5.3. Satisfiability problems
1.6. Problem solutions
1.6.1. Algorithm
1.6.2. Complexity
1.6.3. Runtime
1.7. Complexity classes
1.7.1. Generality
1.7.2. Exact algorithms
1.7.3. Approximation algorithms
1.7.4. Solvers
1.8. Some algorithmic techniques
1.8.1. Dynamic programming
1.8.2. Tree traversal
1.9. Validation
1.9.1. The different types of errors
1.9.2. Quality measures
1.9.3. And in the non-binary case?
1.10. Conclusion
1.11. References
Chapter 2. Sequence Indexing
2.1. Introduction
2.1.1. What is indexing?
2.1.2. When to index?
2.1.3. What to index?
2.1.4. Indexing structures and queries considered
2.1.5. Basic notions and vocabulary
2.2. Word indexing
2.2.1. Bloom filters
2.2.2. Inverted list
2.2.3. De Bruijn graphs
2.2.4. Efficient structures for targeted queries
2.3. Full-text indexing
2.3.1. Suffix tree
2.3.2. (Extended) suffix array
2.3.3. Burrows–Wheeler transform
2.4. Indexing choice criteria
2.4.1. Based on the type of the necessary query
2.4.2. Based on the space-time and data quantity trade-off
2.4.3. Based on the need to add or modify indexed data
2.4.4. Indexing choices according to applications
2.5. Conclusion and perspectives
2.5.1. Efficient methods for indexing a few genomes or sequencing sets
2.5.2. Methods that struggle to take advantage of data redundancy
2.6. References
3. Sequence Alignment
3.1. Introduction
3.1.1. What is pairwise alignment?
3.1.2. How to evaluate an alignment?
3.2. Exact alignment
3.2.1. Representation in edit graph form
3.2.2. Global alignment and Needleman–Wunsch algorithm
3.2.3. Local alignment and Smith–Waterman algorithm
3.2.4. Alignment with affine indel function and the Gotoh
3.3. Heuristic alignment
3.3.1. Seeds
3.3.2. Min-hash and global sampling
3.3.3. Minimizing and local sampling
3.4. References
Chapter 4. Genome Assembly
4.1. Introduction
4.2. Sequencing technologies
4.2.1. Short reads
4.2.2. Long reads
4.2.3. Linked reads
4.2.4. Hi-C reads
4.2.5. Optical mapping
4.3. Assembly strategies
4.3.1. The main steps
4.3.2. Cleaning and correction of reads
4.3.3. Scaffold construction
4.3.4. Scaffold ordering
4.4. Scaffold construction methods
4.4.1. Greedy assembly
4.4.2. OLC assembly
4.4.3. DBG assembly
4.4.4. Constrained assembly
4.5. Scaffold-ordering methods
4.5.1. Hi-C data-based methods
4.5.2. Optical mapping-based methods
4.6. Assembly validation
4.6.1. Metrics
4.6.2. Read realignment
4.6.3. Gene prediction
4.6.4. Competitions
4.7. Conclusion
4.8. References
Chapter 5. Metagenomics and Metatranscriptomics
5.1. What is metagenomics?
5.1.1. Motivations and historical context
5.1.2. The metagenomic data
5.1.3. Bioinformatics challenges for metagenomics
5.2. “Who are they”: taxonomic characterization of microbial communities
5.2.1. Methods for targeted metagenomics
5.2.2. Whole-genome methods with reference
5.2.3. Reference-free methods
5.3. “What are they able to do?”: functional metagenomics
5.3.1. Gene prediction and annotation
5.3.2. Metatranscriptomics
5.3.3. Reconstruction of metabolic networks
5.4. Comparative metagenomics
5.4.1. Comparative metagenomics with diversity estimation
5.4.2. De novo comparative metagenomics
5.5. Conclusion
5.6. References
Chapter 6. RNA Folding
6.1. Introduction
6.1.1. RNA folding
6.1.2. Secondary structure
6.2. Optimization for structure prediction
6.2.1. Computing the minimum free-energy (MFE) structure
6.2.2. Listing (sub)optimal structures
6.2.3. Comparative prediction: simultaneous alignment/folding of RNAs
6.2.4. Joint alignment/folding model
6.3. Analyzing the Boltzmann ensemble
6.3.1. Computing the partition function
6.3.2. Statistical sampling
6.3.3. Boltzmann probability of structural patterns
6.4. Studying RNA structure in practice
6.4.1. The Turner model
6.4.2. Tools
6.5. References
Conclusion
List of Authors
Index
EULA