Introduction to Computational Metagenomics

Breakthroughs in high-throughput genome sequencing and high-performance computing have empowered scientists to decode many genomes, including our own. Now they have a bigger ambition: to fully understand the vast diversity of microbial communities within us and around us, and to exploit their potential to improve our health and environment. In this new field, called metagenomics, microbial genomes are sequenced directly from their habitats without lab cultivation. Computational metagenomics, however, faces both a data challenge, handling tens of terabases of sequence, and an algorithmic one, handling the complexity of thousands of species and their interactions. This interdisciplinary book is essential reading for those interested in beginning their own journey in computational metagenomics. It serves as a prism through which to view intricate computational metagenomics problems and unravel their three distinctive aspects: metagenomics, data engineering, and algorithms. Graduate students and advanced undergraduates in genomic science or computer science will find that the concepts explained in this book can serve as stepping stones to more advanced topics, while metagenomics practitioners and researchers from related disciplines may use it to broaden their knowledge or identify new research targets.

Author(s): Zhong Wang
Publisher: World Scientific Publishing
Year: 2022

Language: English
Pages: 209
City: Singapore

Contents
Preface
Acknowledgments
Chapter 1. Computational Metagenomics: A Metagenomics Perspective
1.1 Metagenome and Metagenomics
1.2 Metagenomics: Key Scientific Questions
1.2.1 Who is out there?
1.2.2 What are they doing?
1.2.3 How do they interact?
1.3 Metagenome Sequencing: Strategies
1.3.1 Targeted amplicon sequencing
1.3.2 Whole metagenome sequencing
1.3.3 Single-cell amplification genome sequencing
1.4 Metagenome Sequencing: Platforms
1.4.1 Illumina
1.4.2 Pacific Biosciences
1.4.3 Oxford Nanopore Technology
1.4.4 Emerging technologies
1.5 Metagenomics: A Great Promise with Abundant Caution
Chapter 2. Computational Metagenomics: A Data Engineering Perspective
2.1 An Overview of Metagenomics Data Management
2.2 Types of Data in Metagenomics
2.2.1 Data types in the context of metagenomics
2.2.1.1 DNA sequence data
2.2.1.2 Annotation data
2.2.1.3 Metadata
2.2.2 Data types in the context of data engineering
2.2.2.1 CSV and JSON
2.2.2.2 Databases
2.2.2.3 HDF5
2.2.3 Parquet and Arrow
2.2.4 Data types in the context of algorithms
2.2.4.1 Structured vs unstructured data
2.2.4.2 Dense vs sparse data
2.3 Data Governance
2.3.1 Location, location, location
2.3.1.1 On-premises data warehouse
2.3.1.2 Cloud data repository
2.3.1.3 Data silos (such as someone’s external hard drive)
2.3.1.4 A comparison of storage solutions
2.3.2 Data ownership and usage policy
2.4 Data Transfer
2.5 Metagenomics Data Management: Future Perspectives and Cautions
Chapter 3. Computational Metagenomics: An Algorithmic Perspective
3.1 kmer
3.1.1 kmer frequency as sequence representation
3.1.2 kmer graph as sequence representation
3.2 Read
3.3 Contig
3.4 Genome
3.5 Metagenome
3.6 Conclusion and Future Perspectives
Chapter 4. Hardware and Software Aspects for Scalable Analysis
4.1 Hardware Scaling
4.1.1 Managed or hosted hardware scaling on the cloud
4.1.1.1 Amazon Machine Image (AMI)
4.1.1.2 Amazon Elastic Compute Cloud (EC2)
4.1.1.3 Elastic Block Storage (EBS), Elastic File System (Amazon EFS) and Simple Storage System (S3)
4.1.1.4 Virtual Private Cloud (VPC)
4.2 Software Scaling
4.2.1 Task parallelism
4.2.1.1 OpenMP
4.2.1.2 Message Passing Interface (MPI)
4.2.2 Data parallelism
4.2.2.1 Apache Hadoop
4.2.2.2 Apache Spark
4.3 Future Perspectives
Chapter 5. Metagenomics Data Quality Improvement
5.1 Removing Common Errors
5.1.1 Errors in sample metadata and annotation data
5.1.1.1 Supervised methods for metadata correction
5.1.1.2 Unsupervised methods for metadata correction
5.1.2 Errors in sequence data
5.1.2.1 Identifying errors in short reads using Bloom filter
5.1.2.2 Long-read error correction
5.2 Missing Data Imputation
5.3 Remove Irrelevant Data: Data Filtering
5.4 Control Noise and Biases
5.5 Pitfalls in Metagenomics Data Improvement
Chapter 6. Exploring Community Diversity: Taxonomic Analyses
6.1 What is Microdiversity?
6.1.1 Taxonomic diversity
6.1.1.1 Within sample diversity indices: alpha diversity
6.1.1.2 Between sample diversity indices: beta diversity
6.1.1.3 The effect of sampling depth in calculating diversity
6.1.2 Functional diversity
6.2 Taxonomic Classification in Metagenomics
6.2.1 Super features: phylogenetic markers
6.2.1.1 16S/18S sequencing
6.2.1.2 Ribosomal RNA databases
6.2.2 Features based on whole-genome statistics
6.2.2.1 ANI/AAI
6.2.2.2 Multiple phylogenetic markers
6.2.3 Features based on reads
6.3 Future Perspectives
Chapter 7. Functional Metagenomics: Gene and Pathway-Based Anal
7.1 Gene Discovery
7.1.1 Discover genes based on metagenome assembly
7.1.2 Discover genes by protein assembly
7.2 Function Annotation
7.2.1 Alignment-based protein annotation
7.2.2 Model-based protein annotation
7.2.3 Detecting distant protein homology
7.3 Pathway Analysis
7.3.1 Common pathway databases
7.3.1.1 Kyoto Encyclopedia of Genes and Genomes (KEGG)
7.3.1.2 MetaCyc Metabolic Pathway Database
7.3.2 Metabolic pathway profiling
7.3.3 Pathway enrichment analysis
7.3.4 Discovering BGCs
7.4 Future Perspectives
Chapter 8. Deconvolute Community Metagenome into Single Genomes
8.1 An Overview of Metagenome Assembly
8.2 Challenges in Metagenome Assembly
8.2.1 Metagenomics challenges
8.2.1.1 Short read length vs repeats
8.2.1.2 Limited sequencing depth vs community diversity
8.2.1.3 Lack of reference genomes for assembly quality assessment
8.2.2 Data engineering challenges
8.2.3 Algorithmic challenges
8.3 Metagenome Assembly
8.3.1 Metagenome de Bruijn graph construction
8.3.2 Metagenome de Bruijn graph simplification
8.3.3 Parallel graph construction and traversal
8.3.4 Long reads and other types of graphs
8.3.5 Coassembly vs multiassembly
8.4 Metagenome Binning
8.4.1 Sequence composition
8.4.2 Contig abundance
8.4.3 Ensemble binning
8.5 Metagenome Clustering
8.6 Genome Quality Assessment
8.7 Metagenome Assembly in the Context of Rapid Evolving Technology: Longer Reads, Longer Range
8.7.1 Longer reads
8.7.1.1 Synthetic long reads
8.7.1.2 Single-molecule, long read sequencing
8.7.2 Longer range
8.8 Future Perspectives: A Roadmap for a Finished Metagenome Assembly
Chapter 9. Single Cell Metagenomics
9.1 Single-cell Amplified Genome (SAG)
9.2 Unique Challenges and Solutions Associated with SAG Assembly
9.2.1 Contamination
9.2.2 Uneven coverage
9.2.3 Drop-out genomic regions
9.2.4 Chimeras
9.3 Leveraging MAGs for SAGs, the Best of Two Worlds?
9.4 Future Perspectives
Chapter 10. Interactions Between Microbes and Their Environment
10.1 Interactions Within a Community
10.1.1 Identifying phage-bacteria pairs
10.1.2 Identifying other types of relationships
10.2 The Impact of Microbial Communities on Their Environment
10.2.1 Enterotype-based study of microbe-host interactions
10.2.2 Function-based study of microbe-host interactions
10.3 The Influence of Environment on Metagenome Communities
10.4 Advances in Microbial Community Modeling
10.4.1 Individual microbial genome models
10.4.2 Microbial community models
10.5 Future Perspectives
Bibliography
Index