Discover modern, next-generation sequencing libraries from the powerful Python ecosystem to perform cutting-edge research and analyze large amounts of biological data
Key Features
• Perform complex bioinformatics analysis using the most essential Python libraries and applications
• Implement next-generation sequencing, metagenomics, analysis automation, population genetics, and much more
• Explore various statistical and machine learning techniques for bioinformatics data analysis
Book Description
Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data, and this book will show you how to manage these tasks using Python.
This updated third edition of the Bioinformatics with Python Cookbook begins with a quick overview of the various tools and libraries in the Python ecosystem that will help you convert, analyze, and visualize biological datasets. Next, you'll cover key techniques for next-generation sequencing, single-cell analysis, genomics, metagenomics, population genetics, phylogenetics, and proteomics with the help of real-world examples. You'll learn how to work with important pipeline systems, such as Galaxy servers and Snakemake, and understand the various modules in Python for functional and asynchronous programming. This book will also help you explore topics such as SNP discovery using statistical approaches with high-performance computing frameworks, including Dask and Spark. You'll also explore the application of machine learning algorithms in bioinformatics.
By the end of this book, you'll be equipped with the knowledge you need to implement the latest programming techniques and frameworks, empowering you to deal with bioinformatics data at every scale.
What you will learn
• Become well-versed with data processing libraries such as NumPy, pandas, Arrow, and Zarr in the context of bioinformatics analysis
• Interact with genomic databases
• Solve real-world problems in the fields of population genetics, phylogenetics, and proteomics
• Build bioinformatics pipelines using a Galaxy server and Snakemake
• Work with functools and itertools for functional programming
• Perform parallel processing with Dask on biological data
• Explore principal component analysis (PCA) techniques with scikit-learn
Who this book is for
This book is for bioinformatics analysts, data scientists, computational biologists, researchers, and Python developers who want to address intermediate-to-advanced biological and bioinformatics problems. Working knowledge of the Python programming language is expected. Basic knowledge of biology will also be helpful.
Author(s): Tiago Antao
Edition: 3
Publisher: Packt Publishing
Year: 2022
Language: English
Commentary: Publisher's PDF
Pages: 360
City: Birmingham, UK
Tags: Machine Learning; Python; Bioinformatics; Principal Component Analysis; Apache Spark; Pipelines; Docker; R; Apache Parquet; Ensemble Learning; matplotlib; Jupyter; HDF5; Anaconda; Population Genetics; Biopython; Next-Generation Sequence Data; Proteomics; Genome; Numba; Cython; GenBank; HTSeq; PLINK; Genepop; PyMOL; Galaxy; Dask; Apache Arrow; Simulations; Zarr; Phylogenetics
Cover
Title Page
Copyright and Credits
Contributors
Table of Contents
Preface
Chapter 1: Python and the Surrounding Software Ecology
Installing the required basic software with Anaconda
Getting ready
How to do it...
There’s more...
Installing the required software with Docker
Getting ready
How to do it...
See also
Interfacing with R via rpy2
Getting ready
How to do it...
There’s more...
See also
Performing R magic with Jupyter
Getting ready
How to do it...
There’s more...
See also
Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib
Using pandas to process vaccine-adverse events
Getting ready
How to do it...
There’s more...
See also
Dealing with the pitfalls of joining pandas DataFrames
Getting ready
How to do it...
There’s more...
Reducing the memory usage of pandas DataFrames
Getting ready
How to do it…
See also
Accelerating pandas processing with Apache Arrow
Getting ready
How to do it...
There’s more...
Understanding NumPy as the engine behind Python data science and bioinformatics
Getting ready
How to do it…
See also
Introducing Matplotlib for chart generation
Getting ready
How to do it...
There’s more...
See also
Chapter 3: Next-Generation Sequencing
Accessing GenBank and moving around NCBI databases
Getting ready
How to do it...
There’s more...
See also
Performing basic sequence analysis
Getting ready
How to do it...
There’s more...
See also
Working with modern sequence formats
Getting ready
How to do it...
There’s more...
See also
Working with alignment data
Getting ready
How to do it...
There’s more...
See also
Extracting data from VCF files
Getting ready
How to do it...
There’s more...
See also
Studying genome accessibility and filtering SNP data
Getting ready
How to do it...
There’s more...
See also
Processing NGS data with HTSeq
Getting ready
How to do it...
There’s more...
Chapter 4: Advanced NGS Data Processing
Preparing a dataset for analysis
Getting ready
How to do it…
Using Mendelian error information for quality control
How to do it…
There’s more…
Exploring the data with standard statistics
How to do it…
There’s more…
Finding genomic features from sequencing annotations
How to do it…
There’s more…
Doing metagenomics with QIIME 2 Python API
Getting ready
How to do it...
There’s more...
Chapter 5: Working with Genomes
Technical requirements
Working with high-quality reference genomes
Getting ready
How to do it...
There’s more...
See also
Dealing with low-quality genome references
Getting ready
How to do it...
There’s more...
See also
Traversing genome annotations
Getting ready
How to do it...
There’s more...
See also
Extracting genes from a reference using annotations
Getting ready
How to do it...
There’s more...
See also
Finding orthologues with the Ensembl REST API
Getting ready
How to do it...
There’s more...
Retrieving gene ontology information from Ensembl
Getting ready
How to do it...
There’s more...
See also
Chapter 6: Population Genetics
Managing datasets with PLINK
Getting ready
How to do it...
There’s more...
See also
Using sgkit for population genetics analysis with xarray
Getting ready
How to do it...
There’s more...
Exploring a dataset with sgkit
Getting ready
How to do it...
There’s more...
See also
Analyzing population structure
Getting ready
How to do it...
See also
Performing a PCA
Getting ready
How to do it...
There’s more...
See also
Investigating population structure with admixture
Getting ready
How to do it...
There’s more...
Chapter 7: Phylogenetics
Preparing a dataset for phylogenetic analysis
Getting ready
How to do it...
There’s more...
See also
Aligning genetic and genomic data
Getting ready
How to do it...
Comparing sequences
Getting ready
How to do it...
There’s more...
Reconstructing phylogenetic trees
Getting ready
How to do it...
There’s more...
Playing recursively with trees
Getting ready
How to do it...
There’s more...
Visualizing phylogenetic data
Getting ready
How to do it...
There’s more...
Chapter 8: Using the Protein Data Bank
Finding a protein in multiple databases
Getting ready
How to do it...
There’s more
Introducing Bio.PDB
Getting ready
How to do it...
There’s more
Extracting more information from a PDB file
Getting ready
How to do it...
Computing molecular distances on a PDB file
Getting ready
How to do it...
Performing geometric operations
Getting ready
How to do it...
There’s more
Animating with PyMOL
Getting ready
How to do it...
There’s more
Parsing mmCIF files using Biopython
Getting ready
How to do it...
There’s more
Chapter 9: Bioinformatics Pipelines
Introducing Galaxy servers
Getting ready
How to do it…
There’s more
Accessing Galaxy using the API
Getting ready
How to do it…
Deploying a variant analysis pipeline with Snakemake
Getting ready
How to do it…
There’s more
Deploying a variant analysis pipeline with Nextflow
Getting ready
How to do it…
There’s more
Chapter 10: Machine Learning for Bioinformatics
Introducing scikit-learn with a PCA example
Getting ready
How to do it...
There’s more...
Using clustering over PCA to classify samples
Getting ready
How to do it...
There’s more...
Exploring breast cancer traits using Decision Trees
Getting ready
How to do it...
Predicting breast cancer outcomes using Random Forests
Getting ready
How to do it…
There’s more...
Chapter 11: Parallel Processing with Dask and Zarr
Reading genomics data with Zarr
Getting ready
How to do it...
There’s more...
See also
Parallel processing of data using Python multiprocessing
Getting ready
How to do it...
There’s more...
See also
Using Dask to process genomic data based on NumPy arrays
Getting ready
How to do it...
There’s more...
See also
Scheduling tasks with dask.distributed
Getting ready
How to do it...
There’s more...
See also
Chapter 12: Functional Programming for Bioinformatics
Understanding pure functions
Getting ready
How to do it...
There’s more...
Understanding immutability
Getting ready
How to do it...
There’s more...
Avoiding mutability as a robust development pattern
Getting ready
How to do it...
There’s more...
Using lazy programming for pipelining
Getting ready
How to do it...
There’s more...
The limits of recursion with Python
Getting ready
How to do it...
There’s more...
A showcase of Python’s functools module
Getting ready
How to do it...
There’s more...
See also
Index
Other Books You May Enjoy