This volume provides protocols for computational, statistical, and machine learning methods that are mainly applied to the study of metabolic engineering, synthetic biology, and disease applications. These techniques support the latest progress in cross-disciplinary research that integrates the different scales of biological complexity. The book is geared toward researchers with a background in engineering and with computational, analytical, and modeling experience, and it covers a broad range of computational and machine learning approaches. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls.
Comprehensive and practical, Computational Biology and Machine Learning for Metabolic Engineering and Synthetic Biology is a valuable resource for any researcher or scientist who wants to learn more about the latest computational methods and how they are applied toward the understanding and prediction of complex biology.
Author(s): Kumar Selvarajoo
Series: Methods in Molecular Biology, 2553
Publisher: Humana Press
Year: 2022
Language: English
Pages: 456
City: New York
Preface
Contents
Contributors
Chapter 1: Challenges to Ensure a Better Translation of Metabolic Engineering for Industrial Applications
1 Introduction
1.1 Historical Perspective
1.2 Economic Constraints and a Requirement for Underpinning Multidisciplinary Outlook
1.3 Optimizing Complex Branched Secondary Metabolite Pathways
1.4 Genetic Stability
1.5 The Concept of Genomic Safe Harbors
1.6 Identification of Integration Loci in Microbial Cell Factories
1.7 Genome Editing, a Powerful Tool to Target Desired Sites
1.8 Creation of Landing Pads
1.9 Creating Fermentation-Friendly Chassis Organisms
2 Conclusions
References
Chapter 2: Synthetic Biology Meets Machine Learning
1 Introduction
2 Computational Tools for Synthetic Biology Applications
2.1 Machine Learning for Cell and Protein Engineering
2.2 Computational Tools for Metabolic Engineering
3 Conclusions and Future Outlooks
References
Chapter 3: Design and Analysis of Massively Parallel Reporter Assays Using FORECAST
1 Introduction
2 Materials
2.1 Software Dependencies
2.2 Installation
3 Methods
3.1 Simulation of MPRA Experiments
3.2 Inferring Construct Performance from MPRA Data
3.3 Optimizing the Design of MPRA Experiments
3.4 Assessing the Accuracy of the Inferred Distributions
3.5 Visually Comparing Individual Inferred Distributions
4 Notes
References
Chapter 4: Modeling Protein Complexes and Molecular Assemblies Using Computational Methods
1 Introduction
2 Methods for Building a 3D Model of a Protein
2.1 Template-Based Methods
2.2 Template-Free or Ab Initio Methods
2.3 Servers for Protein Structure Prediction and Related Databases
2.3.1 MODELLER via ModWeb and ModBase
2.3.2 PHYRE2
2.3.3 I-TASSER
2.3.4 trRosetta
2.3.5 AlphaFold2 Method and Structural Database
3 Protein-Protein Interaction Prediction Using Coevolution
4 Protein Assembly Prediction and Analysis
4.1 Protein-Protein Docking: Principles and Methods
4.2 ZDOCK
4.3 InterEvDock3
5 Case Study: Modeling the Succinate-Quinone Oxidoreductase Heterocomplex
5.1 Building a 3D Model Using AlphaFold2: SQR Subunits, SdhA, and SdhC
5.2 Building a 3D Model Using I-TASSER: SQR Subunits, SdhB, and SdhD
5.3 Modeling SdhA-SdhB and SdhC-SdhD Using Protein-Protein Docking and Coevolution Information
5.4 Modeling the Succinate-Quinone Oxidoreductase Heterocomplex Using Protein-Protein Docking and Restraints
6 Conclusions
References
Chapter 5: From Genome Mining to Protein Engineering: A Structural Bioinformatics Route
1 Introduction
1.1 Genome Mining: Finding Your Needle in a Haystack
1.2 Protein Structure Prediction: Then and Now
1.2.1 Programs and Servers for Comparative Modeling
1.2.2 The Rise of AlphaFold
1.3 Computational Docking of Small Molecules
1.3.1 Available Docking Software
1.4 Protein Engineering for Alteration of Functional Properties
2 Methods
2.1 Searching Genomes and Proteomes with BLAST and HMMER
2.1.1 Searching for Biosynthetic Gene Clusters with antiSMASH
2.2 Modeling Protein Structures with AlphaFold Through Google Colab
2.3 Small Molecule Docking with AutoDock Vina
2.4 Protein Engineering Strategies for Structural Models
2.4.1 General Stability
2.4.2 Activity/Specificity
2.4.3 Thermostability
3 Conclusion
4 Notes
References
Chapter 6: Creating De Novo Overlapped Genes
1 Introduction
2 Materials
2.1 Hardware
2.2 Software
3 Methods
3.1 Choose Protein Sequences to Overlap
3.2 Download Target Protein and Coding Sequences
3.3 Gathering Additional Sequences with HHblits
3.4 Gathering Additional Sequences with PSI-BLAST
3.5 Perform Multiple Sequence Alignment Using MAFFT
3.6 Creation of a Protein Generative Model
3.6.1 Training HMM Using HMMER
3.6.2 Training Markov Random Field Using CCMpred
3.6.3 Summarizing Pseudo-Likelihoods/Energies
3.6.4 Setting Up Folder Structure
3.6.5 Running CAMEOS
3.6.6 Evaluating CAMEOS Results
3.7 Putting It All Together
4 Notes
References
Chapter 7: Design of Gene Boolean Gates and Circuits with Convergent Promoters
1 Introduction
2 Methods
2.1 Convergent Promoters
2.2 Boolean Gates Based on Convergent Promoters
2.2.1 Two-Input Boolean Gates
2.2.2 Three-Input Boolean Gates
2.3 Modeling Transcription Interference
2.3.1 The Model by Sneppen and Co-authors
2.3.2 Occlusion
2.3.3 Sitting Duck
2.3.4 Collision
2.4 Modeling and Constructing Logic Gates via Transcriptional Interference
2.5 RNA Polymerase II Collision and Composable Parts
2.5.1 RNApII Collision Without Transcription Regulation: A Simple Transcription Unit
2.5.2 Modeling a NOT Gate
2.5.3 Modeling a Two-Input NOR Gate
3 Conclusions
4 Notes
References
Chapter 8: Computational Methods for the Design of Recombinase Logic Circuits with Adaptable Circuit Specifications
1 Introduction
2 Methods
2.1 Circuit Specification
2.2 Logic Function
2.2.1 Boolean Logic
2.2.2 History-Dependent Logic
2.3 The CALIN Web Interface for Multicellular Design
2.3.1 Asynchronous Boolean Logic
2.3.2 History-Dependent Logic
2.4 RECOMBINATOR Database for Single-Layer Design
3 Conclusion
References
Chapter 9: Designing a Model-Driven Approach Towards Rational Experimental Design in Bioprocess Optimization
1 Introduction
2 Materials and Methods
2.1 Plasmid Construction
2.2 Growth Decoupling Strategy
2.3 Flask Study Characterization
2.4 Bioreactor Study Characterization
2.5 Cell Factory Kinetic Modeling
2.6 Flask-Scale Model Development and Validation
2.7 Bioreactor-Scale Cell Model Development and Validation
2.8 CFD Simulation Setup and Integrated Modeling
2.9 CFD Results Visualization
References
Chapter 10: Modeling Subcellular Protein Recruitment Dynamics for Synthetic Biology
1 Introduction
2 Materials
2.1 Personal Computer
3 Methods
3.1 Modeling Recruitment Kinetics and Endpoint Dynamics
3.2 Analyzing Efficiency of Recruitment
3.3 Modeling Spatial Dynamics of Recruitment
3.4 Analyzing Recruitment and Diffusional Spread
4 Notes
References
Chapter 11: Genome-Scale Modeling and Systems Metabolic Engineering of Vibrio natriegens for the Production of 1,3-Propanediol
1 Introduction
2 Materials
2.1 Data Sources and Software
2.2 Strain and Media Recipes
2.3 Plasmid and DNA Cassette
3 Method
3.1 Genome-Scale Modeling
3.1.1 Draft Reconstruction
3.1.2 Reconstruction Refinement
3.1.3 Reconstruction of the Mathematical Model
3.1.4 Network Evaluation
3.1.5 Data Assembly and Dissemination
3.2 Gene-Editing Protocol (Fig. 2)
3.2.1 Introduction of Plasmid pXMJ19-tfoX
3.2.2 Natural Transformation
3.2.3 Elimination of the Selection Marker and Curation of Plasmid pXMJ19-tfoX
3.3 Systems Metabolic Engineering of V. natriegens for the Production of 1,3-Propanediol
4 Notes
References
Chapter 12: Application of GeneCloudOmics: Transcriptomic Data Analytics for Synthetic Biology
1 Introduction
2 Materials
3 Methods
3.1 Overview of GeneCloudOmics Server
3.1.1 Supports Different Transcriptomic Data Types
3.1.2 Provides Multiple Preprocessing Methods
3.1.3 Performs Nine Biostatistical Tests
3.1.4 Identifying Differentially Expressed Genes (DEG)
3.1.5 Interprets and Analyzes Gene and Protein Lists
3.1.6 Creates a Customized Analysis Report
3.2 Statistical Tests in Transcriptomic Data Analysis
3.2.1 Scatter Plot
3.2.2 Distribution Fitting
3.2.3 Correlation
Pearson Correlation
Spearman Correlation
3.2.4 PCA
3.2.5 Heatmap and Gene Clustering
3.2.6 Transcriptome-Wide Average Noise
3.2.7 Entropy
3.2.8 Random Forest Clustering
3.2.9 Self-Organizing Map (SOM)
3.2.10 t-Distributed Stochastic Neighbor Embedding (t-SNE)
3.3 Bioinformatics Analysis of the Differentially Expressed Genes
3.3.1 Gene Ontology (GO) Annotation
3.3.2 Pathway Enrichment Analysis
3.3.3 Protein-Protein Interaction
3.3.4 Complex Enrichment
3.3.5 Protein Function
3.3.6 Protein Subcellular Localization
3.3.7 Protein Domains
3.3.8 Tissue Expression
3.3.9 Gene Co-expression
3.3.10 Protein Physicochemical Properties
3.3.11 Protein Evolutionary Analysis
3.3.12 Protein Pathological Analysis
3.4 Transcriptomic Data Analysis Using GeneCloudOmics
3.4.1 The Required Data
3.4.2 Importing or Uploading Data to GeneCloudOmics
3.4.3 Data Preprocessing
3.4.4 Biostatistical Analysis of Normalized Data
Scatter Plot
Distribution Fitting
Principal Component Analysis (PCA)
Correlation
Pearson Correlation
Spearman Correlation
Noise Analysis
Shannon Entropy
3.4.5 Differential Gene Expression Analysis
DE Analysis
Heatmap
Self-Organizing Map (SOM)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
3.4.6 Bioinformatics and Functional Annotations of the DEG
Gene Ontology (GO) Association Analysis
Pathway Enrichment Analysis
Protein-Protein Interactions
Protein Functions and Subcellular Localization
Protein and Gene Properties
3.4.7 Result Interpretation
4 Notes
5 Conclusions
References
Chapter 13: Overview of Bioinformatics Software and Databases for Metabolic Engineering
1 Introduction
2 A Modular View of the Metabolome
3 Metabolic Pathway Databases and Tools
3.1 Pathway Databases
3.1.1 Informatics Access of Metabolic Networks: An In-Depth Example
3.1.2 Analysis of Pathway Enrichment
3.2 Drug Compound Databases
4 Linking Metabolomic Data with High-Throughput Omics Profiles
5 Conclusions and Future Directions
References
Chapter 14: Computational Simulation of Tumor-Induced Angiogenesis
1 Introduction
2 Methods
2.1 Protocols
2.1.1 Initialization
2.2 Implementation
2.3 Simulated Results
References
Chapter 15: Computational Methods and Deep Learning for Elucidating Protein Interaction Networks
1 Introduction
1.1 Protein Interaction Identification Methods
1.2 Machine Learning
1.2.1 Evaluation of Machine Learning Models
1.3 Protein Interaction Databases
1.3.1 Primary Databases
1.3.2 Secondary Databases
2 Methods
2.1 Feature Extraction
2.1.1 Sequence Features
2.1.2 Evolutionary Features
2.1.3 Domain-Based Features
2.1.4 Motif-Based Features
2.1.5 Other Structural Features
2.1.6 Network Topology-Based Features
2.1.7 Feature Extraction and Encoding of Other Binding Partners: DNA, RNA, and Small Molecules
Nucleic Acids: DNA and RNA
Small Molecules
2.2 Applications
2.2.1 PPI Networks to Understand Disease
2.2.2 Protein Function Prediction
2.2.3 Protein-Drug Interaction Site Prediction Using PPIs
2.3 Template-Based Methods of Protein Interaction Prediction
2.3.1 Homology-Based Approaches
2.3.2 Interface-Based Methods
2.3.3 Gene-Based Methods
2.3.4 Network Topology-Based Approaches
2.4 Learning-Based Methods to Identify Protein Interactions
2.4.1 Machine Learning-Based Methods
2.4.2 Decision Tree-Based Method
2.4.3 Probabilistic/Bayesian Classification
2.4.4 Artificial Neural Networks
2.4.5 Clustering
2.5 Challenges and Limitations
2.5.1 Reliability of Protein Interaction Data
2.5.2 Data Integration
2.5.3 Dynamic Protein Network Construction
2.5.4 Evaluation of Protein Interaction Networks
2.5.5 Lack of Data
2.5.6 Overfitting
2.5.7 Data Imbalance
2.5.8 Lack of Interpretability
2.5.9 Uncertainty Scaling
2.5.10 Catastrophic Forgetting
2.6 Standard Modeling Protocol
2.6.1 Computing Resources
Software Installations
Machine Learning Frameworks
2.6.2 Data Processing, Model Building, and Evaluation
2.7 Perspectives
References
Chapter 16: Machine Learning Methods for Survival Analysis with Clinical and Transcriptomics Data of Breast Cancer
1 Introduction
2 Backgrounds
2.1 Survival Analysis
2.2 Cox Proportional Hazards Model
2.3 Machine Learning Models
2.3.1 Random Survival Forests
2.3.2 Gradient Boosted Survival
2.3.3 Survival Support Vector Machine
2.4 Feature Selection
3 Methods
3.1 Dataset
3.2 Study Design
3.3 Initial Setting
3.4 Experiment 1: Clinical Data
3.4.1 Load Data
3.4.2 Preprocess and Explore Data
3.4.3 Plot Cox Proportional Hazards Model
3.4.4 Set Up and Evaluate Machine Learning Algorithms
3.4.5 Interpret Model
3.5 Experiment 2: Transcriptomic Data
3.5.1 Load Data
3.5.2 Preprocess Data
3.5.3 Feature Selection
3.5.4 Plot Cox Proportional Hazards Model
3.5.5 Set Up and Evaluate Machine Learning Algorithms
3.5.6 Interpret Model
3.6 Experiment 3: Integrating Clinical to Transcriptomic Data
3.6.1 Load Data
3.6.2 Preprocess and Explore Data
3.6.3 Plot Cox Proportional Hazards Model
3.6.4 Set Up and Evaluate Machine Learning Algorithms
3.6.5 Interpret Model
4 Conclusions
References
Chapter 17: Machine Learning Using Neural Networks for Metabolomic Pathway Analyses
1 Introduction
1.1 Metabolomics
1.2 Machine Learning
1.3 ML Applications in Metabolomics
2 Methods
2.1 Neural Network
2.2 Training a Neural Network
3 Illustrative Example
3.1 Dataset
3.2 Dataset Preparation and Feature Engineering
3.2.1 Standardization
3.2.2 Clustering Analysis
3.2.3 Descriptor Generation
3.3 Model Setup and Training
3.4 Model Performance
4 Conclusion
References
Chapter 18: Machine Learning and Hybrid Methods for Metabolic Pathway Modeling
1 Introduction
1.1 From Mechanistic to ML Models, There, and Back Again
1.2 Improving Cell Metabolism Modeling with ML
1.2.1 Integration of in Silico Mechanistic Modeling Results with Other Omics Data
1.2.2 Determination of Parameters for Mechanistic Models from Data- or Theory-Driven ML
2 Materials
3 Methods
3.1 Using Mechanistic Models to Produce Data for Incorporation into ML Classifiers
3.1.1 MS-Based Lipidomic and Metabolomic Data
3.1.2 NMR-Based Data
3.2 Prepare Omics Data for Further Model Development
3.3 Develop a Mechanistic Model of Metabolic Processes of Interest
3.4 Integrate Mechanistic Model of Metabolic Processes with ML
3.5 Examples of Methods
3.6 Determination of Parameters for Mechanistic Models from Data- or Theory-Driven ML
References
Chapter 19: A Machine Learning-Based Approach Using Multi-omics Data to Predict Metabolic Pathways
1 Introduction
2 Methods
2.1 Early Concatenation of Data
2.2 Concatenation of Data at a Later Stage
2.3 Integration of Data as Transformation
3 Application
3.1 Supervised Learning
3.1.1 Test Case: Deep Learning-Based Multi-omics Integration
3.2 Unsupervised Learning
3.2.1 Test Case: Multidata Integration of Metabolomics and Transcriptomics to Reveal the Modulation Network of Cell Regulation
4 Notes
5 Summary
References
Index