Variant Calling: Methods and Protocols

This volume provides practical guidance on a variety of techniques and steps to ensure successful variant calling. Chapters detail methods for calling variants ranging from single-nucleotide variants to structural variants, variant calling in specialized data types such as RNA-seq and UMI-tagged sequencing, alignment-free genotyping and SNP calling, variant detection in single-cell DNA sequencing data, variant annotation, and preanalytical quality control. Written in the format of the highly successful Methods in Molecular Biology series, each chapter includes an introduction to the topic, lists step-by-step protocols for executing the algorithms, describes the input and output data, and includes tips on troubleshooting and known pitfalls.

Authoritative and cutting-edge, Variant Calling: Methods and Protocols aims to be a foundation for future studies and to be a source of inspiration for new investigations in the field.  

Editor(s): Charlotte Ng, Salvatore Piscuoglio
Series: Methods in Molecular Biology, 2493
Publisher: Humana Press
Year: 2022

Language: English
Pages: 351
City: New York

Preface
Contents
Contributors
Chapter 1: Data Processing and Germline Variant Calling with the Sentieon Pipeline
1 Introduction
2 Materials
2.1 Platform Requirements
2.2 Installation Procedure for a Linux-Based System
3 Methods
3.1 Typical Usage for DNAseq
3.1.1 General
3.1.2 Map Reads to the Reference
3.1.3 Calculate Quality Metrics
3.1.4 Remove Duplicates
3.1.5 Base Quality Score Recalibration (BQSR)
3.1.6 Variant Calling
3.2 Typical Usage for DNAscope
3.2.1 General
3.2.2 Map Reads to the Reference, Calculate Quality Metrics, and Remove Duplicates
3.2.3 Germline Variant Calling Using a Machine Learning Model
3.3 Limitations of the Machine Learning Model
3.4 Working with Multiple Input Files from the Same Sample
3.5 Complete Pipeline for Processing Whole-Genome Sequencing Data
3.6 Processing Data Coming from Exome Sequencing
4 Benchmark Results
4.1 Performance for DNAseq
4.1.1 Input Data Files
4.1.2 Accuracy Benchmarking Results
4.2 Performance for DNAscope
4.2.1 Input Data Files
4.2.2 Accuracy Benchmarking Results
5 Notes
References
Chapter 2: MuSE: A Novel Approach to Mutation Calling with Sample-Specific Error Modeling
1 Introduction
2 Software
2.1 Pre-Processing
2.2 Download and Install MuSE
2.3 Input Requirements
3 Methods
3.1 Running MuSE
3.2 Output
4 Notes
References
Chapter 3: Octopus: Genotyping and Haplotyping in Diverse Experimental Designs
1 Introduction
2 Software
2.1 Requirements
2.2 Installation
3 Methods
3.1 Basic Usage
3.1.1 General Options
3.1.2 Read Pre-processing
3.1.3 Variant Discovery Options
3.1.4 Haplotype Generating Options
3.1.5 General Variant Calling Options
3.1.6 Variant Filtering Options
3.2 Calling Germline Variants in Individuals
3.3 Calling Variants in Cohorts
3.4 Calling Germline and De Novo Mutations in Trios
3.5 Calling Germline and Somatic Mutations in Bulk Tumors
3.5.1 Paired Tumor-Normal Samples
3.5.2 Reporting Phased Somatic Variants
3.5.3 Tumor-Only Samples
3.6 Calling Variants in Haploid Polyclonal Samples
3.7 Calling Germline and Somatic Variants in Single Cells
3.8 Using Config Files
3.9 Making Realigned Evidence BAMs
3.10 Calling Long Haplotypes
4 Notes
References
Chapter 4: Accurate Ensemble Prediction of Somatic Mutations with SMuRF2
1 Introduction
2 Materials
2.1 Environment
2.2 Input Requirements
2.3 Test Dataset
3 Methods
3.1 Running SMuRF2
3.2 Retrieving Gene Annotation Information
3.3 Tweaking the Precision and Recall of SMuRF2
3.4 Interpreting and Saving SMuRF2 Output
4 Notes
References
Chapter 5: Detecting Medium and Large Insertions and Deletions with transIndel
1 Introduction
2 Software
2.1 Environment
2.2 Input Data
2.3 Reference Genome and Gene Annotation Files
3 Methods
3.1 BAM File Refinement with Corrected CIGAR String
3.2 Indel Detection
3.3 TransIndel Output Interpretation
4 A Running Example
4.1 Data Preparation
4.2 Running transIndel
5 Notes
References
Chapter 6: DECoN: A Detection and Visualization Tool for Exonic Copy Number Variants
1 Introduction
2 Materials
2.1 Software
2.2 Download and Install DECoN
3 Methods
3.1 Step 1: Reading BAM Files to Generate Coverage Metrics
3.1.1 Inputs
3.1.2 Running ReadInBams
3.1.3 Output
3.2 Step 2: Running Quality Checks
3.2.1 Inputs
3.2.2 Running IdentifyFailures
3.2.3 Outputs
3.3 Step 3: Calling Exon CNVs
3.3.1 Inputs
3.3.2 Running makeCNVcalls
3.3.3 Output
3.4 Step 4: Visualizing Calls
3.4.1 Input
3.4.2 Running DECoN Call Visualization
3.4.3 Output
4 Notes
References
Chapter 7: FACETS: Fraction and Allele-Specific Copy Number Estimates from Tumor Sequencing
1 Introduction
1.1 Theoretical Methods
1.2 Joint Segmentation and Estimation of Integer Copy Number
2 Software
2.1 Installation
2.2 Input Requirements
2.3 Test Dataset
3 Methods
3.1 Getting Ready for FACETS
3.2 Segmentation
3.3 Estimate Tumor Purity, Ploidy and Allele-Specific Copy Number
3.4 Integrated Output
3.5 Assess the Fit by Spider Plot
4 Notes
References
Chapter 8: Meerkat: An Algorithm to Reliably Identify Structural Variations and Predict Their Forming Mechanisms
1 Introduction
2 Software
2.1 Prerequisites
2.2 Software Compilation
2.3 Reference Files
2.4 Example of Software and Reference File Arrangements
2.5 Input File
3 Methods
3.1 Pre_process.pl
3.2 Meerkat.pl
3.3 Mechanism.pl
3.4 Somatic_sv.pl
3.5 Discon.pl
3.6 Meerkat2vcf.pl
3.7 Fusions.pl
3.8 Primers.pl
4 Output Formats
4.1 List of Output Files
4.2 Format of prefix.variants File
4.3 Variants File Produced by discon.pl
4.4 VCF File Produced by meerkat2vcf.pl
4.5 Fusions File Produced by fusions.pl
4.6 Output by primers.pl
5 Notes
References
Chapter 9: Structural Variant Detection from Long-Read Sequencing Data with cuteSV
1 Introduction
2 Materials
2.1 Software
2.2 Hardware
2.3 Datasets
3 Methods
3.1 Discovering SVs Using Long-Read Alignments
3.2 Discovering SVs Using Diploid-Assembly Alignments
3.3 Cohort-Based SV Calling
4 Notes
References
Chapter 10: Identifying Somatic Mitochondrial DNA Mutations
1 Introduction
2 Materials
2.1 Input Data
2.2 Tools
3 Methods
3.1 Mitochondrial Region Extraction
3.2 Somatic Variant Calling
3.3 mtDNA-Specific Consideration and Filtering
3.4 Functional Annotation of mtDNA Variants
4 Notes
References
Chapter 11: Identification, Quantification, and Testing of Alternative Splicing Events from RNA-Seq Data Using SplAdder
1 Introduction
2 Software
2.1 Software Requirements
2.2 Download and Installation of SplAdder
2.2.1 Installation from PyPI
2.2.2 Installation from Source
3 Methods
3.1 The Build Mode
3.1.1 Understanding Key Parameters
3.1.2 Description of Output Files
3.2 The Test Mode
3.2.1 Understanding Key Parameters
3.2.2 Description of Output Files
3.3 The viz Mode
3.3.1 Understanding Key Parameters
3.3.2 Description of Output Files
4 Example Run on Public RNA-Seq Dataset
4.1 Download Example Data from SRA
4.2 Read Alignment to the Reference Genome
4.3 Construct Graphs and Detect Events with SplAdder Build Mode
4.3.1 Graph Construction
4.3.2 Graph Quantification
4.3.3 Alternative Option: Simultaneous Quantification Across Samples
4.4 Differential Analysis Using SplAdder Test
4.5 Visualizing Splicing Events Using SplAdder viz
5 Notes
References
Chapter 12: PipeIT2: Somatic Variant Calling Workflow for Ion Torrent Sequencing Data
1 Introduction
2 Software
2.1 Tumor-Germline Workflow
2.2 Tumor-Only Workflow
3 Methods
3.1 Input Files
3.2 Parameters
3.3 Output File
4 Notes
References
Chapter 13: Variant Calling from RNA-seq Data Using the GATK Joint Genotyping Workflow
1 Introduction
2 Materials
2.1 Environment
2.2 Installing Bioinformatic Programs
2.2.1 The GATK Suite
2.2.2 The NCBI SRA Toolkit Programs
2.2.3 The STAR Aligner
2.2.4 The Picard Tools
2.2.5 Samtools, BCFtools, and HTSlib
2.3 Downloading Scripts Used in This Tutorial
2.4 Downloading Sequences from NCBI SRA
3 Methods
3.1 Mapping Reads to Reference (STAR 2-pass)
3.1.1 Generating Genome Index Files
3.1.2 Aligning RNAseq Reads (1st Mapping Pass)
3.1.3 Aligning RNAseq Reads (2nd Mapping Pass)
3.1.4 Sorting Alignment Files
3.1.5 Indexing Alignment Files
3.2 Adding Read Groups
3.3 Marking Duplicate Reads
3.4 Splitting RNAseq Reads
3.4.1 Preparing Reference Genome Files
3.4.2 Running SplitNCigarReads
3.5 Performing Base Quality Score Recalibration
3.5.1 Using BaseRecalibrator
3.5.2 Using ApplyBQSR to Adjust the Scores
3.5.3 Comparing Pre- and Post-recalibration Metrics
3.6 Joint Genotyping Variant Calling
3.6.1 Calling Variants Per-sample (GVCF Mode)
3.6.2 Merging of Files
3.6.3 Joint Genotyping
3.7 Querying VCF Files
3.8 Filtering Variants
3.8.1 Hard-Filtering with Annotation Fields
3.8.2 Selecting Variants for Association Studies
3.9 Examining and Visualizing Alignment Files
3.9.1 Visualizing Reads from Sample SRR5487372 at a Specific Locus
4 Notes
References
Chapter 14: UMI-VarCal: A Low-Frequency Variant Caller for UMI-Tagged Paired-End Sequencing Data
1 Introduction
2 Software
2.1 Dependencies
2.2 Download and Install UMI-VarCal
2.3 Usage
3 Methods
3.1 The Extraction Tool
3.1.1 Required Arguments
3.1.2 Optional Arguments
3.1.3 Example
3.1.4 Output
3.2 The Variant Calling Tool
3.2.1 Required Arguments
3.2.2 Optional Arguments
3.2.3 Example
3.2.4 Output
4 Notes
References
Chapter 15: Alignment-Free Genotyping of Known Variations with MALVA
1 Introduction
2 Materials
2.1 Software
2.2 Download and Install MALVA
3 Methods
3.1 Inputs, Outputs, and Parameters
3.2 Interpreting the Output File
3.3 Running MALVA on Humans (Diploid Organism)
3.4 Running MALVA on Viruses (Haploid Organism)
4 Notes
References
Chapter 16: Kmer2SNP: Reference-Free Heterozygous SNP Calling Using k-mer Frequency Distributions
1 Introduction
2 Materials
2.1 Software
3 Methods
3.1 Kmer2SNP: Single Sample Mode
3.1.1 Inputs and Parameters
3.1.2 Output Files and Their Interpretations
3.2 Kmer2SNP: Population Mode
3.2.1 Inputs and Parameters
3.2.2 Output Files and Their Interpretations
3.3 Resource Usage
4 Notes
References
Chapter 17: Somatic Single-Nucleotide Variant Calling from Single-Cell DNA Sequencing Data Using SCAN-SNV
1 Introduction
2 Software
3 Methods
3.1 Downloading Required BAM Files
3.2 Indexing BAM Files
3.3 Extraction of the SM Tag of a BAM File
3.4 Important Arguments of SCAN-SNV
3.4.1 Required Arguments
3.4.2 Optional Arguments
3.4.3 Spiked-in Arguments
3.5 Performing Variant Calling
3.6 Supplementary Materials
3.6.1 Installing SRA Toolkit and Samtools
3.6.2 Download Reference Genome and SHAPEIT's Haplotype Reference Panel
3.6.3 Downloading and Indexing dbSNP VCF Files
4 Notes
References
Chapter 18: Copy Number Variation Detection by Single-Cell DNA Sequencing with SCOPE
1 Introduction
2 Materials
2.1 Computational Environment
2.2 Software Packages
2.3 Data Input
3 Methods
3.1 Pre-processing
3.2 Getting GC Content, Mappability, and Coverage
3.3 Quality Control
3.4 Normal Cell Identification and Normalization
3.5 Cross-Cell Segmentation and Visualization
3.6 Output
4 Notes
References
Chapter 19: Variant Annotation and Functional Prediction: SnpEff
1 Introduction
1.1 Reference Genome
1.2 Human Annotations
1.3 Clinical Annotations
2 Methods
2.1 Case 1: Functional Annotations and Human Databases
2.2 Case 2: Family Tree
2.3 Build SnpEff Database: SARS-CoV-2
3 Notes
3.1 SnpEff Error and Warning Messages
3.2 Creating a Protein Sequence FASTA File
3.3 Multiple Versions of RefSeq Transcripts
3.4 Show Gene or Transcript Information
References
Chapter 20: Annotating Cancer-Related Variants at Protein-Protein Interface with Structure-PPi
1 Introduction
2 Materials
2.1 Annotation Databases
2.2 Protein Structure Files
2.3 Organisms Builds and Gene Sets
2.4 Inputs and Output Formats, and a Word on Multiplicity
3 Methods
3.1 Resolving Differences in Identifiers and Sequences
3.2 Annotation Engine
3.3 Turning DNA Mutations into Protein Mutations
3.4 Mapping Protein Sequence Locations to Structures
3.5 Calculating Neighborhoods and Interaction Surfaces
3.6 Caching
3.7 URLs
4 Structure-PPi Functionalities
4.1 Low-Level Tasks
4.2 Annotation Tasks
4.3 High-Level Tasks
5 Availability
5.1 Software Package
5.2 Containers
5.3 Command Line Tool
5.4 Structure-PPi Web Interface
5.5 REST Web Services
5.6 Performance
6 Use Cases
6.1 Annotating a VCF File
6.2 Prioritizing Protein Mutations
6.3 Protein Centric Investigation
7 Notes
References
Chapter 21: Preanalytical Variables and Sample Quality Control for Clinical Variant Analysis
1 Introduction
2 Materials
2.1 Sample Types: General Considerations and Challenges
2.2 Formalin-Fixed Paraffin-Embedded Tissue Samples
2.2.1 Tissue Fixation
2.2.2 Tumor Cell Content of Clinical FFPE Samples
2.2.3 Nucleic Acid Extraction and Purification
2.2.4 Nucleic Acid Quantitation
2.2.5 Nucleic Acid Qualification
2.3 Liquid Biopsy Samples
2.3.1 Types of Liquid Biopsy
2.3.2 Sample Collection and Storage
2.3.3 Plasma Isolation
2.3.4 Nucleic Acid Extraction
2.3.5 Nucleic Acid Quantitation and Qualification
3 Methods
3.1 NGS Library Preparation QC
3.1.1 Input Requirements
3.1.2 Library Preparation Technologies
3.1.3 Library QC
3.1.4 Library Concentration
3.2 NGS Post-sequencing Data QC
3.2.1 Sequencing Run QC
Illumina Sequencing
Ion Torrent Sequencing
3.2.2 Per-Sample Sequencing QC
4 Notes
5 Conclusions
References
Index