Author(s): Eric C Anderson
Language: English
Pages: 368
Tags: practical computing Conservation and Evolutionary Genomics
List of Tables
List of Figures
Preface
Introduction
Eric's Notes of what he might do
Table of topics
Part I: Essential Computing Skills
Overview of Essential Computing Skills
Essential Unix/Linux Terminal Knowledge
Getting a bash shell on your system
Navigating the Unix filesystem
Changing the working directory with cd
Updating your command prompt
TAB-completion for paths
Listing the contents of a directory with ls
Globbing
What makes a good file-name?
The anatomy of a Unix command
The command
The options
Arguments
Getting information about Unix commands
Handling, Manipulating, and Viewing files and streams
Creating new directories
Fundamental file-handling commands
"Viewing" Files
Redirecting standard output: > and >>
stdin, < and |
stderr
Symbolic links
File Permissions
Editing text files at the terminal
Customizing your Environment
Appearances matter
Where are my programs/commands at?!
A Few More Important Keystrokes
A short list of additional useful commands.
Two important computing concepts
Compression
Hashing
Unix: Quick Study Guide
Shell programming
An example script
The Structure of a Bash Script
A bit more on ; and &
Variables
Assigning values to variables
Accessing values from variables
What does the shell do with the value substituted for a variable?
Double and Single Quotation Marks and Variable Substitution
One useful, fancy, variable-substitution method
Integer Arithmetic with Shell Variables
Variable arrays
Evaluate a command and substitute the result on the command line
Grouping/Collecting output from multiple commands: (commands) and { commands; }
Exit Status
Combinations of exit statuses
Loops and repetition
More Conditional Evaluation: if, then, else, and friends
Finally…positional parameters
basename and dirname: two useful little utilities
bash functions
reading files line by line
Further reading
Sed, awk, and regular expressions
awk
Line-cycling, tests and actions
Column splitting, fields, -F, $, NF, print, OFS and BEGIN
A brief introduction to regular expressions
A variety of tests
Code in the action blocks
Using awk to assign to shell variables
Passing Variables into awk with -v
Writing awk scripts in files
sed
Working on remote servers
Accessing remote computers
Windows
Hummingbird
Summit
Sedna
Transferring files to remote computers
sftp (via lftp)
git
Globus
Interfacing with "The Cloud"
Getting files from a sequencing center
tmux: the terminal multiplexer
An analogy for how tmux works
First steps with tmux
Further steps with tmux
tmux for Mac users
Installing Software on an HPCC
Modules
Miniconda
Installing Java Programs
vim: it's time to get serious with text editing
Using neovim and Nvim-R and tmux to use R well on the cluster
High Performance Computing Clusters (HPCCs)
An oversimplified, but useful, view of a computing cluster
Cluster computing and the job scheduler
Learning about the resources on your HPCC
Getting compute resources allocated to your jobs on an HPCC
Interactive sessions
Batch jobs
SLURM Job Arrays
PREPARATION INTERLUDE: An in-class exercise to make sure everything is configured correctly
More Boneyard…
The Queue (SLURM/SGE/UGE)
Modules package
Compiling programs without admin privileges
Job arrays
Writing stdout and stderr to files
Breaking stuff down
Part II: Reproducible Research Strategies
Introduction to Reproducible Research
RStudio and Project-centered Organization
Organizing big projects
Using RStudio in workflows with remote computers and HPCCs
Keeping an RStudio project "in sync" with GitHub
Evaluating scripts line by line on a remote machine from within RStudio
Version control
Why use version control?
How git works
git workflow patterns
using git with RStudio
git on the command line
A fast, furious overview of the tidyverse
Authoring reproducibly with Rmarkdown
Notebooks
References
Zotero and Rmarkdown
Bookdown
Google Docs
Using python
Part III: Bioinformatic Analyses
Overview of Bioinformatic Analyses
DNA Sequences and Sequencing
DNA Stuff
DNA Replication with DNA Polymerase
The importance of the 3' hydroxyl…
Sanger sequencing
Illumina Sequencing by Synthesis
Library Prep Protocols
WGS
RAD-Seq methods
Amplicon Sequencing
Capture arrays, RAPTURE, etc.
Bioinformatic file formats
Sequences
FASTQ
Line 1: Illumina identifier lines
Line 4: Base quality scores
A FASTQ 'tidyverse' Interlude
Comparing read 1 to read 2
FASTA
Genomic ranges
Extracting genomic ranges from a FASTA file
Downloading reference genomes from NCBI
Alignments
How might I align to thee? Let me count the ways…
Play with simple alignments
SAM Flags
The CIGAR string
The SEQ and QUAL columns
SAM File Headers
The BAM format
Quick self study
Variants
VCF Format – The Body
VCF Format – The Header
Boneyard
Segments
Conversion/Extractions between different formats
Visualization of Genomic Data
Sample Data
Genome Assembly
Alignment of sequence data to a reference genome (and associated steps)
The Journey of each DNA Fragment from Organism to Sequencing Read
Read Groups
Aligning reads with bwa
Indexing the genome for alignment
Mapping reads with bwa mem
Hold it Right There, Buddy! What about the Read Groups?
Processing alignment output with samtools
samtools subcommands
BONEYARD BELOW HERE
Preprocess ?
Quick notes to self on chaining things:
Merging BAM files
Divide and Conquer Strategies
Variant calling
Genotype Likelihoods
Basic Sketch of Genotype Likelihood Calculations
Specifics of different genotype likelihoods
Computing genotype likelihoods with three different software packages
A Directed Acyclic Graph For Genotype Likelihoods
Boneyard
Basic Handling of VCF files
bcftools
Tell me about my VCF file!
Get fragments/parts of my VCF file
Combine VCF files in various ways
Filter out variants for a variety of reasons
Bioinformatics for RAD seq data with and without a reference genome
Processing amplicon sequencing data
Genome Annotation
Whole genome alignment strategies
Mapping of scaffolds to a closely related genome
Obtaining Ancestral States from an Outgroup Genome
Using LASTZ to align coho to the chinook genome
Try on the chinook chromosomes
Explore the other parameters more
Part IV: Analysis of Big Variant Data
Bioinformatic analysis on variant data
Part V: Population Genomics
Topics in pop gen
Coalescent
Measures of genetic diversity and such
Demographic inference with ∂a∂i and moments
Balls in Boxes
Some landscape genetics
Relationship Inference
Tests for Selection
Multivariate Associations, GEA, etc.
Estimating heritability in the wild
Bibliography