Building reproducible analytical pipelines with R

Build reproducible analytical pipelines that output consistent, high-quality data products using R, GitHub and Docker. Learn about functional and literate programming to keep your code concise and easier to test, and package it so that it is easy to share and easy for others to understand. Run your pipelines on GitHub Actions and focus on what matters: analysing data!

This book will not teach you about the R programming language, machine learning, statistics or visualisation. The goal is to teach you a set of tools, practices and project management techniques that should make your projects easier to reproduce, replicate and retrace. These tools and techniques can be used right from the start of your project at minimal cost, so that once you’re done with the analysis, you’re also done with making the project reproducible. Your projects are going to be reproducible simply because they were engineered, from the start, to be reproducible. Building on your knowledge of R, you will learn about several packages for building reproducible analytical pipelines ({renv}, {targets} and {fusen}), as well as trunk-based development with Git and GitHub, and Docker.

Author(s): Bruno Rodrigues
Publisher: Independently published
Year: 2023

Language: English
Pages: 522

Welcome!
How using a few ideas from software engineering can help data scientists, analysts and researchers write reliable code
Preface
1 Introduction
1.1 Who is this book for?
1.2 What is the aim of this book?
1.3 Prerequisites
1.4 What actually is reproducibility?
1.4.1 Using open-source tools to build a RAP is a hard requirement
1.4.2 There are hidden dependencies that can hinder the reproducibility of a project
1.4.3 The requirements of a RAP
1.5 Are there different types of reproducibility?
2 Before we start
2.1 Essential knowledge
3 Project start
3.1 Housing in Luxembourg
3.2 Saving trapped data from Excel
3.3 Analysing the data
3.4 Your project is not done
3.4.1 How easy would it be for someone else to rerun the analysis?
3.4.2 How easy would it be to update the project?
3.4.3 How easy would it be to reuse this code for another project?
3.4.4 What guarantee do we have that the output is stable through time?
3.5 Conclusion
4 Version control with Git
4.1 Installing Git and opening a GitHub account
4.2 Git superbasics
4.3 Git and GitHub
4.4 Getting to know GitHub
4.5 Conclusion
5 Collaborating using Trunk-based development
5.1 Collaborating as a team
5.1.1 TBD basics
5.1.2 Handling conflicts
5.1.3 Make sure you blame the right person
5.1.4 Simplified trunk-based development
5.1.5 Conclusion
5.2 Contributing to public repositories
5.3 Further reading
6 Functional programming
6.1 Introduction
6.1.1 The state of your program
6.1.2 Predictable functions
6.1.3 Referentially transparent and pure functions
6.2 Writing good functions
6.2.1 Functions are first-class objects
6.2.2 Optional arguments
6.2.3 Safe functions
6.2.4 Recursive functions
6.2.5 Anonymous functions
6.2.6 The Unix philosophy applied to R
6.3 Lists: a powerful data-structure
6.3.1 Lists all the way down
6.3.2 Lists can hold many things
6.3.3 Lists as the cure to loops
6.3.4 Data frames
6.4 Functional programming in R
6.4.1 Base capabilities
6.4.2 purrr
6.4.3 withr
6.5 Conclusion
7 Literate programming
7.1 A quick history of literate programming
7.2 {knitr} basics
7.2.1 Set up
7.2.2 Markdown ultrabasics
7.3 Keeping it DRY
7.3.1 Generating R Markdown code from code
7.3.2 Tables in R Markdown documents
7.3.3 Parametrized reports
7.4 Conclusion
8 Conclusion of part 1
9 Rewriting our project
9.1 An Rmd for cleaning the data
9.2 An Rmd for analysing the data
9.3 Conclusion
10 Basic reproducibility: freezing packages
10.1 Recording packages’ version with {renv}
10.1.1 Daily {renv} usage
10.1.2 Collaborating with {renv}
10.1.3 {renv}’s shortcomings
10.2 Becoming an R-cheologist
10.3 Conclusion
11 Packaging your code
11.1 Benefits of packages
11.2 {fusen} quickstart
11.3 Turning our Rmds into a package
11.4 Including datasets
11.5 Installing and sharing the package
11.5.1 Code is hosted
11.5.2 Code cannot be hosted
11.5.3 Marketing your work
11.6 Conclusion
12 Testing your code
12.1 Unit testing
12.2 Assertive programming
12.3 Test-driven development
12.4 Code coverage
12.5 Conclusion
13 Build automation with targets
13.1 Introduction
13.2 {targets} quick-start
13.2.1 _targets.R’s anatomy
13.3 A pipeline is a composition of pure functions
13.4 Handling files
13.5 The dependency graph
13.6 Running the pipeline in parallel
13.7 {targets} and RMarkdown (or Quarto)
13.8 Rewriting our project as a pipeline and {renv} redux
13.9 Some little tips before concluding
13.9.1 Load every target at once
13.9.2 Get metadata information on your pipeline
13.9.3 Make a target (or the whole pipeline) outdated
13.9.4 Customize the network’s visualisation
13.9.5 Use targets from one pipeline in another project
13.9.6 Understanding this cryptic error message
13.10 Conclusion
14 Reproducible analytical pipelines with Docker
14.1 What is Docker?
14.2 A primer on Linux
14.3 First steps with Docker
14.4 The Rocker project
14.5 Dockerizing projects
14.6 Dockerizing development environments
14.6.1 Creating a base image for development
14.6.2 Sharing images through Docker Hub
14.6.3 Sharing a compressed archive of your image
14.7 Some issues of relying on Docker
14.7.1 The problems of relying so much on Docker
14.7.2 Is Docker enough?
14.8 Conclusion
15 Continuous integration and continuous deployment
15.1 CI/CD quickstart for R programmers (and others)
15.2 Running a RAP using GitHub Actions
15.3 Craft a dockerized dev env with GA
15.4 Run a RAP using a dockerized dev env on GA
15.5 Conclusion
16 Conclusion of part 2
17 The end
“So what?”
References