This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 80 tools, useful whether you work with Windows, macOS, or Linux.
You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, and engineers; software and machine learning engineers; and system administrators.
- Obtain data from websites, APIs, databases, and spreadsheets
- Perform scrub operations on text, CSV, HTML, XML, and JSON files
- Explore data, compute descriptive statistics, and create visualizations
- Manage your data science workflow
- Create reusable command-line tools from one-liners and existing Python or R code
- Parallelize and distribute data-intensive pipelines
- Model data with dimensionality reduction, clustering, regression, and classification algorithms
Author(s): Jeroen Janssens
Edition: 2
Publisher: O'Reilly Media
Year: 2021
Language: English
Pages: 282
Tags: Data Manipulation; Command Line Data Manipulation; Scripting; Curl; Make
Cover
Copyright
Table of Contents
Foreword
Preface
What to Expect from This Book
Changes for the Second Edition
How to Read This Book
Who This Book Is For
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments for the Second Edition (2021)
Acknowledgments for the First Edition (2014)
Chapter 1. Introduction
Data Science Is OSEMN
Obtaining Data
Scrubbing Data
Exploring Data
Modeling Data
Interpreting Data
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
The Command Line Is Agile
The Command Line Is Augmenting
The Command Line Is Scalable
The Command Line Is Extensible
The Command Line Is Ubiquitous
Summary
For Further Exploration
Chapter 2. Getting Started
Getting the Data
Installing the Docker Image
Essential Unix Concepts
The Environment
Executing a Command-Line Tool
Five Types of Command-Line Tools
Combining Command-Line Tools
Redirecting Input and Output
Working with Files and Directories
Managing Output
Help!
Summary
For Further Exploration
Chapter 3. Obtaining Data
Overview
Copying Local Files to the Docker Container
Downloading from the Internet
Introducing curl
Saving
Other Protocols
Following Redirects
Decompressing Files
Converting Microsoft Excel Spreadsheets to CSV
Querying Relational Databases
Calling Web APIs
Authentication
Streaming APIs
Summary
For Further Exploration
Chapter 4. Creating Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Step 1: Create a File
Step 2: Give Permission to Execute
Step 3: Define a Shebang
Step 4: Remove the Fixed Input
Step 5: Add Arguments
Step 6: Extend Your PATH
Creating Command-Line Tools with Python and R
Porting the Shell Script
Processing Streaming Data from Standard Input
Summary
For Further Exploration
Chapter 5. Scrubbing Data
Overview
Transformations, Transformations Everywhere
Plain Text
Filtering Lines
Extracting Values
Replacing and Deleting Values
CSV
Bodies and Headers and Columns, Oh My!
Performing SQL Queries on CSV
Extracting and Reordering Columns
Filtering Rows
Merging Columns
Combining Multiple CSV Files
Working with XML/HTML and JSON
Summary
For Further Exploration
Chapter 6. Project Management with Make
Overview
Introducing Make
Running Tasks
Building, for Real
Adding Dependencies
Summary
For Further Exploration
Chapter 7. Exploring Data
Overview
Inspecting Data and Its Properties
Header or Not, Here I Come
Inspect All the Data
Feature Names and Data Types
Unique Identifiers, Continuous Variables, and Factors
Computing Descriptive Statistics
Column Statistics
R One-Liners on the Shell
Creating Visualizations
Displaying Images from the Command Line
Plotting in a Rush
Creating Bar Charts
Creating Histograms
Creating Density Plots
Happy Little Accidents
Creating Scatter Plots
Creating Trend Lines
Creating Box Plots
Adding Labels
Going Beyond Basic Plots
Summary
For Further Exploration
Chapter 8. Parallel Pipelines
Overview
Serial Processing
Looping Over Numbers
Looping Over Lines
Looping Over Files
Parallel Processing
Introducing GNU Parallel
Specifying Input
Controlling the Number of Concurrent Jobs
Logging and Output
Creating Parallel Tools
Distributed Processing
Get List of Running AWS EC2 Instances
Running Commands on Remote Machines
Distributing Local Data Among Remote Machines
Processing Files on Remote Machines
Summary
For Further Exploration
Chapter 9. Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Introducing Tapkee
Linear and Nonlinear Mappings
Regression with Vowpal Wabbit
Preparing the Data
Training the Model
Testing the Model
Classification with SciKit-Learn Laboratory
Preparing the Data
Running the Experiment
Parsing the Results
Summary
For Further Exploration
Chapter 10. Polyglot Data Science
Overview
Jupyter
Python
R
RStudio
Apache Spark
Summary
For Further Exploration
Chapter 11. Conclusion
Let’s Recap
Three Pieces of Advice
Be Patient
Be Creative
Be Practical
Where to Go from Here
The Command Line
Shell Programming
Python, R, and SQL
APIs
Machine Learning
Getting in Touch
Appendix. List of Command-Line Tools
alias
awk
aws
bash
bat
bc
body
cat
cd
chmod
cols
column
cowsay
cp
csv2vw
csvcut
csvgrep
csvjoin
csvlook
csvquote
csvsort
csvsql
csvstack
csvstat
curl
cut
display
dseq
echo
env
export
fc
find
fold
for
fx
git
grep
gron
head
header
history
hostname
in2csv
jq
json2csv
l
less
ls
make
man
mkdir
mv
nano
nl
parallel
paste
pbc
pip
pup
pwd
python
R
rev
rm
rush
sample
scp
sed
seq
servewd
shuf
skll
sort
split
sponge
sql2csv
ssh
sudo
tail
tapkee
tar
tee
telnet
tldr
tr
tree
trim
ts
type
uniq
unpack
unrar
unzip
vw
wc
which
xml2json
xmlstarlet
xsv
zcat
zsh
Index
About the Author
Colophon