The amount of data being generated today is staggering, and it keeps growing. Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems with PySpark, Spark's Python API, while following best practices in Spark programming.
Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques, including classification, clustering, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.
If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.
• Familiarize yourself with Spark's programming model and ecosystem
• Learn general approaches in data science
• Examine complete implementations that analyze large public datasets
• Discover which machine learning tools make sense for particular problems
• Explore code that can be adapted to many uses (a short sketch follows this list)
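For a flavor of the programming model the chapters assume, here is a minimal, hypothetical PySpark sketch; it is not taken from the book, and the data and column names are invented. It builds a small DataFrame and computes grouped summary statistics, the kind of DataFrame API work Chapter 2 covers in depth.

```python
# Minimal PySpark sketch (illustrative only; not from the book).
# Assumes a local Spark installation; data and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-preview").getOrCreate()

# A tiny in-memory DataFrame; the book's chapters load large public datasets.
df = spark.createDataFrame(
    [("alice", 34.0), ("bob", 12.5), ("alice", 7.25)],
    ["user", "amount"],
)

# Grouped aggregation in the style of the book's summary-statistics examples.
summary = df.groupBy("user").agg(
    F.count("*").alias("n"),
    F.mean("amount").alias("avg_amount"),
)
summary.show()

spark.stop()
```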
Author(s): Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
Edition: 1
Publisher: O'Reilly Media
Year: 2022
Language: English
Commentary: Vector PDF
Pages: 233
City: Sebastopol, CA
Tags: Machine Learning; Data Analysis; Deep Learning; Decision Trees; Anomaly Detection; Big Data; Recommender Systems; Clustering; Predictive Models; Apache Spark; Risk Assessment; Finance; Geospatial Data; Genomics; PySpark; Random Forest; Latent Dirichlet Allocation; Spark NLP; MLflow
Cover
Copyright
Table of Contents
Preface
Why Did We Write This Book Now?
How This Book Is Organized
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Analyzing Big Data
Working with Big Data
Introducing Apache Spark and PySpark
Components
PySpark
Ecosystem
Spark 3.0
PySpark Addresses Challenges of Data Science
Where to Go from Here
Chapter 2. Introduction to Data Analysis with PySpark
Spark Architecture
Installing PySpark
Setting Up Our Data
Analyzing Data with the DataFrame API
Fast Summary Statistics for DataFrames
Pivoting and Reshaping DataFrames
Joining DataFrames and Selecting Features
Scoring and Model Evaluation
Where to Go from Here
Chapter 3. Recommending Music and the Audioscrobbler Dataset
Setting Up the Data
Our Requirements for a Recommender System
Alternating Least Squares Algorithm
Preparing the Data
Building a First Model
Spot Checking Recommendations
Evaluating Recommendation Quality
Computing AUC
Hyperparameter Selection
Making Recommendations
Where to Go from Here
Chapter 4. Making Predictions with Decision Trees and Decision Forests
Decision Trees and Forests
Preparing the Data
Our First Decision Tree
Decision Tree Hyperparameters
Tuning Decision Trees
Categorical Features Revisited
Random Forests
Making Predictions
Where to Go from Here
Chapter 5. Anomaly Detection with K-means Clustering
K-means Clustering
Identifying Anomalous Network Traffic
KDD Cup 1999 Dataset
A First Take on Clustering
Choosing k
Visualization with SparkR
Feature Normalization
Categorical Variables
Using Labels with Entropy
Clustering in Action
Where to Go from Here
Chapter 6. Understanding Wikipedia with LDA and Spark NLP
Latent Dirichlet Allocation
LDA in PySpark
Getting the Data
Spark NLP
Setting Up Your Environment
Parsing the Data
Preparing the Data Using Spark NLP
TF-IDF
Computing the TF-IDFs
Creating Our LDA Model
Where to Go from Here
Chapter 7. Geospatial and Temporal Data Analysis on Taxi Trip Data
Preparing the Data
Converting Datetime Strings to Timestamps
Handling Invalid Records
Geospatial Analysis
Intro to GeoJSON
GeoPandas
Sessionization in PySpark
Building Sessions: Secondary Sorts in PySpark
Where to Go from Here
Chapter 8. Estimating Financial Risk
Terminology
Methods for Calculating VaR
Variance-Covariance
Historical Simulation
Monte Carlo Simulation
Our Model
Getting the Data
Preparing the Data
Determining the Factor Weights
Sampling
The Multivariate Normal Distribution
Running the Trials
Visualizing the Distribution of Returns
Where to Go from Here
Chapter 9. Analyzing Genomics Data and the BDG Project
Decoupling Storage from Modeling
Setting Up ADAM
Introduction to Working with Genomics Data Using ADAM
File Format Conversion with the ADAM CLI
Ingesting Genomics Data Using PySpark and ADAM
Predicting Transcription Factor Binding Sites from ENCODE Data
Where to Go from Here
Chapter 10. Image Similarity Detection with Deep Learning and PySpark LSH
PyTorch
Installation
Preparing the Data
Resizing Images Using PyTorch
Deep Learning Model for Vector Representation of Images
Image Embeddings
Import Image Embeddings into PySpark
Image Similarity Search Using PySpark LSH
Nearest Neighbor Search
Where to Go from Here
Chapter 11. Managing the Machine Learning Lifecycle with MLflow
Machine Learning Lifecycle
MLflow
Experiment Tracking
Managing and Serving ML Models
Creating and Using MLflow Projects
Where to Go from Here
Index
About the Authors
Colophon