Learn how to build end-to-end scalable machine learning solutions with Apache Spark. In this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing practitioners to collaborate and understand each other better.
Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem, including Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.
You will:
• Explore machine learning fundamentals, including distributed computing concepts and terminology
• Manage the ML lifecycle with MLflow
• Ingest data and perform basic preprocessing with Spark
• Explore feature engineering and use Spark to extract features
• Train a model with MLlib and build a pipeline to reproduce it (see the sketch after this list)
• Build a data system that combines the power of Spark with deep learning
• Work through a step-by-step example of distributed TensorFlow
• Scale machine learning with PyTorch and explore its internal architecture
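
To give a flavor of how these pieces fit together, here is a minimal sketch (not taken from the book) of an MLlib training pipeline whose run is tracked with MLflow. It assumes the standard PySpark and MLflow APIs; the toy data and column names ("text", "label") are hypothetical.

    # Minimal sketch: train an MLlib model in a reproducible Pipeline and
    # track the run with MLflow. Toy data and column names are hypothetical.
    import mlflow
    import mlflow.spark
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-pipeline-sketch").getOrCreate()

    # Hypothetical training data: free text plus a binary label.
    train = spark.createDataFrame(
        [("spark is fast", 1.0), ("slow mail delivery", 0.0)],
        ["text", "label"],
    )

    # Chain featurization and the estimator so the whole flow is reproducible.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)  # reads "features"/"label" by default
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

    with mlflow.start_run():
        model = pipeline.fit(train)             # fit all stages in order
        mlflow.log_param("max_iter", 10)        # record the hyperparameter
        mlflow.spark.log_model(model, "model")  # persist the PipelineModel

The fitted PipelineModel logged this way can later be reloaded for batch, streaming, or UDF-based inference, the deployment paths Chapter 10 covers.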
Author(s): Adi Polak
Edition: 1
Publisher: O'Reilly Media
Year: 2023
Language: English
Commentary: Publisher's PDF
Pages: 291
City: Sebastopol, CA
Tags: Machine Learning; Unsupervised Learning; Supervised Learning; Python; Clustering; Apache Spark; Feature Engineering; TensorFlow; Distributed Systems; Monitoring; Pipelines; Deployment; Hyperparameter Tuning; scikit-learn; Ensemble Learning; PyTorch; PySpark; Spark MLlib; Descriptive Statistics; Workflows; Data Ingestion; Data Preprocessing; MLflow; Petastorm
Cover
Copyright
Table of Contents
Preface
Who Should Read This Book?
Do You Need Distributed Machine Learning?
Navigating This Book
What Is Not Covered
The Environment and Tools
The Tools
The Datasets
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Distributed Machine Learning Terminology and Concepts
The Stages of the Machine Learning Workflow
Tools and Technologies in the Machine Learning Pipeline
Distributed Computing Models
General-Purpose Models
Dedicated Distributed Computing Models
Introduction to Distributed Systems Architecture
Centralized Versus Decentralized Systems
Interaction Models
Communication in a Distributed Setting
Introduction to Ensemble Methods
High Versus Low Bias
Types of Ensemble Methods
Distributed Training Topologies
The Challenges of Distributed Machine Learning Systems
Performance
Resource Management
Fault Tolerance
Privacy
Portability
Setting Up Your Local Environment
Chapters 2–6 Tutorials Environment
Chapters 7–10 Tutorials Environment
Summary
Chapter 2. Introduction to Spark and PySpark
Apache Spark Architecture
Intro to PySpark
Apache Spark Basics
Software Architecture
PySpark and Functional Programming
Executing PySpark Code
pandas DataFrames Versus Spark DataFrames
Scikit-Learn Versus MLlib
Summary
Chapter 3. Managing the Machine Learning Experiment Lifecycle with MLflow
Machine Learning Lifecycle Management Requirements
What Is MLflow?
Software Components of the MLflow Platform
Users of the MLflow Platform
MLflow Components
MLflow Tracking
MLflow Projects
MLflow Models
MLflow Model Registry
Using MLflow at Scale
Summary
Chapter 4. Data Ingestion, Preprocessing, and Descriptive Statistics
Data Ingestion with Spark
Working with Images
Working with Tabular Data
Preprocessing Data
Preprocessing Versus Processing
Why Preprocess the Data?
Data Structures
MLlib Data Types
Preprocessing with MLlib Transformers
Preprocessing Image Data
Save the Data and Avoid the Small Files Problem
Descriptive Statistics: Getting a Feel for the Data
Calculating Statistics
Descriptive Statistics with Spark Summarizer
Data Skewness
Correlation
Summary
Chapter 5. Feature Engineering
Features and Their Impact on Models
MLlib Featurization Tools
Extractors
Selectors
Example: Word2Vec
The Image Featurization Process
Understanding Image Manipulation
Extracting Features with Spark APIs
The Text Featurization Process
Bag-of-Words
TF-IDF
N-Gram
Additional Techniques
Enriching the Dataset
Summary
Chapter 6. Training Models with Spark MLlib
Algorithms
Supervised Machine Learning
Classification
Regression
Unsupervised Machine Learning
Frequent Pattern Mining
Clustering
Evaluating
Supervised Evaluators
Unsupervised Evaluators
Hyperparameters and Tuning Experiments
Building a Parameter Grid
Splitting the Data into Training and Test Sets
Cross-Validation: A Better Way to Test Your Models
Machine Learning Pipelines
Constructing a Pipeline
How Does Splitting Work with the Pipeline API?
Persistence
Summary
Chapter 7. Bridging Spark and Deep Learning Frameworks
The Two Clusters Approach
Implementing a Dedicated Data Access Layer
Features of a DAL
Selecting a DAL
What Is Petastorm?
SparkDatasetConverter
Petastorm as a Parquet Store
Project Hydrogen
Barrier Execution Mode
Accelerator-Aware Scheduling
A Brief Introduction to the Horovod Estimator API
Summary
Chapter 8. TensorFlow Distributed Machine Learning Approach
A Quick Overview of TensorFlow
What Is a Neural Network?
TensorFlow Cluster Process Roles and Responsibilities
Loading Parquet Data into a TensorFlow Dataset
An Inside Look at TensorFlow’s Distributed Machine Learning Strategies
ParameterServerStrategy
CentralStorageStrategy: One Machine, Multiple Processors
MirroredStrategy: One Machine, Multiple Processors, Local Copy
MultiWorkerMirroredStrategy: Multiple Machines, Synchronous
TPUStrategy
What Things Change When You Switch Strategies?
Training APIs
Keras API
Custom Training Loop
Estimator API
Putting It All Together
Troubleshooting
Summary
Chapter 9. PyTorch Distributed Machine Learning Approach
A Quick Overview of PyTorch Basics
Computation Graph
PyTorch Mechanics and Concepts
PyTorch Distributed Strategies for Training Models
Introduction to PyTorch’s Distributed Approach
Distributed Data-Parallel Training
RPC-Based Distributed Training
Communication Topologies in PyTorch (c10d)
What Can We Do with PyTorch’s Low-Level APIs?
Loading Data with PyTorch and Petastorm
Troubleshooting Guidance for Working with Petastorm and Distributed PyTorch
The Enigma of Mismatched Data Types
The Mystery of Straggling Workers
How Does PyTorch Differ from TensorFlow?
Summary
Chapter 10. Deployment Patterns for Machine Learning Models
Deployment Patterns
Pattern 1: Batch Prediction
Pattern 2: Model-in-Service
Pattern 3: Model-as-a-Service
Determining Which Pattern to Use
Production Software Requirements
Monitoring Machine Learning Models in Production
Data Drift
Model Drift, Concept Drift
Distributional Domain Shift (the Long Tail)
What Metrics Should I Monitor in Production?
How Do I Measure Changes Using My Monitoring System?
What It Looks Like in Production
The Production Feedback Loop
Deploying with MLlib
Production Machine Learning Pipelines with Structured Streaming
Deploying with MLflow
Defining an MLflow Wrapper
Deploying the Model as a Microservice
Loading the Model as a Spark UDF
How to Develop Your System Iteratively
Summary
Index
About the Author
Colophon