Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark

Apache Spark's speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark can learn practical algorithms and examples using PySpark. In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You'll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms, using the PySpark driver and shell script.

With this book, you will:

• Learn how to select Spark transformations for optimized solutions
• Explore powerful transformations and reductions, including reduceByKey(), combineByKey(), and mapPartitions()
• Understand data partitioning for optimized queries
• Build and apply a model using PySpark design patterns
• Apply motif-finding algorithms to graph data
• Analyze graph data by using the GraphFrames API
• Apply PySpark algorithms to clinical and genomics data
• Learn how to use and apply feature engineering in ML algorithms
• Understand and use practical and pragmatic data design patterns
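To give a feel for two of the reductions the book covers, here is a minimal plain-Python sketch of the semantics of reduceByKey() and combineByKey() on (key, value) pairs. This is not Spark code and needs no cluster; the function names (reduce_by_key, combine_by_key) and the two-partition split are illustrative assumptions, chosen only to mirror how combineByKey() builds per-partition combiners before a final merge.

```python
def reduce_by_key(pairs, func):
    """Simulate Spark's reduceByKey(): merge all values sharing a key with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return sorted(acc.items())

def combine_by_key(pairs, create, merge_value, merge_combiners):
    """Simulate combineByKey(): build per-partition combiners, then merge them."""
    # Pretend the data is split into two partitions.
    partitions = [pairs[:len(pairs) // 2], pairs[len(pairs) // 2:]]
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create(value)
        per_partition.append(combiners)
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return sorted(merged.items())

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Word-count-style sum per key, like rdd.reduceByKey(lambda x, y: x + y).
print(reduce_by_key(pairs, lambda x, y: x + y))   # [('a', 9), ('b', 6)]

# Mean per key via (sum, count) combiners, a classic combineByKey() use.
sums_counts = combine_by_key(
    pairs,
    create=lambda v: (v, 1),
    merge_value=lambda c, v: (c[0] + v, c[1] + 1),
    merge_combiners=lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),
)
print([(k, s / n) for k, (s, n) in sums_counts])  # [('a', 3.0), ('b', 3.0)]
```

The (sum, count) pair here is the combiner: unlike reduceByKey(), combineByKey() lets the merged result have a different type than the input values, which is why it suits per-key averages.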

Author(s): Mahmoud Parsian
Edition: 1
Publisher: O'Reilly Media
Year: 2022

Language: English
Commentary: Vector PDF
Pages: 435
City: Sebastopol, CA
Tags: Algorithms; Data Analysis; Apache Spark; Feature Engineering; HDFS; JSON; CSV; Graph Algorithms; PySpark; DNA; Ranking; PageRank; Data Design Patterns

Cover
Copyright
Table of Contents
Foreword
Preface
Why I Wrote This Book
Who This Book Is For
How This Book Is Organized
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. Fundamentals
Chapter 1. Introduction to Spark and PySpark
Why Spark for Data Analytics
The Spark Ecosystem
Spark Architecture
The Power of PySpark
PySpark Architecture
Spark Data Abstractions
RDD Examples
Spark RDD Operations
DataFrame Examples
Using the PySpark Shell
Launching the PySpark Shell
Creating an RDD from a Collection
Aggregating and Merging Values of Keys
Filtering an RDD’s Elements
Grouping Similar Keys
Aggregating Values for Similar Keys
ETL Example with DataFrames
Extraction
Transformation
Loading
Summary
Chapter 2. Transformations in Action
The DNA Base Count Example
The DNA Base Count Problem
FASTA Format
Sample Data
DNA Base Count Solution 1
Step 1: Create an RDD[String] from the Input
Step 2: Define a Mapper Function
Step 3: Find the Frequencies of DNA Letters
Pros and Cons of Solution 1
DNA Base Count Solution 2
Step 1: Create an RDD[String] from the Input
Step 2: Define a Mapper Function
Step 3: Find the Frequencies of DNA Letters
Pros and Cons of Solution 2
DNA Base Count Solution 3
The mapPartitions() Transformation
Step 1: Create an RDD[String] from the Input
Step 2: Define a Function to Handle a Partition
Step 3: Apply the Custom Function to Each Partition
Pros and Cons of Solution 3
Summary
Chapter 3. Mapper Transformations
Data Abstractions and Mappers
What Are Transformations?
Lazy Transformations
The map() Transformation
DataFrame Mapper
The flatMap() Transformation
map() Versus flatMap()
Apply flatMap() to a DataFrame
The mapValues() Transformation
The flatMapValues() Transformation
The mapPartitions() Transformation
Handling Empty Partitions
Benefits and Drawbacks
DataFrames and mapPartitions() Transformation
Summary
Chapter 4. Reductions in Spark
Creating Pair RDDs
Reduction Transformations
Spark’s Reductions
Simple Warmup Example
Solving with reduceByKey()
Solving with groupByKey()
Solving with aggregateByKey()
Solving with combineByKey()
What Is a Monoid?
Monoid and Non-Monoid Examples
The Movie Problem
Input Dataset to Analyze
The aggregateByKey() Transformation
First Solution Using aggregateByKey()
Second Solution Using aggregateByKey()
Complete PySpark Solution Using groupByKey()
Complete PySpark Solution Using reduceByKey()
Complete PySpark Solution Using combineByKey()
The Shuffle Step in Reductions
Shuffle Step for groupByKey()
Shuffle Step for reduceByKey()
Summary
Part II. Working with Data
Chapter 5. Partitioning Data
Introduction to Partitions
Partitions in Spark
Managing Partitions
Default Partitioning
Explicit Partitioning
Physical Partitioning for SQL Queries
Physical Partitioning of Data in Spark
Partition as Text Format
Partition as Parquet Format
How to Query Partitioned Data
Amazon Athena Example
Summary
Chapter 6. Graph Algorithms
Introduction to Graphs
The GraphFrames API
How to Use GraphFrames
GraphFrames Functions and Attributes
GraphFrames Algorithms
Finding Triangles
Motif Finding
Real-World Applications
Gene Analysis
Social Recommendations
Facebook Circles
Connected Components
Analyzing Flight Data
Summary
Chapter 7. Interacting with External Data Sources
Relational Databases
Reading from a Database
Writing a DataFrame to a Database
Reading Text Files
Reading and Writing CSV Files
Reading CSV Files
Writing CSV Files
Reading and Writing JSON Files
Reading JSON Files
Writing JSON Files
Reading from and Writing to Amazon S3
Reading from Amazon S3
Writing to Amazon S3
Reading and Writing Hadoop Files
Reading Hadoop Text Files
Writing Hadoop Text Files
Reading and Writing HDFS SequenceFiles
Reading and Writing Parquet Files
Writing Parquet Files
Reading Parquet Files
Reading and Writing Avro Files
Reading Avro Files
Writing Avro Files
Reading from and Writing to MS SQL Server
Writing to MS SQL Server
Reading from MS SQL Server
Reading Image Files
Creating a DataFrame from Images
Summary
Chapter 8. Ranking Algorithms
Rank Product
Calculation of the Rank Product
Formalizing Rank Product
Rank Product Example
PySpark Solution
PageRank
PageRank’s Iterative Computation
Custom PageRank in PySpark Using RDDs
Custom PageRank in PySpark Using an Adjacency Matrix
PageRank with GraphFrames
Summary
Part III. Data Design Patterns
Chapter 9. Classic Data Design Patterns
Input-Map-Output
RDD Solution
DataFrame Solution
Flat Mapper Functionality
Input-Filter-Output
RDD Solution
DataFrame Solution
DataFrame Filter
Input-Map-Reduce-Output
RDD Solution
DataFrame Solution
Input-Multiple-Maps-Reduce-Output
RDD Solution
DataFrame Solution
Input-Map-Combiner-Reduce-Output
Input-MapPartitions-Reduce-Output
Inverted Index
Problem Statement
Input
Output
PySpark Solution
Summary
Chapter 10. Practical Data Design Patterns
In-Mapper Combining
Basic MapReduce Algorithm
In-Mapper Combining per Record
In-Mapper Combining per Partition
Top-10
Top-N Formalized
PySpark Solution
Finding the Bottom 10
MinMax
Solution 1: Classic MapReduce
Solution 2: Sorting
Solution 3: Spark’s mapPartitions()
The Composite Pattern and Monoids
Monoids
Monoidal and Non-Monoidal Examples
Non-Monoid MapReduce Example
Monoid MapReduce Example
PySpark Implementation of Monoidal Mean
Functors and Monoids
Conclusion on Using Monoids
Binning
Sorting
Summary
Chapter 11. Join Design Patterns
Introduction to the Join Operation
Join in MapReduce
Map Phase
Reducer Phase
Implementation in PySpark
Map-Side Join Using RDDs
Map-Side Join Using DataFrames
Step 1: Create Cache for Airports
Step 2: Create Cache for Airlines
Step 3: Create Facts Table
Step 4: Apply Map-Side Join
Efficient Joins Using Bloom Filters
Introduction to Bloom Filters
A Simple Bloom Filter Example
Bloom Filters in Python
Using Bloom Filters in PySpark
Summary
Chapter 12. Feature Engineering in PySpark
Introduction to Feature Engineering
Adding New Features
Applying UDFs
Creating Pipelines
Binarizing Data
Imputation
Tokenization
Tokenizer
RegexTokenizer
Tokenization with a Pipeline
Standardization
Normalization
Scaling a Column Using a Pipeline
Using MinMaxScaler on Multiple Columns
Normalization Using Normalizer
String Indexing
Applying StringIndexer to a Single Column
Applying StringIndexer to Several Columns
Vector Assembly
Bucketing
Bucketizer
QuantileDiscretizer
Logarithm Transformation
One-Hot Encoding
TF-IDF
FeatureHasher
SQLTransformer
Summary
Index
About the Author
Colophon