Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs. Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API. • Learn fundamental stream processing concepts and examine different streaming architectures • Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail • Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs • Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms • Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams

Author(s): Gerard Maas, Francois Garillot
Edition: 1
Publisher: O'Reilly Media
Year: 2019

Language: English
Commentary: True PDF
Pages: 452
City: Sebastopol, CA
Tags: Cloud Computing; Machine Learning; Analytics; Apache Spark; Apache Storm; Monitoring; Stream Processing; Apache Kafka; Batch Processing; Clusters; Apache Flink; Spark SQL; Resilient Distributed Datasets; Performance Tuning; Spark Streaming; Lambda Architecture; Distributed Processing; Kappa Architecture; Structured Streaming

Copyright
Table of Contents
Foreword
Preface
Who Should Read This Book?
Installing Spark
Learning Scala
The Way Ahead
Bibliography
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
From Gerard
From François
Part I. Fundamentals of Stream Processing with Apache Spark
Chapter 1. Introducing Stream Processing
What Is Stream Processing?
Batch Versus Stream Processing
The Notion of Time in Stream Processing
The Factor of Uncertainty
Some Examples of Stream Processing
Scaling Up Data Processing
MapReduce
The Lesson Learned: Scalability and Fault Tolerance
Distributed Stream Processing
Stateful Stream Processing in a Distributed System
Introducing Apache Spark
The First Wave: Functional APIs
The Second Wave: SQL
A Unified Engine
Spark Components
Spark Streaming
Structured Streaming
Where Next?
Chapter 2. Stream-Processing Model
Sources and Sinks
Immutable Streams Defined from One Another
Transformations and Aggregations
Window Aggregations
Tumbling Windows
Sliding Windows
Stateless and Stateful Processing
Stateful Streams
An Example: Local Stateful Computation in Scala
A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
Stateless or Stateful Streaming
The Effect of Time
Computing on Timestamped Events
Timestamps as the Provider of the Notion of Time
Event Time Versus Processing Time
Computing with a Watermark
Summary
Chapter 3. Streaming Architectures
Components of a Data Platform
Architectural Models
The Use of a Batch-Processing Component in a Streaming Application
Referential Streaming Architectures
The Lambda Architecture
The Kappa Architecture
Streaming Versus Batch Algorithms
Streaming Algorithms Are Sometimes Completely Different in Nature
Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
Summary
Chapter 4. Apache Spark as a Stream-Processing Engine
The Tale of Two APIs
Spark’s Memory Usage
Failure Recovery
Lazy Evaluation
Cache Hints
Understanding Latency
Throughput-Oriented Processing
Spark’s Polyglot API
Fast Implementation of Data Analysis
To Learn More About Spark
Summary
Chapter 5. Spark’s Distributed Processing Model
Running Apache Spark with a Cluster Manager
Examples of Cluster Managers
Spark’s Own Cluster Manager
Understanding Resilience and Fault Tolerance in a Distributed System
Fault Recovery
Cluster Manager Support for Fault Tolerance
Data Delivery Semantics
Microbatching and One-Element-at-a-Time
Microbatching: An Application of Bulk-Synchronous Processing
One-Record-at-a-Time Processing
Microbatching Versus One-at-a-Time: The Trade-Offs
Bringing Microbatch and One-Record-at-a-Time Closer Together
Dynamic Batch Interval
Structured Streaming Processing Model
The Disappearance of the Batch Interval
Chapter 6. Spark’s Resilience Model
Resilient Distributed Datasets in Spark
Spark Components
Spark’s Fault-Tolerance Guarantees
Task Failure Recovery
Stage Failure Recovery
Driver Failure Recovery
Summary
Appendix A. References for Part I
Part II. Structured Streaming
Chapter 7. Introducing Structured Streaming
First Steps with Structured Streaming
Batch Analytics
Streaming Analytics
Connecting to a Stream
Preparing the Data in the Stream
Operations on Streaming Dataset
Creating a Query
Start the Stream Processing
Exploring the Data
Summary
Chapter 8. The Structured Streaming Programming Model
Initializing Spark
Sources: Acquiring Streaming Data
Available Sources
Transforming Streaming Data
Streaming API Restrictions on the DataFrame API
Sinks: Output the Resulting Data
format
outputMode
queryName
option
options
trigger
start()
Summary
Chapter 9. Structured Streaming in Action
Consuming a Streaming Source
Application Logic
Writing to a Streaming Sink
Summary
Chapter 10. Structured Streaming Sources
Understanding Sources
Reliable Sources Must Be Replayable
Sources Must Provide a Schema
Available Sources
The File Source
Specifying a File Format
Common Options
Common Text Parsing Options (CSV, JSON)
JSON File Source Format
CSV File Source Format
Parquet File Source Format
Text File Source Format
The Kafka Source
Setting Up a Kafka Source
Selecting a Topic Subscription Method
Configuring Kafka Source Options
Kafka Consumer Options
The Socket Source
Configuration
Operations
The Rate Source
Options
Chapter 11. Structured Streaming Sinks
Understanding Sinks
Available Sinks
Reliable Sinks
Sinks for Experimentation
The Sink API
Exploring Sinks in Detail
The File Sink
Using Triggers with the File Sink
Common Configuration Options Across All Supported File Formats
Common Time and Date Formatting (CSV, JSON)
The CSV Format of the File Sink
The JSON File Sink Format
The Parquet File Sink Format
The Text File Sink Format
The Kafka Sink
Understanding the Kafka Publish Model
Using the Kafka Sink
The Memory Sink
Output Modes
The Console Sink
Options
Output Modes
The Foreach Sink
The ForeachWriter Interface
TCP Writer Sink: A Practical ForeachWriter Example
The Moral of this Example
Troubleshooting ForeachWriter Serialization Issues
Chapter 12. Event Time–Based Stream Processing
Understanding Event Time in Structured Streaming
Using Event Time
Processing Time
Watermarks
Time-Based Window Aggregations
Defining Time-Based Windows
Understanding How Intervals Are Computed
Using Composite Aggregation Keys
Tumbling and Sliding Windows
Record Deduplication
Summary
Chapter 13. Advanced Stateful Operations
Example: Car Fleet Management
Understanding Group with State Operations
Internal State Flow
Using MapGroupsWithState
Using FlatMapGroupsWithState
Output Modes
Managing State Over Time
Summary
Chapter 14. Monitoring Structured Streaming Applications
The Spark Metrics Subsystem
Structured Streaming Metrics
The StreamingQuery Instance
Getting Metrics with StreamingQueryProgress
The StreamingQueryListener Interface
Implementing a StreamingQueryListener
Chapter 15. Experimental Areas: Continuous Processing and Machine Learning
Continuous Processing
Understanding Continuous Processing
Using Continuous Processing
Limitations
Machine Learning
Learning Versus Exploiting
Applying a Machine Learning Model to a Stream
Example: Estimating Room Occupancy by Using Ambient Sensors
Online Training
Appendix B. References for Part II
Part III. Spark Streaming
Chapter 16. Introducing Spark Streaming
The DStream Abstraction
DStreams as a Programming Model
DStreams as an Execution Model
The Structure of a Spark Streaming Application
Creating the Spark Streaming Context
Defining a DStream
Defining Output Operations
Starting the Spark Streaming Context
Stopping the Streaming Process
Summary
Chapter 17. The Spark Streaming Programming Model
RDDs as the Underlying Abstraction for DStreams
Understanding DStream Transformations
Element-Centric DStream Transformations
RDD-Centric DStream Transformations
Counting
Structure-Changing Transformations
Summary
Chapter 18. The Spark Streaming Execution Model
The Bulk-Synchronous Architecture
The Receiver Model
The Receiver API
How Receivers Work
The Receiver’s Data Flow
The Internal Data Resilience
Receiver Parallelism
Balancing Resources: Receivers Versus Processing Cores
Achieving Zero Data Loss with the Write-Ahead Log
The Receiverless or Direct Model
Summary
Chapter 19. Spark Streaming Sources
Types of Sources
Basic Sources
Receiver-Based Sources
Direct Sources
Commonly Used Sources
The File Source
How It Works
The Queue Source
How It Works
Using a Queue Source for Unit Testing
A Simpler Alternative to the Queue Source: The ConstantInputDStream
The Socket Source
How It Works
The Kafka Source
Using the Kafka Source
How It Works
Where to Find More Sources
Chapter 20. Spark Streaming Sinks
Output Operations
Built-In Output Operations
print
saveAsxyz
foreachRDD
Using foreachRDD as a Programmable Sink
Third-Party Output Operations
Chapter 21. Time-Based Stream Processing
Window Aggregations
Tumbling Windows
Window Length Versus Batch Interval
Sliding Windows
Sliding Windows Versus Batch Interval
Sliding Windows Versus Tumbling Windows
Using Windows Versus Longer Batch Intervals
Window Reductions
reduceByWindow
reduceByKeyAndWindow
countByWindow
countByValueAndWindow
Invertible Window Aggregations
Slicing Streams
Summary
Chapter 22. Arbitrary Stateful Streaming Computation
Statefulness at the Scale of a Stream
updateStateByKey
Limitation of updateStateByKey
Performance
Memory Usage
Introducing Stateful Computation with mapwithState
Using mapWithState
Event-Time Stream Computation Using mapWithState
Chapter 23. Working with Spark SQL
Spark SQL
Accessing Spark SQL Functions from Spark Streaming
Example: Writing Streaming Data to Parquet
Dealing with Data at Rest
Using Join to Enrich the Input Stream
Join Optimizations
Updating Reference Datasets in a Streaming Application
Enhancing Our Example with a Reference Dataset
Summary
Chapter 24. Checkpointing
Understanding the Use of Checkpoints
Checkpointing DStreams
Recovery from a Checkpoint
Limitations
The Cost of Checkpointing
Checkpoint Tuning
Chapter 25. Monitoring Spark Streaming
The Streaming UI
Understanding Job Performance Using the Streaming UI
Input Rate Chart
Scheduling Delay Chart
Processing Time Chart
Total Delay Chart
Batch Details
The Monitoring REST API
Using the Monitoring REST API
Information Exposed by the Monitoring REST API
The Metrics Subsystem
The Internal Event Bus
Interacting with the Event Bus
Summary
Chapter 26. Performance Tuning
The Performance Balance of Spark Streaming
The Relationship Between Batch Interval and Processing Delay
The Last Moments of a Failing Job
Going Deeper: Scheduling Delay and Processing Delay
Checkpoint Influence in Processing Time
External Factors that Influence the Job’s Performance
How to Improve Performance?
Tweaking the Batch Interval
Limiting the Data Ingress with Fixed-Rate Throttling
Backpressure
Dynamic Throttling
Tuning the Backpressure PID
Custom Rate Estimator
A Note on Alternative Dynamic Handling Strategies
Caching
Speculative Execution
Appendix C. References for Part III
Part IV. Advanced Spark Streaming Techniques
Chapter 27. Streaming Approximation and Sampling Algorithms
Exactness, Real Time, and Big Data
Exactness
Real-Time Processing
Big Data
The Exactness, Real-Time, and Big Data triangle
Big Data and Real Time
Approximation Algorithms
Hashing and Sketching: An Introduction
Counting Distinct Elements: HyperLogLog
Role-Playing Exercise: If We Were a System Administrator
Practical HyperLogLog in Spark
Counting Element Frequency: Count Min Sketches
Introducing Bloom Filters
Bloom Filters with Spark
Computing Frequencies with a Count-Min Sketch
Ranks and Quantiles: T-Digest
T-Digest in Spark
Reducing the Number of Elements: Sampling
Random Sampling
Stratified Sampling
Chapter 28. Real-Time Machine Learning
Streaming Classification with Naive Bayes
streamDM Introduction
Naive Bayes in Practice
Training a Movie Review Classifier
Introducing Decision Trees
Hoeffding Trees
Hoeffding Trees in Spark, in Practice
Streaming Clustering with Online K-Means
K-Means Clustering
Online Data and K-Means
The Problem of Decaying Clusters
Streaming K-Means with Spark Streaming
Appendix D. References for Part IV
Part V. Beyond Apache Spark
Chapter 29. Other Distributed Real-Time Stream Processing Systems
Apache Storm
Processing Model
The Storm Topology
The Storm Cluster
Compared to Spark
Apache Flink
A Streaming-First Framework
Compared to Spark
Kafka Streams
Kafka Streams Programming Model
Compared to Spark
In the Cloud
Amazon Kinesis on AWS
Microsoft Azure Stream Analytics
Apache Beam/Google Cloud Dataflow
Chapter 30. Looking Ahead
Stay Plugged In
Seek Help on Stack Overflow
Start Discussions on the Mailing Lists
Attend Conferences
Attend Meetups
Read Books
Contributing to the Apache Spark Project
Appendix E. References for Part V
Index
About the Authors
Colophon