Building Real-Time Analytics Systems: From Events to Insights with Apache Kafka and Apache Pinot

Gain deep insight into real-time analytics, including the features of these systems and the problems they solve. With this practical book, data engineers at organizations that use event-processing systems such as Kafka, Google Pub/Sub, and AWS Kinesis will learn how to analyze data streams in real time. The faster you derive insights, the quicker you can spot changes in your business and act accordingly. Author Mark Needham from StarTree provides an overview of the real-time analytics space and an understanding of what goes into building real-time applications. The book's second part offers a series of hands-on tutorials that show you how to combine multiple software products to build real-time analytics applications for an imaginary pizza delivery service.

You will:
• Learn common architectures for real-time analytics
• Discover how event processing differs from real-time analytics
• Ingest event data from Apache Kafka into Apache Pinot
• Combine event streams with OLTP data using Debezium and Kafka Streams
• Write real-time queries against event data stored in Apache Pinot
• Build a real-time dashboard and order tracking app
• Learn how Uber, Stripe, and Just Eat use real-time analytics

Author(s): Mark Needham
Edition: 1
Publisher: O’Reilly Media
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 218
City: Sebastopol, CA
Tags: Analytics; Stream Processing; Apache Kafka; ZooKeeper; Queries; Geospatial Data; Data Warehouse; Dashboards; Real-Time Systems; Apache Pinot; Debezium; Kafka Streams

Cover
Copyright
Table of Contents
Foreword
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Introduction to Real-Time Analytics
What Is an Event Stream?
Making Sense of Streaming Data
What Is Real-Time Analytics?
Benefits of Real-Time Analytics
New Revenue Streams
Timely Access to Insights
Reduced Infrastructure Cost
Improved Overall Customer Experience
Real-Time Analytics Use Cases
User-Facing Analytics
Personalization
Metrics
Anomaly Detection and Root Cause Analysis
Visualization
Ad Hoc Analytics
Log Analytics/Text Search
Classifying Real-Time Analytics Applications
Internal Versus External Facing
Machine Versus Human Facing
Summary
Chapter 2. The Real-Time Analytics Ecosystem
Defining the Real-Time Analytics Ecosystem
The Classic Streaming Stack
Complex Event Processing
The Big Data Era
The Modern Streaming Stack
Event Producers
Streaming Data Platform
Stream Processing Layer
Serving Layer
Frontend
Summary
Chapter 3. Introducing All About That Dough: Real-Time Analytics on Pizza
Existing Architecture
Setup
MySQL
Apache Kafka
ZooKeeper
Orders Service
Spinning Up the Components
Inspecting the Data
Applications of Real-Time Analytics
Summary
Chapter 4. Querying Kafka with Kafka Streams
What Is Kafka Streams?
What Is Quarkus?
Quarkus Application
Installing the Quarkus CLI
Creating a Quarkus Application
Creating a Topology
Querying the Key-Value Store
Creating an HTTP Endpoint
Running the Application
Querying the HTTP Endpoint
Limitations of Kafka Streams
Summary
Chapter 5. The Serving Layer: Apache Pinot
Why Can’t We Use Another Stream Processor?
Why Can’t We Use a Data Warehouse?
What Is Apache Pinot?
How Does Pinot Model and Store Data?
Schema
Table
Setup
Data Ingestion
Pinot Data Explorer
Indexes
Updating the Web App
Summary
Chapter 6. Building a Real-Time Analytics Dashboard
Dashboard Architecture
What Is Streamlit?
Setup
Building the Dashboard
Summary
Chapter 7. Product Changes Captured with Change Data Capture
Capturing Changes from Operational Databases
Change Data Capture
Why Do We Need CDC?
What Is CDC?
What Are the Strategies for Implementing CDC?
Log-Based Data Capture
Requirements for a CDC System
Debezium
Applying CDC to AATD
Setup
Connecting Debezium to MySQL
Querying the Products Stream
Updating Products
Summary
Chapter 8. Joining Streams with Kafka Streams
Enriching Orders with Kafka Streams
Adding Order Items to Pinot
Updating the Orders Service
Refreshing the Streamlit Dashboard
Summary
Chapter 9. Upserts in the Serving Layer
Order Statuses
Enriched Orders Stream
Upserts in Apache Pinot
Updating the Orders Service
Creating UsersResource
Adding an allUsers Endpoint
Adding an Orders for User Endpoint
Adding an Individual Order Endpoint
Configuring Cross-Origin Resource Sharing
Frontend App
Order Statuses on the Dashboard
Time Spent in Each Order Status
Orders That Might Be Stuck
Summary
Chapter 10. Geospatial Querying
Delivery Statuses
Updating Apache Pinot
Orders
Delivery Statuses
Updating the Orders Service
Individual Orders
Delayed Orders by Area
Consuming the New API Endpoints
Summary
Chapter 11. Production Considerations
Preproduction
Capacity Planning
Data Partitioning
Throughput
Data Retention
Data Granularity
Total Data Size
Replication Factor
Deployment Platform
In-House Skills
Data Privacy and Security
Cost
Control
Postproduction
Monitoring and Alerting
Data Governance
Summary
Chapter 12. Real-Time Analytics in the Real World
Content Recommendation (Professional Social Network)
The Problem
The Solution
Benefits
Operational Analytics (Streaming Service)
The Problem
The Solution
Benefits
Real-Time Ad Analytics (Online Marketplace)
The Problem
The Solution
Benefits
User-Facing Analytics (Collaboration Platform)
The Problem
The Solution
Benefits
Summary
Chapter 13. The Future of Real-Time Analytics
Edge Analytics
Compute-Storage Separation
Data Lakehouses
Real-Time Data Visualization
Streaming Databases
Streaming Data Platform as a Service
Reverse ETL
Summary
Index
About the Author
Colophon