Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses

The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this open source distributed SQL query engine can be challenging even for the most experienced engineers. With this practical book, data engineers and architects, platform engineers, cloud engineers, and software engineers will learn how to use Presto operations at your organization to derive insights on datasets wherever they reside. Authors Angelica Lo Duca, Tim Meehan, Vivek Bharathan, and Ying Su explain what Presto is, where it came from, and how it differs from other data warehousing solutions. You'll discover why Facebook, Uber, Alibaba Cloud, Hewlett Packard Enterprise, IBM, Intel, and many more use Presto and how you can quickly deploy Presto in production.

With this book, you will:
• Learn how to install and configure Presto
• Use Presto with business intelligence tools
• Understand how to connect Presto to a variety of data sources
• Extend Presto for real-time business insight
• Learn how to apply best practices and tuning
• Get troubleshooting tips for logs, error messages, and more
• Explore Presto's architectural concepts and usage patterns
• Understand Presto security and administration

Author(s): Angelica Lo Duca, Tim Meehan, Vivek Bharathan, Ying Su
Edition: 1
Publisher: O’Reilly Media
Year: 2023

Language: English
Commentary: Publisher's PDF
Pages: 191
City: Sebastopol, CA
Tags: Analytics; Databases; Security; SQL; Scalability; Performance Tuning; Data Lake; Data Warehouse; Presto

Copyright
Table of Contents
Preface
Why We Wrote This Book
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Angelica Lo Duca
Tim Meehan
Vivek Bharathan
Ying Su
Chapter 1. Introduction to Presto
Data Warehouses and Data Lakes
The Role of Presto in a Data Lake
Presto Origins and Design Considerations
High Performance
High Scalability
Compliance with the ANSI SQL Standard
Federation of Data Sources
Running in the Cloud
Presto Architecture and Core Components
Alternatives to Presto
Apache Impala
Apache Hive
Spark SQL
Trino
Presto Use Cases
Reporting and Dashboarding
Ad Hoc Querying
ETL Using SQL
Data Lakehouse
Real-Time Analytics with Real-Time Databases
Introducing Our Case Study
Conclusion
Chapter 2. Getting Started with Presto
Presto Manual Installation
Running Presto on Docker
Installing Docker
Presto Docker Image
Building and Running Presto on Docker
The Presto Sandbox
Deploying Presto on Kubernetes
Introducing Kubernetes
Configuring Presto on Kubernetes
Adding a New Catalog
Running the Deployment on Kubernetes
Querying Your Presto Instance
Listing Catalogs
Listing Schemas
Listing Tables
Querying a Table
Conclusion
Chapter 3. Connectors
Service Provider Interface
Connector Architecture
Popular Connectors
Thrift
Writing a Custom Connector
Prerequisites
Plugin and Module
Configuration
Metadata
Input/Output
Deploying Your Connector
Apache Pinot
Setting Up and Configuring Presto
Presto-Pinot Querying in Action
Conclusion
Chapter 4. Client Connectivity
Setting Up the Environment
Presto Client
Docker Image
Kubernetes Node
Connectivity to Presto
REST API
Python
R
JDBC
Node.js
ODBC
Other Presto Client Libraries
Building a Client Dashboard in Python
Setting Up the Client
Building the Dashboard
Conclusion
Chapter 5. Open Data Lakehouse Analytics
The Emergence of the Lakehouse
Data Lakehouse Architecture
Data Lake
File Store
File Format
Table Format
Query Engine
Metadata Management
Data Governance
Data Access Control
Building a Data Lakehouse
Configuring MinIO
Configuring HMS
Configuring Spark
Registering Hudi Tables with HMS
Connecting and Querying Presto
Conclusion
Chapter 6. Presto Administration
Introducing Presto Administration
Configuration
Properties
Sessions
JVM
Monitoring
Console
REST API
Metrics
Management
Resource Groups
Verifiers
Session Properties Managers
Namespace Functions
Conclusion
Chapter 7. Understanding Security in Presto
Introducing Presto Security
Building Secure Communication in Presto
Encryption
Keystore Management
Configuring HTTPS/TLS
Authentication
File-Based Authentication
LDAP
Kerberos
Creating a Custom Authenticator
Authorization
Authorizing Access to the Presto REST API
Configuring System Access Control
Authorization Through Apache Ranger
Conclusion
Chapter 8. Performance Tuning
Introducing Performance Tuning
Reasons for Performance Tuning
The Performance Tuning Life Cycle
Query Execution Model
Approaches for Performance Tuning in Presto
Resource Allocation
Storage
Query Optimization
Aria Scan
Table Scanning
Repartitioning
Implementing Performance Tuning
Building and Importing the Sample CSV Table in MinIO
Converting the CSV Table in ORC
Defining the Tuning Parameters
Running Tests
Conclusion
Chapter 9. Operating Presto at Scale
Introducing Scalability
Reasons to Scale Presto
Common Issues
Design Considerations
Availability
Manageability
Performance
Protection
Configuration
How to Scale Presto
Multiple Coordinators
Presto on Spark
Spilling
Using a Cloud Service
Conclusion
Index
About the Authors
Colophon