Learning and Operating Presto: Fast, Reliable SQL for Data Analytics and Lakehouses

The Presto community has mushroomed since its origins at Facebook in 2012. But ramping up this open source distributed SQL query engine can be challenging even for the most experienced engineers. With this practical book, data engineers and architects, platform engineers, cloud engineers, and software engineers will learn how to use Presto at their organizations to derive insights on datasets wherever they reside.

Authors Angelica Lo Duca, Tim Meehan, Vivek Bharathan, and Ying Su explain what Presto is, where it came from, and how it differs from other data warehousing solutions. You'll discover why Facebook, Uber, Alibaba Cloud, Hewlett Packard Enterprise, IBM, Intel, and many more use Presto and how you can quickly deploy Presto in production.

With this book, you will:

  • Learn how to install and configure Presto
  • Use Presto with business intelligence tools
  • Understand how to connect Presto to a variety of data...

    Author(s): Angelica Lo Duca, Tim Meehan, Vivek Bharathan, Ying Su
    Publisher: O'Reilly Media
    Year: 2023
    Language: English
    Pages: 191

    Preface
        Why We Wrote This Book
        Who This Book Is For
        Conventions Used in This Book
        Using Code Examples
        O’Reilly Online Learning
        How to Contact Us
        Acknowledgments
        Angelica Lo Duca
        Tim Meehan
        Vivek Bharathan
        Ying Su
    1. Introduction to Presto
        Data Warehouses and Data Lakes
        The Role of Presto in a Data Lake
        Presto Origins and Design Considerations
        High Performance
        High Scalability
        Compliance with the ANSI SQL Standard
        Federation of Data Sources
        Running in the Cloud
        Presto Architecture and Core Components
        Alternatives to Presto
        Apache Impala
        Apache Hive
        Spark SQL
        Trino
        Presto Use Cases
        Reporting and Dashboarding
        Ad Hoc Querying
        ETL Using SQL
        Data Lakehouse
        Real-Time Analytics with Real-Time Databases
        Introducing Our Case Study
        Conclusion
    2. Getting Started with Presto
        Presto Manual Installation
        Running Presto on Docker
        Installing Docker
        Presto Docker Image
        Dockerfile
        The etc/ directory
        node.properties
        jvm.config
        config.properties
        log.properties
        catalog/.properties
        Building and Running Presto on Docker
        The Presto Sandbox
        Deploying Presto on Kubernetes
        Introducing Kubernetes
        Configuring Presto on Kubernetes
        presto-coordinator.yaml
        presto-workers.yaml
        presto-config-map.yaml
        presto-secrets.yaml
        Adding a New Catalog
        Running the Deployment on Kubernetes
        Querying Your Presto Instance
        Listing Catalogs
        Listing Schemas
        Listing Tables
        Querying a Table
        Conclusion
    3. Connectors
        Service Provider Interface
        Connector Architecture
        Popular Connectors
        Thrift
        Writing a Custom Connector
        Prerequisites
        Plugin and Module
        ExamplePlugin
        ExampleConnectorFactory
        ExampleModule
        ExampleConnector
        ExampleHandleResolver
        Configuration
        ExampleConfig
        SessionProperties
        TableProperties
        Metadata
        Data model
        Handles
        ExampleMetadata
        ExampleClient
        Input/Output
        ExampleSplitManager
        ExampleSplit
        ExampleRecordSetProvider and ExampleRecordSet
        ExampleRecordCursor
        Deploying Your Connector
        Apache Pinot
        Setting Up and Configuring Presto
        Setting up Pinot
        Configuring Pinot
        Configuring Presto with Pinot
        Presto-Pinot Querying in Action
        Conclusion
    4. Client Connectivity
        Setting Up the Environment
        Presto Client
        Docker Image
        Kubernetes Node
        Connectivity to Presto
        REST API
        Python
        R
        JDBC
        Node.js
        ODBC
        Other Presto Client Libraries
        Building a Client Dashboard in Python
        Setting Up the Client
        Building the Dashboard
        Connecting to and querying Presto
        Preparing the results of the query
        Building the first graph
        Building the second graph
        Conclusion
    5. Open Data Lakehouse Analytics
        The Emergence of the Lakehouse
        Data Lakehouse Architecture
        Data Lake
        File Store
        File Format
        Table Format
        Query Engine
        Metadata Management
        Data Governance
        Data Access Control
        Building a Data Lakehouse
        Configuring MinIO
        Populating MinIO
        Configuring HMS
        Configuring Spark
        Registering Hudi Tables with HMS
        Connecting and Querying Presto
        Conclusion
    6. Presto Administration
        Introducing Presto Administration
        Configuration
        Properties
        How to configure a cluster
        Sessions
        Using sessions
        JVM
        Memory
        Out-of-memory errors
        Garbage collection
        Monitoring
        Console
        Using the console for monitoring
        Using the console for debugging
        Using the console for going over the interactive plan
        REST API
        Metrics
        JMX connector
        REST API
        JMX exporters
        Management
        Resource Groups
        Configuring resource groups
        Resource groups properties
        Example
        Verifiers
        Setting up the system
        Configuring the MySQL database
        Configuring the Presto verifier
        Running a test
        Session Properties Managers
        Configuring a session property manager
        Namespace Functions
        Setting up the system
        Configuring a function
        Running a test
        Conclusion
    7. Understanding Security in Presto
        Introducing Presto Security
        Building Secure Communication in Presto
        Encryption
        Keystore Management
        Configuring HTTPS/TLS
        Running a Presto client
        Running the Presto console
        Authentication
        File-Based Authentication
        Running a Presto client
        Running the Presto console
        LDAP
        Kerberos
        Prerequisites
        Configuring the Presto coordinator and workers
        Configuring the Presto client
        Creating a Custom Authenticator
        Authorization
        Authorizing Access to the Presto REST API
        Configuring System Access Control
        Authorization Through Apache Ranger
        Building a custom audit function
        Conclusion
    8. Performance Tuning
        Introducing Performance Tuning
        Reasons for Performance Tuning
        The Performance Tuning Life Cycle
        Query Execution Model
        Approaches for Performance Tuning in Presto
        Resource Allocation
        Storage
        Query Optimization
        Aria Scan
        Table Scanning
        Repartitioning
        Implementing Performance Tuning
        Building and Importing the Sample CSV Table in MinIO
        Converting the CSV Table in ORC
        Defining the Tuning Parameters
        Running Tests
        Default parameters
        Reducing CPU usage
        Query optimization
        Aria scan
        Conclusion
    9. Operating Presto at Scale
        Introducing Scalability
        Reasons to Scale Presto
        Common Issues
        Design Considerations
        Availability
        Manageability
        Performance
        Protection
        Configuration
        How to Scale Presto
        Multiple Coordinators
        Presto on Spark
        Spilling
        Using a Cloud Service
        Conclusion
    Index