Amazon Redshift: The Definitive Guide: Jump-Start Analytics Using Cloud Data Warehousing

Amazon Redshift powers analytic cloud data warehouses worldwide, from startups to some of the largest enterprise data warehouses available today. This practical guide thoroughly examines this managed service and demonstrates how you can use it to extract value from your data immediately, rather than go through the heavy lifting required to run a typical data warehouse.

Analytic specialists Rajesh Francis, Rajiv Gupta, and Milind Oke detail Amazon Redshift's underlying mechanisms and options to help you explore out-of-the-box automation. Whether you're a data engineer who wants to learn the art of the possible or a DBA looking to take advantage of machine learning-based auto-tuning, this book helps you get the most value from Amazon Redshift.

By understanding Amazon Redshift features, you'll achieve excellent analytic performance at the best price, with the least effort. This book helps you:

  • Build a cloud data strategy around Amazon Redshift as foundational...

    Author(s): Rajesh Francis, Rajiv Gupta, Milind Oke
    Publisher: O'Reilly Media
    Year: 2023
    Language: English
    Pages: 456

    Foreword
    Preface
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
    1. AWS for Data
    Data-Driven Organizations
    Business Use Cases
    New Business Use Cases with Generative AI
    Modern Data Strategy
    Comprehensive Set of Capabilities
    Integrated Set of Tools
    End-to-End Data Governance
    Modern Data Architecture
    Role of Amazon Redshift in a Modern Data Architecture
    Real-World Benefits of Adopting a Modern Data Architecture
    Reference Architecture for Modern Data Architecture
    Data Sourcing
    Extract, Transform, and Load
    Storage
    Storage in the data warehouse
    Storage in the data lake
    Analysis
    Comparing transactional databases, data warehouses, and data lakes
    Data Mesh and Data Fabric
    Data Mesh
    Data Fabric
    Summary
    2. Getting Started with Amazon Redshift
    Amazon Redshift Architecture Overview
    Get Started with Amazon Redshift Serverless
    Creating an Amazon Redshift Serverless Data Warehouse
    Sample Data
    Activate Sample Data Models and Query Using the Query Editor
    When to Use a Provisioned Cluster?
    Creating an Amazon Redshift Provisioned Cluster
    Estimate Your Amazon Redshift Cost
    Amazon Redshift Managed Storage
    Amazon Redshift Serverless Compute Cost
    Setting a different value for the base capacity
    High/frequent usage
    Amazon Redshift Provisioned Compute Cost
    AWS Account Management
    Connecting to Your Amazon Redshift Data Warehouse
    Private/Public VPC and Secure Access
    Stored Password
    Temporary Credentials
    Federated User
    SAML-Based Authentication from an Identity Provider
    Native IdP Integration
    Amazon Redshift Data API
    Querying a Database Using the Query Editor V2
    Federated user
    Temporary credentials
    Database username and password
    AWS Secrets Manager
    Business Intelligence Using Amazon QuickSight
    Connecting to Amazon Redshift Using JDBC/ODBC
    Summary
    3. Setting Up Your Data Models and Ingesting Data
    Data Lake First Versus Data Warehouse First Strategy
    Data Lake First Strategy
    Data Warehouse First Strategy
    Deciding On a Strategy
    Defining Your Data Model
    Database Schemas, Users, and Groups
    Star Schema, Denormalized, Normalized
    Student Information Learning Analytics Dataset
    Create Data Models for Student Information Learning Analytics Dataset
    Load Batch Data into Amazon Redshift
    Using the COPY Command
    Ingest Data for the Student Learning Analytics Dataset
    Building a Star Schema
    Continuous File Ingestion from Amazon S3
    Using AWS Glue for Transformations
    Manual Loading Using SQL Commands
    Using the Query Editor V2
    Load Real-Time and Near Real-Time Data
    Near Real-Time Replication Using AWS Database Migration Service
    Amazon Aurora Zero-ETL Integration with Amazon Redshift
    Using Amazon AppFlow
    Streaming Ingestion
    Steps to get started with streaming ingestion
    Important considerations and best practices
    Optimize Your Data Structures
    Automatic Table Optimization and Autonomics
    Distribution Style
    Sort Key
    Compression Encoding
    Summary
    4. Data Transformation Strategies
    Comparing ELT and ETL Strategies
    In-Database Transformation
    Semistructured Data
    User-Defined Functions
    Stored Procedures
    Scheduling and Orchestration
    Access All Your Data
    External Amazon S3 Data
    External Operational Data
    External Amazon Redshift Data
    External Transformation
    AWS Glue
    Register Amazon Redshift target connection
    Build and run your AWS Glue job
    Summary
    5. Scaling and Performance Optimizations
    Scale Storage
    Autoscale Your Serverless Data Warehouse
    Scale Your Provisioned Data Warehouse
    Evolving Compute Demand
    Predictable workload changes
    Unpredictable Workload Changes
    WLM, Queues, and QMR
    Queue Assignment
    Short Query Acceleration
    Query Monitoring Rules
    Automatic WLM
    Manual WLM
    Parameter Group
    WLM Dynamic Memory Allocation
    Materialized Views
    Autonomics
    Auto Table Optimizer and Smart Defaults
    Auto Vacuum
    Auto Vacuum Sort
    Auto Analyze
    Auto Materialized Views (AutoMV)
    Amazon Redshift Advisor
    Workload Isolation
    Additional Optimizations for Achieving the Best Price and Performance
    Database Versus Data Warehouse
    Amazon Redshift Serverless
    Multi-Warehouse Environment
    AWS Data Exchange
    Table Design
    Indexes Versus Zone Maps
    Drivers
    Simplify ETL
    Query Editor V2
    Query Tuning
    Query Processing
    Query planning and execution workflow
    Query stages and system tables
    Understanding the query plan
    Factors affecting query performance
    Analyzing Queries
    Reviewing query alerts
    Analyzing the query plan
    Identifying Queries for Performance Tuning
    Summary
    6. Amazon Redshift Machine Learning
    Machine Learning Cycle
    Amazon Redshift ML
    Amazon Redshift ML Flexibility
    Getting Started with Amazon Redshift ML
    Machine Learning Techniques
    Supervised Learning Techniques
    Unsupervised Learning Techniques
    Machine Learning Algorithms
    Integration with Amazon SageMaker Autopilot
    Create Model
    Label Probability
    Explain Model
    Using Amazon Redshift ML to Predict Student Outcomes
    Amazon SageMaker Integration with Amazon Redshift
    Integration with Amazon SageMaker—Bring Your Own Model (BYOM)
    BYOM Local
    BYOM Remote
    Amazon Redshift ML Costs
    Summary
    7. Collaboration with Data Sharing
    Amazon Redshift Data Sharing Overview
    Data Sharing Use Cases
    Key Concepts of Data Sharing
    How to Use Data Sharing
    Sharing Data Within the Same Account
    Sharing Data Across Accounts Using Cross-Account Data Sharing
    Analytics as a Service Use Case with Multi-Tenant Storage Patterns
    Scaling Your Multi-tenant Architecture Using Data Sharing
    Multi-tenant Storage Patterns Using Data Sharing
    Pool model
    Creating database views in the producer
    Creating datashares in producer and granting usage to the consumer
    Using Role-Level Security
    Bridge model
    Creating database schemas and tables in the producer
    Creating datashares in the producer and granting usage to the consumer
    Silo model
    Creating databases and datashares in the producer
    Creating datashares in the producer and granting usage to the consumer
    External Data Sharing with AWS ADX Integration
    Publishing a Data Product
    Subscribing to a Published Data Product
    Considerations When Using AWS Data Exchange for Amazon Redshift
    Query from the Data Lake and Unload to the Data Lake
    Amazon DataZone to Discover and Share Data
    Use Cases for a Data Mesh Architecture with Amazon DataZone
    Key Capabilities and Use Cases for Amazon DataZone
    Amazon DataZone Integrations with Amazon Redshift and Other AWS Services
    Components and Capabilities of Amazon DataZone
    Business data catalog
    Projects
    Data governance and access control
    Data portal
    Getting Started with Amazon DataZone
    Step 1: Create the domain and data portal
    Step 2: Create a producer project
    Step 3: Produce data for publishing in Amazon DataZone
    Step 4: Publish a data product to the catalog
    Step 5: Create a consumer project
    Step 6: Discovering and consuming data in Amazon DataZone
    Step 7: Approve access to a published data asset as a producer
    Step 8: Analyze a published data asset as a consumer
    Security in Amazon DataZone
    Using Lake Formation-based authorization
    Encryption
    Implement least privilege access
    Use IAM roles
    Summary
    8. Securing and Governing Data
    Object-Level Access Controls
    Object Ownership
    Default Privileges
    Public Schema and Search Path
    Access Controls in Action
    Database Roles
    Database Roles in Action
    Row-Level Security
    Row-Level Security in Action
    Row-Level Security Considerations
    Dynamic Data Masking
    Dynamic Data Masking in Action
    Dynamic Data Masking Considerations
    External Data Access Control
    Associate IAM Roles
    Authorize Assume Role Privileges
    Establish External Schemas
    Lake Formation for Fine-Grained Access Control
    Summary
    9. Migrating to Amazon Redshift
    Migration Considerations
    Retire Versus Retain
    Migration Data Size
    Platform-Specific Transformations Required
    Data Volatility and Availability Requirements
    Selection of Migration and ETL Tools
    Data Movement Considerations
    Domain Name System (DNS)
    Migration Strategies
    One-Step Migration
    Two-Step Migration
    Initial data migration
    Changed data migration
    Iterative Migration
    Migration Tools and Services
    AWS Schema Conversion Tool
    SCT overview
    SCT migration assessment report
    SCT data extraction agents
    Migrating BLOBs to Amazon Redshift
    Data Warehouse Migration Service
    How AWS DMS works
    DMS replication instances
    DMS replication validation
    AWS Snow Family
    AWS Snow Family key features
    AWS Snow Family devices
    AWS Snowball Edge Client
    Database Migration Process
    Step 1: Convert Schema and Subject Area
    Step 2: Initial Data Extraction and Load
    Step 3: Incremental Load Through Data Capture
    Amazon Redshift Migration Tools Considerations
    Accelerate Your Migration to Amazon Redshift
    Macro Conversion
    Case-Insensitive String Comparison
    Recursive Common Table Expressions
    Proprietary Data Types
    Summary
    10. Monitoring and Administration
    Amazon Redshift Monitoring Overview
    Monitoring
    Troubleshooting
    Optimization
    Monitoring Using Console
    Monitoring and Administering Serverless
    Query and database monitoring serverless
    Serverless query and database monitoring
    Serverless query monitoring drill-down query
    Serverless query monitoring drill-down query plan
    Serverless query monitoring drill-down related metrics
    Resource monitoring
    Monitoring Provisioned Data Warehouse Using Console
    Data warehouse performance and resource utilization metrics
    View Performance Data
    CPU utilization
    Percentage disk space used
    Database connections
    Query duration
    Query throughput
    Query and data ingestion performance metrics: Query Monitoring tab
    Query history at data warehouse level
    Database performance for queries
    Workload concurrency
    Monitoring Queries and Loads Across Clusters
    Monitoring queries and loads
    Monitoring top queries
    Identifying Systemic Query Performance Problems
    Monitoring Using Amazon CloudWatch
    Amazon Redshift CloudWatch Metrics
    Monitoring Using System Tables and Views
    Monitoring Serverless Using System Views
    High Availability and Disaster Recovery
    Recovery Time Objective and Recovery Point Objective Considerations
    Multi-AZ Compared to Single-AZ Deployment
    Creating or Converting a Provisioned Data Warehouse with Multi-AZ Configuration
    Creating a new data warehouse with Multi-AZ option
    Migrating an existing data warehouse from Single-AZ to Multi-AZ
    Auto Recovery of Multi-AZ Deployment
    Snapshots, Backup, and Restore
    Snapshots for Backup
    Automated Snapshots
    Manual Snapshots
    Disaster Recovery Using Cross-Region Snapshots
    Using Snapshots for Simple-Replay
    Monitoring Amazon Redshift Using CloudTrail
    Bring Your Own Visualization Tool to Monitor Amazon Redshift
    Monitor Operational Metrics Using System Tables and Amazon QuickSight
    Monitor Operational Metrics Using Grafana Plug-in for Amazon Redshift
    Summary
    Index