With this practical book, AI and machine learning practitioners will learn how to successfully build and deploy data science projects on Amazon Web Services. The Amazon AI and machine learning stack unifies data science, data engineering, and application development to help level up your skills. This guide shows you how to build and run pipelines in the cloud, then integrate the results into applications in minutes instead of days. Throughout the book, authors Chris Fregly and Antje Barth demonstrate how to reduce cost and improve performance.
• Apply the Amazon AI and ML stack to real-world use cases for natural language processing, computer vision, fraud detection, conversational devices, and more
• Use automated machine learning to implement a specific subset of use cases with SageMaker Autopilot
• Dive deep into the complete model development lifecycle for a BERT-based NLP use case including data ingestion, analysis, model training, and deployment
• Tie everything together into a repeatable machine learning operations pipeline
• Explore real-time ML, anomaly detection, and streaming analytics on data streams with Amazon Kinesis and Managed Streaming for Apache Kafka
• Learn security best practices for data science projects and workflows including identity and access management, authentication, authorization, and more
Author(s): Chris Fregly, Antje Barth
Edition: 1
Publisher: O'Reilly Media
Year: 2021
Language: English
Commentary: Vector PDF
Pages: 524
City: Sebastopol, CA
Tags: Amazon Web Services;Natural Language Processing;Data Science;Python;Apache Spark;Pipelines;Quantum Computing;Amazon Rekognition;AutoML;Data Lake;Data Warehouse;AWS Lambda;Amazon SageMaker;Amazon Athena;Cost Optimization;Amazon Lex;Amazon Redshift;MLOps;Data Engineering;Data Ingestion;Data Preparation;Model Training;Kubeflow;Amazon Neptune;Data Exploration;Amazon Forecast;Amazon Fraud Detector;Amazon Macie;Amazon Polly;Amazon Transcribe;Amazon QuickSight;Amazon Aurora;AWS DeepComposer;AWS DeepLens
Copyright
Table of Contents
Preface
Overview of the Chapters
Who Should Read This Book
Other Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Introduction to Data Science on AWS
Benefits of Cloud Computing
Agility
Cost Savings
Elasticity
Innovate Faster
Deploy Globally in Minutes
Smooth Transition from Prototype to Production
Data Science Pipelines and Workflows
Amazon SageMaker Pipelines
AWS Step Functions Data Science SDK
Kubeflow Pipelines
Managed Workflows for Apache Airflow on AWS
MLflow
TensorFlow Extended
Human-in-the-Loop Workflows
MLOps Best Practices
Operational Excellence
Security
Reliability
Performance Efficiency
Cost Optimization
Amazon AI Services and AutoML with Amazon SageMaker
Amazon AI Services
AutoML with SageMaker Autopilot
Data Ingestion, Exploration, and Preparation in AWS
Data Ingestion and Data Lakes with Amazon S3 and AWS Lake Formation
Data Analysis with Amazon Athena, Amazon Redshift, and Amazon QuickSight
Evaluate Data Quality with AWS Deequ and SageMaker Processing Jobs
Label Training Data with SageMaker Ground Truth
Data Transformation with AWS Glue DataBrew, SageMaker Data Wrangler, and SageMaker Processing Jobs
Model Training and Tuning with Amazon SageMaker
Train Models with SageMaker Training and Experiments
Built-in Algorithms
Bring Your Own Script (Script Mode)
Bring Your Own Container
Pre-Built Solutions and Pre-Trained Models with SageMaker JumpStart
Tune and Validate Models with SageMaker Hyper-Parameter Tuning
Model Deployment with Amazon SageMaker and AWS Lambda Functions
SageMaker Endpoints
SageMaker Batch Transform
Serverless Model Deployment with AWS Lambda
Streaming Analytics and Machine Learning on AWS
Amazon Kinesis Streaming
Amazon Managed Streaming for Apache Kafka
Streaming Predictions and Anomaly Detection
AWS Infrastructure and Custom-Built Hardware
SageMaker Compute Instance Types
GPUs and Amazon Custom-Built Compute Hardware
GPU-Optimized Networking and Custom-Built Hardware
Storage Options Optimized for Large-Scale Model Training
Reduce Cost with Tags, Budgets, and Alerts
Summary
Chapter 2. Data Science Use Cases
Innovation Across Every Industry
Personalized Product Recommendations
Recommend Products with Amazon Personalize
Generate Recommendations with Amazon SageMaker and TensorFlow
Generate Recommendations with Amazon SageMaker and Apache Spark
Detect Inappropriate Videos with Amazon Rekognition
Demand Forecasting
Predict Energy Consumption with Amazon Forecast
Predict Demand for Amazon EC2 Instances with Amazon Forecast
Identify Fake Accounts with Amazon Fraud Detector
Enable Privacy-Leak Detection with Amazon Macie
Conversational Devices and Voice Assistants
Speech Recognition with Amazon Lex
Text-to-Speech Conversion with Amazon Polly
Speech-to-Text Conversion with Amazon Transcribe
Text Analysis and Natural Language Processing
Translate Languages with Amazon Translate
Classify Customer-Support Messages with Amazon Comprehend
Extract Resume Details with Amazon Textract and Comprehend
Cognitive Search and Natural Language Understanding
Intelligent Customer Support Centers
Industrial AI Services and Predictive Maintenance
Home Automation with AWS IoT and Amazon SageMaker
Extract Medical Information from Healthcare Documents
Self-Optimizing and Intelligent Cloud Infrastructure
Predictive Auto Scaling for Amazon EC2
Anomaly Detection on Streams of Data
Cognitive and Predictive Business Intelligence
Ask Natural-Language Questions with Amazon QuickSight
Train and Invoke SageMaker Models with Amazon Redshift
Invoke Amazon Comprehend and SageMaker Models from Amazon Aurora SQL Database
Invoke SageMaker Model from Amazon Athena
Run Predictions on Graph Data Using Amazon Neptune
Educating the Next Generation of AI and ML Developers
Build Computer Vision Models with AWS DeepLens
Learn Reinforcement Learning with AWS DeepRacer
Understand GANs with AWS DeepComposer
Program Nature’s Operating System with Quantum Computing
Quantum Bits Versus Digital Bits
Quantum Supremacy and the Quantum Computing Eras
Cracking Cryptography
Molecular Simulations and Drug Discovery
Logistics and Financial Optimizations
Quantum Machine Learning and AI
Programming a Quantum Computer with Amazon Braket
AWS Center for Quantum Computing
Increase Performance and Reduce Cost
Automatic Code Reviews with CodeGuru Reviewer
Improve Application Performance with CodeGuru Profiler
Improve Application Availability with DevOps Guru
Summary
Chapter 3. Automated Machine Learning
Automated Machine Learning with SageMaker Autopilot
Track Experiments with SageMaker Autopilot
Train and Deploy a Text Classifier with SageMaker Autopilot
Train and Deploy with SageMaker Autopilot UI
Train and Deploy a Model with the SageMaker Autopilot Python SDK
Predict with Amazon Athena and SageMaker Autopilot
Train and Predict with Amazon Redshift ML and SageMaker Autopilot
Automated Machine Learning with Amazon Comprehend
Predict with Amazon Comprehend’s Built-in Model
Train and Deploy a Custom Model with the Amazon Comprehend UI
Train and Deploy a Custom Model with the Amazon Comprehend Python SDK
Summary
Chapter 4. Ingest Data into the Cloud
Data Lakes
Import Data into the S3 Data Lake
Describe the Dataset
Query the Amazon S3 Data Lake with Amazon Athena
Access Athena from the AWS Console
Register S3 Data as an Athena Table
Update Athena Tables as New Data Arrives with AWS Glue Crawler
Create a Parquet-Based Table in Athena
Continuously Ingest New Data with AWS Glue Crawler
Build a Lake House with Amazon Redshift Spectrum
Export Amazon Redshift Data to S3 Data Lake as Parquet
Share Data Between Amazon Redshift Clusters
Choose Between Amazon Athena and Amazon Redshift
Reduce Cost and Increase Performance
S3 Intelligent-Tiering
Parquet Partitions and Compression
Amazon Redshift Table Design and Compression
Use Bloom Filters to Improve Query Performance
Materialized Views in Amazon Redshift Spectrum
Summary
Chapter 5. Explore the Dataset
Tools for Exploring Data in AWS
Visualize Our Data Lake with SageMaker Studio
Prepare SageMaker Studio to Visualize Our Dataset
Run a Sample Athena Query in SageMaker Studio
Dive Deep into the Dataset with Athena and SageMaker
Query Our Data Warehouse
Run a Sample Amazon Redshift Query from SageMaker Studio
Dive Deep into the Dataset with Amazon Redshift and SageMaker
Create Dashboards with Amazon QuickSight
Detect Data-Quality Issues with Amazon SageMaker and Apache Spark
SageMaker Processing Jobs
Analyze Our Dataset with Deequ and Apache Spark
Detect Bias in Our Dataset
Generate and Visualize Bias Reports with SageMaker Data Wrangler
Detect Bias with a SageMaker Clarify Processing Job
Integrate Bias Detection into Custom Scripts with SageMaker Clarify Open Source
Mitigate Data Bias by Balancing the Data
Detect Different Types of Drift with SageMaker Clarify
Analyze Our Data with AWS Glue DataBrew
Reduce Cost and Increase Performance
Use a Shared S3 Bucket for Nonsensitive Athena Query Results
Approximate Counts with HyperLogLog
Dynamically Scale a Data Warehouse with AQUA for Amazon Redshift
Improve Dashboard Performance with QuickSight SPICE
Summary
Chapter 6. Prepare the Dataset for Model Training
Perform Feature Selection and Engineering
Select Training Features Based on Feature Importance
Balance the Dataset to Improve Model Accuracy
Split the Dataset into Train, Validation, and Test Sets
Transform Raw Text into BERT Embeddings
Convert Features and Labels to Optimized TensorFlow File Format
Scale Feature Engineering with SageMaker Processing Jobs
Transform with scikit-learn and TensorFlow
Transform with Apache Spark and TensorFlow
Share Features Through SageMaker Feature Store
Ingest Features into SageMaker Feature Store
Retrieve Features from SageMaker Feature Store
Ingest and Transform Data with SageMaker Data Wrangler
Track Artifact and Experiment Lineage with Amazon SageMaker
Understand Lineage-Tracking Concepts
Show Lineage of a Feature Engineering Job
Understand the SageMaker Experiments API
Ingest and Transform Data with AWS Glue DataBrew
Summary
Chapter 7. Train Your First Model
Understand the SageMaker Infrastructure
Introduction to SageMaker Containers
Increase Availability with Compute and Network Isolation
Deploy a Pre-Trained BERT Model with SageMaker JumpStart
Develop a SageMaker Model
Built-in Algorithms
Bring Your Own Script
Bring Your Own Container
A Brief History of Natural Language Processing
BERT Transformer Architecture
Training BERT from Scratch
Masked Language Model
Next Sentence Prediction
Fine Tune a Pre-Trained BERT Model
Create the Training Script
Set Up the Train, Validation, and Test Dataset Splits
Set Up the Custom Classifier Model
Train and Validate the Model
Save the Model
Launch the Training Script from a SageMaker Notebook
Define the Metrics to Capture and Monitor
Configure the Hyper-Parameters for Our Algorithm
Select Instance Type and Instance Count
Putting It All Together in the Notebook
Download and Inspect Our Trained Model from S3
Show Experiment Lineage for Our SageMaker Training Job
Show Artifact Lineage for Our SageMaker Training Job
Evaluate Models
Run Some Ad Hoc Predictions from the Notebook
Analyze Our Classifier with a Confusion Matrix
Visualize Our Neural Network with TensorBoard
Monitor Metrics with SageMaker Studio
Monitor Metrics with CloudWatch Metrics
Debug and Profile Model Training with SageMaker Debugger
Detect and Resolve Issues with SageMaker Debugger Rules and Actions
Profile Training Jobs
Interpret and Explain Model Predictions
Detect Model Bias and Explain Predictions
Detect Bias with a SageMaker Clarify Processing Job
Feature Attribution and Importance with SageMaker Clarify and SHAP
More Training Options for BERT
Convert TensorFlow BERT Model to PyTorch
Train PyTorch BERT Models with SageMaker
Train Apache MXNet BERT Models with SageMaker
Train BERT Models with PyTorch and AWS Deep Java Library
Reduce Cost and Increase Performance
Use Small Notebook Instances
Test Model-Training Scripts Locally in the Notebook
Profile Training Jobs with SageMaker Debugger
Start with a Pre-Trained Model
Use 16-Bit Half Precision and bfloat16
Mixed 32-Bit Full and 16-Bit Half Precision
Quantization
Use Training-Optimized Hardware
Spot Instances and Checkpoints
Early Stopping Rule in SageMaker Debugger
Summary
Chapter 8. Train and Optimize Models at Scale
Automatically Find the Best Model Hyper-Parameters
Set Up the Hyper-Parameter Ranges
Run the Hyper-Parameter Tuning Job
Analyze the Best Hyper-Parameters from the Tuning Job
Show Experiment Lineage for Our SageMaker Tuning Job
Use Warm Start for Additional SageMaker Hyper-Parameter Tuning Jobs
Run HPT Job Using Warm Start
Analyze the Best Hyper-Parameters from the Warm-Start Tuning Job
Scale Out with SageMaker Distributed Training
Choose a Distributed-Communication Strategy
Choose a Parallelism Strategy
Choose a Distributed File System
Launch the Distributed Training Job
Reduce Cost and Increase Performance
Start with Reasonable Hyper-Parameter Ranges
Shard the Data with ShardedByS3Key
Stream Data on the Fly with Pipe Mode
Enable Enhanced Networking
Summary
Chapter 9. Deploy Models to Production
Choose Real-Time or Batch Predictions
Real-Time Predictions with SageMaker Endpoints
Deploy Model Using SageMaker Python SDK
Track Model Deployment in Our Experiment
Analyze the Experiment Lineage of a Deployed Model
Invoke Predictions Using the SageMaker Python SDK
Invoke Predictions Using HTTP POST
Create Inference Pipelines
Invoke SageMaker Models from SQL and Graph-Based Queries
Auto-Scale SageMaker Endpoints Using Amazon CloudWatch
Define a Scaling Policy with AWS-Provided Metrics
Define a Scaling Policy with a Custom Metric
Tuning Responsiveness Using a Cooldown Period
Auto-Scale Policies
Strategies to Deploy New and Updated Models
Split Traffic for Canary Rollouts
Shift Traffic for Blue/Green Deployments
Testing and Comparing New Models
Perform A/B Tests to Compare Model Variants
Reinforcement Learning with Multiarmed Bandit Testing
Monitor Model Performance and Detect Drift
Enable Data Capture
Understand Baselines and Drift
Monitor Data Quality of Deployed SageMaker Endpoints
Create a Baseline to Measure Data Quality
Schedule Data-Quality Monitoring Jobs
Inspect Data-Quality Results
Monitor Model Quality of Deployed SageMaker Endpoints
Create a Baseline to Measure Model Quality
Schedule Model-Quality Monitoring Jobs
Inspect Model-Quality Monitoring Results
Monitor Bias Drift of Deployed SageMaker Endpoints
Create a Baseline to Detect Bias
Schedule Bias-Drift Monitoring Jobs
Inspect Bias-Drift Monitoring Results
Monitor Feature Attribution Drift of Deployed SageMaker Endpoints
Create a Baseline to Monitor Feature Attribution
Schedule Feature Attribution Drift Monitoring Jobs
Inspect Feature Attribution Drift Monitoring Results
Perform Batch Predictions with SageMaker Batch Transform
Select an Instance Type
Set Up the Input Data
Tune the SageMaker Batch Transform Configuration
Prepare the SageMaker Batch Transform Job
Run the SageMaker Batch Transform Job
Review the Batch Predictions
AWS Lambda Functions and Amazon API Gateway
Optimize and Manage Models at the Edge
Deploy a PyTorch Model with TorchServe
TensorFlow-BERT Inference with AWS Deep Java Library
Reduce Cost and Increase Performance
Delete Unused Endpoints and Scale In Underutilized Clusters
Deploy Multiple Models in One Container
Attach a GPU-Based Elastic Inference Accelerator
Optimize a Trained Model with SageMaker Neo and TensorFlow Lite
Use Inference-Optimized Hardware
Summary
Chapter 10. Pipelines and MLOps
Machine Learning Operations
Software Pipelines
Machine Learning Pipelines
Components of Effective Machine Learning Pipelines
Steps of an Effective Machine Learning Pipeline
Pipeline Orchestration with SageMaker Pipelines
Create an Experiment to Track Our Pipeline Lineage
Define Our Pipeline Steps
Configure the Pipeline Parameters
Create the Pipeline
Start the Pipeline with the Python SDK
Start the Pipeline with the SageMaker Studio UI
Approve the Model for Staging and Production
Review the Pipeline Artifact Lineage
Review the Pipeline Experiment Lineage
Automation with SageMaker Pipelines
GitOps Trigger When Committing Code
S3 Trigger When New Data Arrives
Time-Based Schedule Trigger
Statistical Drift Trigger
More Pipeline Options
AWS Step Functions and the Data Science SDK
Kubeflow Pipelines
Apache Airflow
MLflow
TensorFlow Extended
Human-in-the-Loop Workflows
Improving Model Accuracy with Amazon A2I
Active-Learning Feedback Loops with SageMaker Ground Truth
Reduce Cost and Improve Performance
Cache Pipeline Steps
Use Less-Expensive Spot Instances
Summary
Chapter 11. Streaming Analytics and Machine Learning
Online Learning Versus Offline Learning
Streaming Applications
Windowed Queries on Streaming Data
Stagger Windows
Tumbling Windows
Sliding Windows
Streaming Analytics and Machine Learning on AWS
Classify Real-Time Product Reviews with Amazon Kinesis, AWS Lambda, and Amazon SageMaker
Implement Streaming Data Ingest Using Amazon Kinesis Data Firehose
Create Lambda Function to Invoke SageMaker Endpoint
Create the Kinesis Data Firehose Delivery Stream
Put Messages on the Stream
Summarize Real-Time Product Reviews with Streaming Analytics
Setting Up Amazon Kinesis Data Analytics
Create a Kinesis Data Stream to Deliver Data to a Custom Application
Create AWS Lambda Function to Send Notifications via Amazon SNS
Create AWS Lambda Function to Publish Metrics to Amazon CloudWatch
Transform Streaming Data in Kinesis Data Analytics
Understand In-Application Streams and Pumps
Amazon Kinesis Data Analytics Applications
Calculate Average Star Rating
Detect Anomalies in Streaming Data
Calculate Approximate Counts of Streaming Data
Create Kinesis Data Analytics Application
Start the Kinesis Data Analytics Application
Put Messages on the Stream
Classify Product Reviews with Apache Kafka, AWS Lambda, and Amazon SageMaker
Reduce Cost and Improve Performance
Aggregate Messages
Consider Kinesis Firehose Versus Kinesis Data Streams
Enable Enhanced Fan-Out for Kinesis Data Streams
Summary
Chapter 12. Secure Data Science on AWS
Shared Responsibility Model Between AWS and Customers
Applying AWS Identity and Access Management
IAM Users
IAM Policies
IAM User Roles
IAM Service Roles
Specifying Condition Keys for IAM Roles
Enable Multifactor Authentication
Least Privilege Access with IAM Roles and Policies
Resource-Based IAM Policies
Identity-Based IAM Policies
Isolating Compute and Network Environments
Virtual Private Cloud
VPC Endpoints and PrivateLink
Limiting Athena APIs with a VPC Endpoint Policy
Securing Amazon S3 Data Access
Require a VPC Endpoint with an S3 Bucket Policy
Limit S3 APIs for an S3 Bucket with a VPC Endpoint Policy
Restrict S3 Bucket Access to a Specific VPC with an S3 Bucket Policy
Limit S3 APIs with an S3 Bucket Policy
Restrict S3 Data Access Using IAM Role Policies
Restrict S3 Bucket Access to a Specific VPC with an IAM Role Policy
Restrict S3 Data Access Using S3 Access Points
Encryption at Rest
Create an AWS KMS Key
Encrypt the Amazon EBS Volumes During Training
Encrypt the Uploaded Model in S3 After Training
Store Encryption Keys with AWS KMS
Enforce S3 Encryption for Uploaded S3 Objects
Enforce Encryption at Rest for SageMaker Jobs
Enforce Encryption at Rest for SageMaker Notebooks
Enforce Encryption at Rest for SageMaker Studio
Encryption in Transit
Post-Quantum TLS Encryption in Transit with KMS
Encrypt Traffic Between Training-Cluster Containers
Enforce Inter-Container Encryption for SageMaker Jobs
Securing SageMaker Notebook Instances
Deny Root Access Inside SageMaker Notebooks
Disable Internet Access for SageMaker Notebooks
Securing SageMaker Studio
Require a VPC for SageMaker Studio
SageMaker Studio Authentication
Securing SageMaker Jobs and Models
Require a VPC for SageMaker Jobs
Require Network Isolation for SageMaker Jobs
Securing AWS Lake Formation
Securing Database Credentials with AWS Secrets Manager
Governance
Secure Multiaccount AWS Environments with AWS Control Tower
Manage Accounts with AWS Organizations
Enforce Account-Level Permissions with SCPs
Implement Multiaccount Model Deployments
Auditability
Tag Resources
Log Activities and Collect Events
Track User Activity and API Calls
Reduce Cost and Improve Performance
Limit Instance Types to Control Cost
Quarantine or Delete Untagged Resources
Use S3 Bucket KMS Keys to Reduce Cost and Increase Performance
Summary
Index
About the Authors
Colophon