Data Wrangling on AWS: Clean and organize complex data for analysis

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to realize the full potential of AWS data wrangling tools. First, you'll be introduced to data wrangling on AWS and will become familiar with the data wrangling services available in AWS. You'll understand how to work with AWS Glue DataBrew, the AWS SDK for pandas, and Amazon SageMaker. Next, you'll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you'll explore advanced topics such as performing pandas data operations with the AWS SDK for pandas, optimizing ML data with SageMaker Data Wrangler, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects. By the end of this book, you'll be well-equipped to perform data wrangling using AWS services.
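The cleaning, enrichment, and validation steps the blurb describes can be sketched with plain pandas (the library the AWS SDK for pandas builds on). This is a minimal illustration on a hypothetical messy dataset, not an example from the book: the column names and quality rules are assumptions for demonstration.

```python
import pandas as pd

# Hypothetical messy records: a duplicate, missing values,
# inconsistent casing, and numbers stored as strings.
raw = pd.DataFrame({
    "customer": ["Alice", "alice", "Bob", None, "Carol"],
    "amount": ["100", "100", "250", "75", None],
    "region": ["us-east", "US-EAST", "us-west", "us-east", "eu-west"],
})

# Data cleaning: normalize casing, coerce types, drop rows
# that are missing key fields.
df = raw.assign(
    customer=raw["customer"].str.title(),
    region=raw["region"].str.lower(),
    amount=pd.to_numeric(raw["amount"], errors="coerce"),
).dropna(subset=["customer", "amount"])

# Deduplicate on the normalized columns.
df = df.drop_duplicates(subset=["customer", "region", "amount"])

# Data validation: simple quality checks before publishing.
assert df["amount"].ge(0).all()
assert df["customer"].notna().all()

print(df)
```

In the book's AWS context, the same pattern runs at scale: the AWS SDK for pandas reads and writes these DataFrames against S3, Athena, and Redshift, while Glue DataBrew and SageMaker Data Wrangler offer visual, low-code equivalents of these transforms.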

Author(s): Navnit Shukla | Sankar M | Sam Palani
Edition: 1
Publisher: Packt Publishing Pvt Ltd
Year: 2023

Language: English
Pages: 886

Data Wrangling on AWS
Contributors
About the authors
About the reviewer
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1: Unleashing Data Wrangling with AWS
1
Getting Started with Data Wrangling
Introducing data wrangling
The 80-20 rule of data analysis
Advantages of data wrangling
The steps involved in data wrangling
Data discovery
Data structuring
Data cleaning
Data enrichment
Data validation
Data publishing
Best practices for data wrangling
Identifying the business use case
Identifying the data source and bringing the right data
Identifying your audience
Options available for data wrangling on AWS
AWS Glue DataBrew
SageMaker Data Wrangler
AWS SDK for pandas
Summary
Part 2: Data Wrangling with AWS Tools
2
Introduction to AWS Glue DataBrew
Why AWS Glue DataBrew?
AWS Glue DataBrew’s basic building blocks
Getting started with AWS Glue DataBrew
Understanding the pricing of AWS Glue DataBrew
Using AWS Glue DataBrew for data wrangling
Identifying the dataset
Downloading the sample dataset
Data discovery – creating an AWS Glue DataBrew profile for a dataset
Data cleaning and enrichment – AWS Glue DataBrew transforms
Data validation – performing data quality checks using AWS Glue DataBrew
Data publication – fixing data quality issues
Event-driven data quality check using Glue DataBrew
Data protection with AWS Glue DataBrew
Encryption at rest
Encryption in transit
Identifying and handling PII
Data lineage and data publication
Summary
3
Introducing AWS SDK for pandas
AWS SDK for pandas
Building blocks of AWS SDK for pandas
Arrow
pandas
Boto3
Customizing, building, and installing AWS SDK for pandas for different use cases
Standard and custom installation on your local machine or Amazon EC2
Standard and custom installation with Lambda functions
Standard and custom installation for AWS Glue jobs
Standard and custom installation on Amazon SageMaker notebooks
Configuration options for AWS SDK for pandas
Setting up global variables
Common use cases for configuring
The features of AWS SDK for pandas with different AWS services
Amazon S3
Amazon Athena
RDS databases
Redshift
Summary
4
Introduction to SageMaker Data Wrangler
Data import
Data orchestration
Data transformation
Insights and data quality
Data analysis
Data export
SageMaker Studio setup prerequisites
Prerequisites
Studio domain
Studio onboarding steps
Summary
Part 3: AWS Data Management and Analysis
5
Working with Amazon S3
What is big data?
5 Vs of big data
What is a data lake?
Building a data lake on Amazon S3
Advantages of building a data lake on Amazon S3
Design principles for a data lake on Amazon S3
Data lake layouts
Organizing and structuring data within an Amazon S3 data lake
Process of building a data lake on Amazon S3
Selecting the right file format for a data lake
Selecting the right compression method for a data lake
Choosing the right partitioning strategy for a data lake
Configuring Amazon S3 Lifecycle for a data lake
Optimizing the number of files and the size of each file
Challenges and considerations when building a data lake on Amazon S3
Summary
6
Working with AWS Glue
What is Apache Spark?
Apache Spark architecture
Apache Spark framework
Resilient Distributed Datasets
Datasets and DataFrames
Data discovery with AWS Glue
AWS Glue Data Catalog
Glue Connections
AWS Glue crawlers
Table stats
Data ingestion using AWS Glue ETL
AWS GlueContext
DynamicFrame
AWS Glue Job bookmarks
AWS Glue Triggers
AWS Glue interactive sessions
AWS Glue Studio
Ingesting data from object stores
Summary
7
Working with Athena
Understanding Amazon Athena
When to use SQL/Spark analysis options?
Advanced data discovery and data structuring with Athena
SQL-based data discovery with Athena
Using CTAS for data structuring
Enriching data from multiple sources using Athena
Enriching data using Athena SQL joins
Setting up data federation for source databases
Enriching data with data federation
Setting up a serverless data quality pipeline with Athena
Implementing data quality rules in Athena
Amazon DynamoDB as a metadata store for data quality pipelines
Serverless data quality pipeline
Automating the data quality pipeline
Summary
8
Working with QuickSight
Introducing Amazon QuickSight and its concepts
Data discovery with QuickSight
QuickSight-supported data sources and setup
Data discovery with QuickSight analysis
QuickSight Q and AI-based data analysis/discovery
Data visualization with QuickSight
Visualization and charts with QuickSight
Embedded analytics
Summary
Part 4: Advanced Data Manipulation and ML Data Optimization
9
Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas
A solution walkthrough for sportstickets.com
Prerequisites for data ingestion
When would you use them?
Loading sample data into a source database
Data discovery
Exploring data using S3 Select commands
Access through Amazon Athena and the Glue Catalog
Data structuring
Different file formats and when to use them
Restructuring data using Pandas
Flattening nested data with Pandas
Data cleaning
Data cleansing with Pandas
Data enrichment
Pandas operations for data transformation
Data quality validation
Data quality validation with Pandas
Data quality validation integration with a data pipeline
Data visualization
Visualization with Python libraries
Summary
10
Data Processing for Machine Learning with SageMaker Data Wrangler
Technical requirements
Step 1 – logging in to SageMaker Studio
Step 2 – importing data
Exploratory data analysis
Built-in data insights
Step 3 – creating data analysis
Step 4 – adding transformations
Categorical encoding
Custom transformation
Numeric scaling
Dropping columns
Step 5 – exporting data
Training a machine learning model
Summary
Part 5: Ensuring Data Lake Security and Monitoring
11
Data Lake Security and Monitoring
Data lake security
Data lake access control
Additional options to control data lake access
AWS Lake Formation integration
Data protection
Securing your data in AWS Glue
Monitoring and auditing
Amazon CloudWatch
Monitoring an AWS Glue job using AWS Glue ETL job monitoring
Amazon CloudTrail
Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book