Data Wrangling with SQL: A hands-on guide to manipulating, wrangling, and engineering data using SQL

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

Become a data wrangling expert and make well-informed decisions by effectively utilizing and analyzing raw unstructured data in a systematic manner Key Features Implement query optimization during data wrangling using the SQL language with practical use cases Master data cleaning, handle the date function and null value, and write subqueries and window functions Practice self-assessment questions for SQL-based interviews and real-world case study rounds Book Description The amount of data generated continues to grow rapidly, making it increasingly important for businesses to be able to wrangle this data and understand it quickly and efficiently. Although data wrangling can be challenging, with the right tools and techniques you can efficiently handle enormous amounts of unstructured data. The book starts by introducing you to the basics of SQL, focusing on the core principles and techniques of data wrangling. You’ll then explore advanced SQL concepts like aggregate functions, window functions, CTEs, and subqueries that are very popular in the business world. The next set of chapters will walk you through different functions within SQL query that cause delays in data transformation and help you figure out the difference between a good query and bad one. You’ll also learn how data wrangling and data science go hand in hand. The book is filled with datasets and practical examples to help you understand the concepts thoroughly, along with best practices to guide you at every stage of data wrangling. By the end of this book, you’ll be equipped with essential techniques and best practices for data wrangling, and will predominantly learn how to use clean and standardized data models to make informed decisions, helping businesses avoid costly mistakes. What you will learn Build time series models using data wrangling Discover data wrangling best practices as well as tips and tricks Find out how to use subqueries, window functions, CTEs, and aggregate functions Handle missing data, data types, date formats, and redundant data Build clean and efficient data models using data wrangling techniques Remove outliers and calculate standard deviation to gauge the skewness of data Who This Book Is For This book is for data analysts looking for effective hands-on methods to manage and analyze large volumes of data using SQL. The book will also benefit data scientists, product managers, and basically any role wherein you are expected to gather data insights and develop business strategies using SQL as a language. If you are new to or have basic knowledge of SQL and databases and an understanding of data cleaning practices, this book will give you further insights into how you can apply SQL concepts to build clean, standardized data models for accurate analysis.

Author(s): Raghav Kandarpa | Shivangi Saxena

Edition: 1
Publisher: Packt Publishing Pvt Ltd
Year: 2023

Language: English
Pages: 1196

Data Wrangling with SQL
Acknowledgements
Dedication
Contributors
About the authors
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1:Data Wrangling Introduction
1
Database Introduction
Getting started
Establishing the foundation
Efficient data organization
Data integrity and consistency
Technical requirements
Decoding database structures – relational and non-relational
What is a database?
Types of databases
Tables and relationships
The SQL CREATE DATABASE statement
The SQL CREATE TABLE statement
SQL DROP TABLE versus TRUNCATE TABLE
SQL ALTER TABLE
SQL constraints
SQL keys
Database relationships
Comparing database normalization and denormalization
Normalization
Types of normalization
Denormalization
When to apply denormalization
Disadvantages of denormalization
Summary
Practical exercises
Practical exercise 1
Practical exercise 2
Practical exercise 3
Practical exercise 4
2
Data Profiling and Preparation before Data Wrangling
What is data wrangling?
Data wrangling steps
The importance of data wrangling
Benefits of data wrangling
Data wrangling use cases
Business use cases
Data capture
How does data get captured?
Data-capturing techniques
Web scraping
Structured versus unstructured data
Paid-for versus free data-wrangling tools
Data profiling
Data profiling types
Data profiling techniques
Practical exercise
Step 1 – Discovery
Step 2 – Structuring
Step 3 – Cleaning
Step 4 – Enriching
Step 5 – Validating
Step 6 – Publishing
Summary
Part 2:Data Wrangling Techniques Using SQL
3
Data Wrangling on String Data Types
SQL data types
Numeric data types
Date and time data types
String data type
SQL string functions
RIGHT()
LEFT()
LEN()
TRIM()
RTRIM()
LTRIM()
RPAD()
LPAD()
REPLACE()
REVERSE()
SUBSTRING()
CAST()
CONCATENATE()
CONCATENATE_WS()
UPPER function
LOWER function
INITCAP function
INSTR function
Summary
Practical exercises
Practical exercise 1
Practical exercise 2
Practical exercise 3
Practical exercise 4
4
Data Wrangling on the DATE Data Type
SQL DATE data type functions
EXTRACT
DATEDIFF()
TIMEDIFF()
DATE_ADD()
DATE_SUB()
DATE_FORMAT()
STR_TO_DATE()
Extracting the current date and time
Summary
5
Handling NULL Values
The impact of missing data and NULL values on data analysis
Understanding the importance of data validation and cleaning before analyzing data
Identifying NULL/missing values
NULL values versus zero values
Using the IS NULL and IS NOT NULL operators to filter and select data with NULL values
IS NULL() and IS NOT NULL() – scenario
Using the COALESCE and IFNULL functions to replace NULL values with a default value
IFNULL()
COALESCE()
IS NULL versus = NULL
Summary
6
Pivoting Data Using SQL
SQL Transpose – rows to columns
Use case scenario
SQL Cross Tab – columns to rows
Use case scenario
Unpivoting data in SQL
Analytical workflow – from SQL to business intelligence – transforming data into actionable insights
Summary
Part 3:SQL Subqueries, Aggregate And Window Functions
7
Subqueries and CTEs
Introduction to subqueries
Simple subqueries
Correlated subqueries
Using subqueries in SELECT statements
Using subqueries in FROM statements
Using subqueries in WHERE statements
Nested subqueries
Correlated subqueries
Using subqueries in INSERT, UPDATE, and DELETE statements
Managing and maintaining subqueries
Common table expressions
Performance considerations for subqueries and CTEs
Subquery versus CTEs
Summary
8
Aggregate Functions
Overview of aggregate functions in SQL
Using GROUP BY
COUNT()
SUM()
AVG()
MIN() and MAX()
COUNT(DISTINCT)
Case scenario – using all aggregate functions
Summary
9
SQL Window Functions
The importance of SQL window functions
SQL aggregate functions
SQL window functions versus aggregate functions
Window functions versus aggregate functions – an example to illustrate the differences
Window functions
SUM()
COUNT()
AVG()
ROW_NUMBER()
RANK() and DENSE_RANK()
Lead() and Lag()
NTILE()
Summary
Part 4:Optimizing Query Performance
10
Optimizing Query Performance
Introduction to query optimization
Query execution plan
Query optimization techniques
Example
Caching
Normalization
Query monitoring and troubleshooting
Query profiling
Query logging
Database monitoring
Tips and tricks for writing efficient queries
Summary
In the next chapter, we will learn about descriptive statistics using SQL, which will provide us with insights into the distribution, central tendency, and variability of data, which can, in turn, help us identify outliers and anomalies. Common SQL functions and statements used for descriptive statistics include COUNT, AVG, MIN, MAX, and GROUP BY. By using SQL to analyze data, researchers and analysts can efficiently extract and summarize information from large datasets.
Part 5:Data Science And Wrangling
11
Descriptive Statistics with SQL
Calculating descriptive statistics with SQL
Mean
Median
Mode
Standard deviation
Variance
Variability
Summary
In the next chapter, we will learn how SQL can be used for time series analysis.
12
Time Series with SQL
Running totals
Case scenario
Lead and lag for time series analysis
Case scenario
Key KPIs
Percentage change
Case scenario
Key KPIs
Moving averages
Case scenario
Key KPIs
Rank for time series analysis
Case scenario
Key KPIs
CTE for time series analysis
Importance of using CTEs while performing time series analysis
Forecasting with linear regression
Case scenario
Key KPIs
Summary
In the next chapter, we will learn different methods to find outliers in the data easily. Outlier detection is an important aspect of data analysis as it helps determine if the data is correct, looks at the skewness of the data, and removes any unexpected values.
13
Outlier Detection
Measures of central tendency and dispersion
Case scenario
Key KPIs
Methods for detecting outliers
Box plot method
Handling outliers
Case scenario
Key points to keep in mind while handling outliers
Applying outlier detection
Challenges and limitations
Best practices
Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book