Querying Databricks with Spark SQL: Leverage SQL to query and analyze Big Data for insights

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

A practical guide to using Spark SQL to perform complex queries on your Databricks data Description Databricks stands out as a widely embraced platform dedicated to the creation of data lakes. Within its framework, it extends support to a specialized version of Structured Query Language (SQL) known as Spark SQL. If you are interested in learning more about how to use Spark SQL to analyze data in a data lake, then this book is for you. The book covers everything from basic queries to complex data-processing tasks. It begins with an introduction to SQL and Spark. It then covers the basics of SQL, including data types, operators, and clauses. The next few chapters focus on filtering, aggregation, and calculation. Additionally, it covers dates and times, formatting output, and using logic in your queries. It also covers joining tables, subqueries, derived tables, and common table expressions. Additionally, it discusses correlated subqueries, joining and filtering datasets, using SQL in calculations, segmenting and classifying data, rolling analysis, and analyzing data over time. The book concludes with a chapter on advanced data presentation. By the end of the book, you will be able to use Spark SQL to perform complex data analysis tasks on data lakes. What you will learn ● Use Spark SQL to read data from a data lake. ● Learn how to filter, aggregate, and calculate data using Spark SQL. ● Learn how to join tables, use subqueries, and create derived tables in Spark SQL. ● Present data in a visually appealing way using Spark SQL. Who this book is for This book is for anyone who wants to learn how to use SQL to analyze big data. Whether you are a data analyst, student, database developer, accountant, business analyst, data scientist, or anyone else who needs to extract insights from large datasets, this book will teach you the skills you need to get the job done.

Author(s): Adam Aspin
Publisher: BPB Online
Year: 2023

Language: English
Pages: 638

Cover
Title Page
Copyright Page
About the Author
About the Reviewer
Acknowledgement
Preface
Table of Contents
1. Writing Basic SQL Queries
Introduction
Structure
Objectives
Databricks
The Databricks Web interface
Getting started with Databricks notebooks
Cells
Spark SQL
Databricks databases (schemas) and tables
Activating a database
Displaying the data in a table
How does it work
Help writing SQL
Limiting the number of records displayed
How does it work
Displaying data from a specific field
How does it work
Finding the columns in a table
SQL writing style
Displaying data from a specific set of fields
How does it work
Columns or fields
Modifying the field name in the output using aliases
How does it work
Removing duplicates from query output
How does it work
Undo and redo
Sorting data
How does it work
Sorting data in reverse order
How does it work
Applying multiple sort criteria
How does it work
Running SQL queries
Displaying the available tables
Finding all the views in a database
Resizing book cells
Overriding the 1,000-row output limit
Conclusion
2. Filtering Data
Introduction
Structure
Objectives
Filtering text
How does it work
Applying multiple text filters
How does it work
SQL writing style
Excluding an element
How does it work
Using multiple exclusion filters
How does it work
Filtering numbers over a defined threshold
How does it work
Filtering numbers under a defined threshold
How does it work
Filtering on values up to and including a specific number
How does it work
Filtering on a range of values
How does it work
Using Boolean filters (true or false)
How does it work
Conclusion
3. Applying Complex Filters to Queries
Introduction
Structure
Objectives
Using either/or filters
How does it work
Using multiple separate criteria concurrently
How does it work
Using multiple filters and an exclusion
How does it work
Filtering on both text and numbers simultaneously
How does it work
Applying complex alternative filters at the same time
How does it work
Removing case-sensitivity in filters
How does it work
Using wildcard searches
How does it work
Forcing case-insensitivity in wildcard filters
How does it work
Using wildcards to exclude data
How does it work
Applying alternative wildcard patterns
How does it work
Using a specific part of the text to filter data
How does it work
Filtering NULLs or nonexistent data
How does it work
Searching using regular expressions
How does it work
Conclusion
4. Simple Calculations
Introduction
Structure
Objectives
Doing simple math
How does it work
Examining data types in SQL tables and views
Isolating sections of formulas when applying math
How does it work
Calculating ratios
How does it work
Increasing values by a defined percentage
How does it work
Ordering the output of calculations
How does it work
NULLs
Handling missing data
How does it work
Filtering on a calculation
How does it work
Using complex calculated filters
How does it work
Operator precedence
Conclusion
5. Aggregating Output
Introduction
Structure
Objectives
Calculating table totals
How does it work
Using calculated aggregations
How does it work
Using grouped aggregations
How does it work
Using multiple levels of grouping
How does it work
Calculating averages
How does it work
Counting grouped elements
How does it work
Counting unique elements
How does it work
Displaying upper and lower numeric thresholds
How does it work
Aggregating across multiple columns
How does it work
Aggregating across columns and rows
How does it work
Filtering groups
How does it work
Filtering on aggregated results
How does it work
Selecting data based on aggregated results as well as specific filter criteria
How does it work
Query analysis
Sorting by aggregated results
How does it work
Finding elements where every Boolean value is true
How does it work
Conclusion
6. Working with Dates in Databricks
Introduction
Structure
Objectives
Dates in Databricks
Filtering records by date
How does it work
Date datatypes in tables
Using a range of dates to filter data
How does it work
Finding the number of days between two dates
How does it work
Aggregating data over a date range
How does it work
Filtering by year
How does it work
Filtering records over a series of years
How does it work
Find sales for a specific day of the month
How does it work
Isolating data for a specific year and month
How does it work
Finding data for a given quarter
How does it work
Filtering data by weekday
How does it work
Finding records for a specific week of the year
How does it work
Aggregating data by the day of the week in a given year
How does it work
Analyzing data by day of year
How does it work
Grouping data by the full weekday
How does it work
Displaying cumulative data over a period of months to a specific date
How does it work
Displaying cumulative data over 90 days up to a specific date
How does it works
Displaying the data for the previous three months
How does it work
Finding the current system date
Conclusion
7. Formatting Text in Query Output
Introduction
Structure
Objectives
Adding text to the output
How does it work
Adding text to numbers
How does it work
Amalgamating columns
How does it work
Avoiding NULLs in text-based data
How does it work
Concatenating and grouping
How does it work
Adding multiple pieces of text to numbers
How does it work
Converting text to uppercase
How does it work
Converting text to lowercase
How does it work
Converting text to initial capitals
Extracting the first few characters from a field
How does it work
Displaying the three characters at the right of the text
How does it work
Displaying a given number of characters at a specific place in the text
How does it work
Replacing NULLs with the contents of another field
How does it work
Filtering records based on the part of a field
How does it work
Filtering data using specific characters at a given position inside a field
How does it work
Conclusion
8. Formatting Numbers and Dates
Introduction
Structure
Objectives
Removing the decimals from the output
How does it work
Rounding a field up to the nearest whole number
How does it work
Rounding a value to the nearest whole number
How does it work
Rounding to a specific number of decimals
Rounding a value up or down to the nearest thousand
How does it work
Banker’s rounding
Displaying a value in a specific numeric format
How does it work
Displaying a value in a specific currency
How does it work
Outputting a date in a specific date format
How does it work
Presenting the time in a specific format
Conclusion
9. Using Basic Logic to Enhance Analysis
Introduction
Structure
Objectives
Generating an alert when a value is too high
How does it work
Shortening text and adding ellipses to indicate truncation
How does it work
Designing complex calculated alerts
How does it work
Creating key performance indicators
How does it work
Classifying a series of elements without the necessary categories present in your data
How does it work
Creating ad hoc category groupings
How does it work
Applying multiple ad hoc categories
How does it work
Categorizing data using nested classifications
How does it work
Choosing elements from a list
How does it work
Placing nulls at the start or end of a list
How does it work
Conclusion
10. Using Multiple Tables When Querying Data
Introduction
Structure
Objectives
Joining tables
How does it work
Other join syntax
Joining multiple tables
How does it work
Join fields
Join subtleties
How does it work
Using table aliases
How does it work
Joining many tables
How does it work
Visualizing databases
Querying across databases
Conclusion
11. Using Advanced Table Joins
Introduction
Structure
Objectives
Filtering data using inner joins
Filtering data using multiple tables join
How does it work
Semi joins
How does it work
Filtering data output using intermediate tables
How does it work
Using left joins to return all the data in one table but not from the other table
How does it work
Right joins to return all the data in one table but not from the other
How does it work
Full joins to return all the data from both tables in a join
How does it work
Intermediate table joins
How does it work
Using multiple fields in joins
How does it work
Joining a table to itself
How does it work
Joining tables on ranges of values
How does it work
Cross joins
How does it work
Join concepts
Conclusion
12. Subqueries
Introduction
Structure
Objectives
Adding aggregated fields to detailed datasets
How does it work
Displaying a value as the percentage of a total
How does it work
Using a subquery to filter data
How does it work
Using a subquery as part of a calculation to filter data
How does it work
Filtering on an aggregated range of data using multiple subqueries
How does it work
Filtering on aggregated output using a second aggregation
How does it work
Using multiple results from a subquery to filter data
How does it work
Complex aggregated subqueries
How does it work
Nested subqueries
How does it work
Using subqueries to exclude data
How does it work
Filtering across queries and subqueries
How does it work
Applying separate filters to the Subquery and the main query
How does it work
Conclusion
13. Derived Tables
Introduction
Structure
Objectives
Using a derived table to create intermediate calculations
How does it work
Grouping and ordering data using a custom classification
How does it work
Joining derived tables with other tables
How does it work
Joining multiple derived tables
How does it work
Using multiple derived tables for complex aggregations
How does it work
Data visibility
Using derived tables to join unconnected tables
How does it work
Compare year-on-year data using a derived table
How does it work
Synchronizing filters between a derived table and the main query
How does it work
Conclusion
14. Common Table Expressions
Introduction
Structure
Objectives
Simplifying complex queries with CTEs
A basic common table expression
How does it work
Calculating averages across multiple values using a CTE
How does it work
CTE or derived table?
Reusing CTEs in a query
How does it work
Using a CTE in a derived table to deliver two different levels of aggregation
How does it work
Using a CTE to isolate data from a separate dataset at a different level of detail
How does it work
Multiple common table expressions in a single query
How does it work
Nested CTEs
How does it work
Using multiple CTEs to compare disparate datasets
How does it work
Conclusion
15. Correlated Subqueries
Introduction
Structure
Objectives
Simple correlated subqueries
How does it work
Correlated subqueries to display percentages of a specific total
How does it work
Comparing datasets using a correlated subquery
How does it work
Avoid correlated subqueries in certain cases
Duplicating the output of a correlated subquery in the query results
How does it work
Using correlated subqueries to filter data on an aggregate value
How does it work
What makes a query correlated
Using correlated subqueries to detect if records exist
How does it work
Using a correlated subquery to exclude data
How does it work
Conclusion
16. Dataset Manipulation
Introduction
Structure
Objectives
Read data from multiple identical tables using the UNION operator
How does it work
Isolate identical data in multiple tables using the INTERSECT operator
How does it work
Isolating nonidentical records using the EXCEPT operator
How does it work
Joining multiple identical tables in a subquery
How does it work
Conclusion
17. Using SQL for More Advanced Calculations
Introduction
Structure
Objectives
Calculating the percentage represented by each record in a dataset
How does it work
Replacing multiple subqueries
How does it work
Remove decimals in calculations
How does it work
Numeric datatypes
Converting between numeric datatypes
How does it work
Avoiding divide-by-zero errors
How does it work
Finding the remainder in a division using the modulo function
How does it work
Creating financial calculations
How does it work
Using a tally table to produce a sequential list of numbers
How does it work
Generating completely random sample output from a dataset
How does it work
Handling source data where figures are stored as text
How does it work
Conclusion
18. Segmenting and Classifying Data
Introduction
Structure
Objectives
Organizing data by rank
How does it work
Creating multiple groups of rankings
How does it work
Creating multiple ranked groups and subgroups
How does it work
Filtering data by ranked items
How does it work
Classifying data by strict order of rank
How does it work
Segment data into deciles
How does it work
Plot values for a percentile
How does it work
Extracting data from a specific quintile
How does it work
Display median values
How does it work
Conclusion
19. Rolling Analysis
Introduction
Structure
Objectives
Adding a running total
How does it work
Using windowing functions in an aggregated query
How does it work
Grouping running totals
How does it work
Applying windowing functions to a subquery
How does it work
Adding unique ids on the fly using ROW_NUMBER()
How does it work
Displaying records for missing data
How does it work
Displaying a complete range of dates and relevant data
How does it work
Comparing data with the data from a previous record
How does it work
Comparing data over time using the FIRST_VALUE() and LAST_VALUE() functions
How does it work
Displaying rolling averages over a specified number of records
How does it work
Show the first sale and last four sales per client
How does it work
Calculating cumulative distribution
How does it work
Classifying data using the PERCENT_RANK() function
How does it work
Using the LAG() function with alphabetical data
How does it work
Conclusion
20. Analyzing Data Over Time
Introduction
Structure
Objectives
Aggregating values for the year to date
How does it work
Isolating data for the previous month
How does it work
Using a derived table to compare data with values from a previous year
How does it work
Finding the total amount for sales each weekday over a year
How does it work
Count the number of weekend days between two dates
How does it work
Aggregate data for the last day of the month
How does it work
Aggregate data for the last Friday of the month
How does it work
Analyzing timespans as years, months, and days
How does it work
Isolate time periods from date and time data
How does it work
Listing data by time of day
How does it work
Aggregating data by hourly bandings
How does it work
Aggregate data by a quarter of an hour
How does it work
Reading dates and times stored as strings
How does it work
Conclusion
21. Complex Data Output
Introduction
Structure
Objectives
Creating a pivot table
How does it work
Creating a pivot table displaying multiple row groupings
How does it work
Adding totals to aggregate queries
How does it work
Creating subtotals and totals in aggregated queries
How does it work
Creating clear tables that include totals and subtotals
How does it work
Replace acronyms with full text in the final output
How does it work
Conclusion
Index