Azure Storage, Streaming, and Batch Analytics: A guide for data engineers

The Microsoft Azure cloud is an ideal platform for data-intensive applications. Designed for productivity, Azure provides pre-built services that make collection, storage, and analysis much easier to implement and manage. Azure Storage, Streaming, and Batch Analytics teaches you how to design a reliable, performant, and cost-effective data infrastructure in Azure by progressively building a complete working analytics system.

About the technology
Microsoft Azure provides dozens of services that simplify storing and processing data. These services are secure, reliable, scalable, and cost-efficient.

About the book
Azure Storage, Streaming, and Batch Analytics shows you how to build state-of-the-art data solutions with tools from the Microsoft Azure platform. Read along to construct a cloud-native data warehouse, adding features like real-time data processing. Based on the Lambda architecture for big data, the design uses scalable services such as Event Hubs, Stream Analytics, and SQL databases. Along the way, you'll cover most of the topics needed to earn an Azure data engineering certification.
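To give a flavor of the book's hands-on, script-first approach, here is a minimal Azure PowerShell sketch of the kind of provisioning the chapters and appendix A walk through: a resource group plus the Event Hubs ingestion tier of the Lambda architecture. The resource names, region, and partition count below are illustrative assumptions, not values taken from the book.

```powershell
# A minimal sketch of provisioning the streaming-ingestion tier with the
# Az PowerShell modules. All names and the region are illustrative
# assumptions, not values from the book.
Connect-AzAccount

# Resource group to hold the analytics services
New-AzResourceGroup -Name "rg-analytics-demo" -Location "eastus"

# Event Hubs namespace: the container for one or more Event Hubs
New-AzEventHubNamespace -ResourceGroupName "rg-analytics-demo" `
    -Name "ehn-analytics-demo" -Location "eastus" -SkuName "Standard"

# An Event Hub within the namespace; partitions enable parallel consumers
New-AzEventHub -ResourceGroupName "rg-analytics-demo" `
    -NamespaceName "ehn-analytics-demo" -Name "biometricstats" `
    -PartitionCount 4
```

Chapter 5 and appendix A cover the Event Hubs setup in depth, including throughput units, partitions, and shared access policies; the remaining services in the design are created the same way, through the portal or PowerShell.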

Author(s): Richard L. Nuckolls
Edition: 1
Publisher: Manning Publications
Year: 2020

Language: English
Pages: 448
City: Shelter Island, NY
Tags: Microsoft Azure; Analytics; PowerShell; SQL; Stream Processing; Data Lake; Lambda Architecture; Azure Event Hub; Azure Data Factory; Azure Data Lake Analytics; Azure Stream Analytics; Data Engineering; Azure Storage; Azure Data Lake Storage; U-SQL; Azure SQL Database

Azure Storage, Streaming, and Batch Analytics
brief contents
contents
preface
acknowledgements
about this book
Who should read this book
How this book is organized: a roadmap
About the code
Author online
about the author
about the cover illustration
1 What is data engineering?
1.1 What is data engineering?
1.2 What do data engineers do?
1.3 How does Microsoft define data engineering?
1.3.1 Data acquisition
1.3.2 Data storage
1.3.3 Data processing
1.3.4 Data queries
1.3.5 Orchestration
1.3.6 Data retrieval
1.4 What tools does Azure provide for data engineering?
1.5 Azure Data Engineers
1.6 Example application
Summary
2 Building an analytics system in Azure
2.1 Fundamentals of Azure architecture
2.1.1 Azure subscriptions
2.1.2 Azure regions
2.1.3 Azure naming conventions
2.1.4 Resource groups
2.1.5 Finding resources
2.2 Lambda architecture
2.3 Azure cloud services
2.3.1 Azure analytics system architecture
2.3.2 Event Hubs
2.3.3 Stream Analytics
2.3.4 Data Lake Storage
2.3.5 Data Lake Analytics
2.3.6 SQL Database
2.3.7 Data Factory
2.3.8 Azure PowerShell
2.4 Walk-through of processing a series of event data records
2.4.1 Hot path
2.4.2 Cold path
2.4.3 Choosing abstract Azure services
2.5 Calculating cloud hosting costs
2.5.1 Event Hubs
2.5.2 Stream Analytics
2.5.3 Data Lake Storage
2.5.4 Data Lake Analytics
2.5.5 SQL Database
2.5.6 Data Factory
Summary
3 General storage with Azure Storage accounts
3.1 Cloud storage services
3.1.1 Before you begin
3.2 Creating an Azure Storage account
3.2.1 Using Azure portal
3.2.2 Using Azure PowerShell
3.2.3 Azure Storage replication
3.3 Storage account services
3.3.1 Blob storage
3.3.2 Creating a Blobs service container
3.3.3 Blob tiering
3.3.4 Copy tools
3.3.5 Queues
3.3.6 Creating a queue
3.3.7 Azure Storage queue options
3.4 Storage account access
3.4.1 Blob container security
3.4.2 Designing Storage account access
3.5 Exercises
3.5.1 Exercise 1
3.5.2 Exercise 2
Summary
4 Azure Data Lake Storage
4.1 Create an Azure Data Lake store
4.1.1 Using Azure portal
4.1.2 Using Azure PowerShell
4.2 Data Lake store access
4.2.1 Access schemes
4.2.2 Configuring access
4.2.3 Hierarchy structure in the Data Lake store
4.3 Storage folder structure and data drift
4.3.1 Hierarchy structure revisited
4.3.2 Data drift
4.4 Copy tools for Data Lake stores
4.4.1 Data Explorer
4.4.2 ADLCopy tool
4.4.3 Azure Storage Explorer tool
4.5 Exercises
4.5.1 Exercise 1
4.5.2 Exercise 2
Summary
5 Message handling with Event Hubs
5.1 How does an Event Hub work?
5.2 Collecting data in Azure
5.3 Create an Event Hubs namespace
5.3.1 Using Azure PowerShell
5.3.2 Throughput units
5.3.3 Event Hub geo-disaster recovery
5.3.4 Failover with geo-disaster recovery
5.4 Creating an Event Hub
5.4.1 Using Azure portal
5.4.2 Using Azure PowerShell
5.4.3 Shared access policy
5.5 Event Hub partitions
5.5.1 Multiple consumers
5.5.2 Why specify a partition?
5.5.3 Why not specify a partition?
5.5.4 Event Hubs message journal
5.5.5 Partitions and throughput units
5.6 Configuring Capture
5.6.1 File name formats
5.6.2 Secure access for Capture
5.6.3 Enabling Capture
5.6.4 The importance of time
5.7 Securing access to Event Hubs
5.7.1 Shared Access Signature policies
5.7.2 Writing to Event Hubs
5.8 Exercises
5.8.1 Exercise 1
5.8.2 Exercise 2
5.8.3 Exercise 3
Summary
6 Real-time queries with Azure Stream Analytics
6.1 Creating a Stream Analytics service
6.1.1 Elements of a Stream Analytics job
6.1.2 Create an ASA job using the Azure portal
6.1.3 Create an ASA job using Azure PowerShell
6.2 Configuring inputs and outputs
6.2.1 Event Hub job input
6.2.2 ASA job outputs
6.3 Creating a job query
6.3.1 Starting the ASA job
6.3.2 Failure to start
6.3.3 Output exceptions
6.4 Writing job queries
6.4.1 Window functions
6.4.2 Machine learning functions
6.5 Managing performance
6.5.1 Streaming units
6.5.2 Event ordering
6.6 Exercises
6.6.1 Exercise 1
6.6.2 Exercise 2
Summary
7 Batch queries with Azure Data Lake Analytics
7.1 U-SQL language
7.1.1 Extractors
7.1.2 Outputters
7.1.3 File selectors
7.1.4 Expressions
7.2 U-SQL jobs
7.2.1 Selecting the biometric data files
7.2.2 Schema extraction
7.2.3 Aggregation
7.2.4 Writing files
7.3 Creating a Data Lake Analytics service
7.3.1 Using Azure portal
7.3.2 Using Azure PowerShell
7.4 Submitting jobs to ADLA
7.4.1 Using Azure portal
7.4.2 Using Azure PowerShell
7.5 Efficient U-SQL job executions
7.5.1 Monitoring a U-SQL job
7.5.2 Analytics units
7.5.3 Vertexes
7.5.4 Scaling the job execution
7.6 Using Blob Storage
7.6.1 Constructing Blob file selectors
7.6.2 Adding a new data source
7.6.3 Filtering rowsets
7.7 Exercises
7.7.1 Exercise 1
7.7.2 Exercise 2
Summary
8 U-SQL for complex analytics
8.1 Data Lake Analytics Catalog
8.1.1 Simplifying U-SQL queries
8.1.2 Simplifying data access
8.1.3 Loading data for reuse
8.2 Window functions
8.3 Local C# functions
8.4 Exercises
8.4.1 Exercise 1
8.4.2 Exercise 2
Summary
9 Integrating with Azure Data Lake Analytics
9.1 Processing unstructured data
9.1.1 Azure Cognitive Services
9.1.2 Managing assemblies in the Data Lake
9.1.3 Image data extraction with Advanced Analytics
9.2 Reading different file types
9.2.1 Adding custom libraries with a Catalog
9.2.2 Creating a catalog database
9.2.3 Building the U-SQL DataFormats solution
9.2.4 Code folders
9.2.5 Using custom assemblies
9.3 Connecting to remote sources
9.3.1 External databases
9.3.2 Credentials
9.3.3 Data Source
9.3.4 Tables and views
9.4 Exercises
9.4.1 Exercise 1
9.4.2 Exercise 2
Summary
10 Service integration with Azure Data Factory
10.1 Creating an Azure Data Factory service
10.2 Secure authentication
10.2.1 Azure Active Directory integration
10.2.2 Azure Key Vault
10.3 Copying files with ADF
10.3.1 Creating a Files storage container
10.3.2 Adding secrets to AKV
10.3.3 Creating a Files storage linked service
10.3.4 Creating an ADLS linked service
10.3.5 Creating a pipeline and activity
10.3.6 Creating a scheduled trigger
10.4 Running an ADLA job
10.4.1 Creating an ADLA linked service
10.4.2 Creating a pipeline and activity
10.5 Exercises
10.5.1 Exercise 1
10.5.2 Exercise 2
Summary
11 Managed SQL with Azure SQL Database
11.1 Creating an Azure SQL Database
11.1.1 Create a SQL Server and SQLDB
11.2 Securing SQLDB
11.3 Availability and recovery
11.3.1 Restoring and moving SQLDB
11.3.2 Database safeguards
11.3.3 Creating alerts for SQLDB
11.4 Optimizing costs for SQLDB
11.4.1 Pricing structure
11.4.2 Scaling SQLDB
11.4.3 Serverless
11.4.4 Elastic Pools
11.5 Exercises
11.5.1 Exercise 1
11.5.2 Exercise 2
11.5.3 Exercise 3
11.5.4 Exercise 4
Summary
12 Integrating Data Factory with SQL Database
12.1 Before you begin
12.2 Importing data with external data sources
12.2.1 Creating a database scoped credential
12.2.2 Creating an external data source
12.2.3 Creating an external table
12.2.4 Importing Blob files
12.3 Importing file data with ADF
12.3.1 Authenticating between ADF and SQLDB
12.3.2 Creating a SQL Database linked service
12.3.3 Creating datasets
12.3.4 Creating a copy activity and pipeline
12.4 Exercises
12.4.1 Exercise 1
12.4.2 Exercise 2
12.4.3 Exercise 3
Summary
13 Where to go next
13.1 Data catalog
13.1.1 Data Catalog as a service
13.1.2 Data locations
13.1.3 Data definitions
13.1.4 Data frequency
13.1.5 Business drivers
13.2 Version control and backups
13.2.1 Blob Storage
13.2.2 Data Lake Storage
13.2.3 Stream Analytics
13.2.4 Data Lake Analytics
13.2.5 Data Factory configuration files
13.2.6 SQL Database
13.3 Microsoft certifications
13.4 Signing off
Summary
Appendix A—Setting up Azure services through PowerShell
A.1 Setting up Azure PowerShell
A.2 Create a subscription
A.3 Azure naming conventions
A.4 Setting up common Azure resources using PowerShell
A.4.1 Creating a new resource group
A.4.2 Creating a new Azure Active Directory user
A.4.3 Creating a new Azure Active Directory group
A.5 Setting up Azure services using PowerShell
A.5.1 Creating a new Storage account
A.5.2 Creating a new Data Lake store
A.5.3 Creating a new Event Hub
A.5.4 Creating a new Stream Analytics job
A.5.5 Creating a new Data Lake Analytics account
A.5.6 Creating a new SQL Server and Database
A.5.7 Creating a new Data Factory service
A.5.8 Creating a new App registration
A.5.9 Creating a new key vault
A.5.10 Creating a new SQL Server and Database with lookup data
Appendix B—Configuring the Jonestown Sluggers analytics system
B.1 Solution design
B.1.1 Hot path
B.1.2 Cold path
B.2 Naming convention
B.3 Creation script
B.4 Configure Azure services using PowerShell
B.4.1 Stream Analytics Managed Identity
B.4.2 Data Lake store
B.4.3 Stream Analytics job configuration
B.4.4 SQL Database
B.4.5 Data Factory
B.5 Load event data
B.6 Output of batch and stream processing
B.7 Removing services
index