Learning Apache Drill: Query and Analyze Distributed Data Sources with SQL

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3, and works with Hive metastores, distributed databases such as HBase and MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you’ll learn how Drill helps you analyze data more effectively to drive down time to insight.

  • Use Drill to clean, prepare, and summarize delimited data for further analysis
  • Query file types including logfiles, Parquet, JSON, and other complex formats
  • Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
  • Connect to Drill programmatically using a variety of languages
  • Use Drill even with challenging or ambiguous file formats
  • Perform sophisticated analysis by extending Drill’s functionality with user-defined functions
  • Facilitate data analysis for network security, image metadata, and machine learning

Author(s): Charles Givre, Paul Rogers
Edition: 1
Publisher: O'Reilly Media
Year: 2018

Language: English
Commentary: Revision History for the First Edition: 2018-10-29: First Release
Pages: 332
City: Sebastopol, CA
Tags: Apache Hadoop; Apache Drill; File Organization; Querying; SQL; Big Data

Cover
Copyright
Table of Contents
Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Online Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
Special Thanks from Charles
Special Thanks from Paul
Chapter 1. Introduction to Apache Drill
What Is Apache Drill?
Drill Is Versatile
Drill Is Easy to Use
A Word About Drill’s Performance
A Very Brief History of Big Data
Drill in the Big Data Ecosystem
Comparing Drill with Similar Tools
Chapter 2. Installing and Running Drill
Preparing Your Machine for Drill
Special Configuration Instructions for Windows Installations
Installing Drill on Windows
Starting Drill on a Windows Machine
Installing Drill in Embedded Mode on macOS or Linux
Starting Drill on macOS or Linux in Embedded Mode
Installing Drill in Distributed Mode on macOS or Linux
Preparing Your Cluster for Drill
Starting Drill in Distributed Mode
Connecting to the Cluster
Conclusion
Chapter 3. Overview of Apache Drill
The Apache Hadoop Ecosystem
Drill Is a Low-Latency Query Engine
Distributed Processing with HDFS
Elements of a Drill System
Drill Operation: The 30,000-Foot View
Drill Is a Query Engine, Not a Database
Drill Operation Overview
Drill Components
SQL Session State
Statement Preparation
Statement Execution
Low-Latency Features
Conclusion
Chapter 4. Querying Delimited Data
Ways of Querying Data with Drill
Other Interfaces
Drill SQL Query Format
Choosing a Data Source
Defining a Workspace
Specifying a Default Data Source
Accessing Columns in a Query
Delimited Data with Column Headers
Table Functions
Querying Directories
Understanding Drill Data Types
Cleaning and Preparing Data Using String Manipulation Functions
Complex Data Conversion Functions
Working with Dates and Times in Drill
Converting Strings to Dates
Reformatting Dates
Date Arithmetic and Manipulation
Date and Time Functions in Drill
Creating Views
Data Analysis Using Drill
Summarizing Data with Aggregate Functions
Common Problems in Querying Delimited Data
Spaces in Column Names
Illegal Characters in Column Headers
Reserved Words in Column Names
Conclusion
Chapter 5. Analyzing Complex and Nested Data
Arrays and Maps
Arrays in Drill
Accessing Maps (Key–Value Pairs) in Drill
Querying Nested Data
Analyzing Log Files with Drill
Configuring Drill to Read HTTPD Web Server Logs
Querying Web Server Logs
Other Log Analysis with Drill
Conclusion
Chapter 6. Connecting Drill to Data Sources
Querying Multiple Data Sources
Configuring a New Storage Plug-in
Connecting Drill to a Relational Database
Querying Data in Hadoop from Drill
Connecting to and Querying HBase from Drill
Querying Hive Data from Drill
Connecting to and Querying Streaming Data with Drill and Kafka
Connecting to and Querying Kudu
Connecting to and Querying MongoDB from Drill
Connecting Drill to Cloud Storage
Querying Time Series Data from Drill and OpenTSDB
Conclusion
Chapter 7. Connecting to Drill
Understanding Drill’s Interfaces
JDBC and Drill
ODBC and Drill
Drill’s REST Interface
Connecting to Drill with Python
Using drillpy to Query Drill
Connecting to Drill Using pydrill
Other Ways of Connecting to Drill from Python
Connecting to Drill Using R
Querying Drill from R Using sergeant
Connecting to Drill Using Java
Querying Drill with PHP
Using the Connector
Querying Drill from PHP
Interacting with Drill from PHP
Querying Drill Using Node.js
Using Drill as a Data Source in BI Tools
Exploring Data with Apache Zeppelin and Drill
Exploring Data with Apache Superset
Conclusion
Chapter 8. Data Engineering with Drill
Schema-on-Read
The SQL Relational Model
Data Life Cycle: Data Exploration to Production
Schema Inference
Data Source Inference
Storage Plug-ins
Storage Configurations
Workspaces
Querying Directories
Default Schema
File Type Inference
Format Plug-ins and Format Configuration
Format Inference
File Format Variations
Schema Inference Overview
Distributed File Scans
Schema Inference for Delimited Data
CSV Summary
Schema Inference for JSON
Ambiguous Numeric Schemas
Aligning Schemas Across Files
JSON Objects
JSON Lists in Drill
JSON Summary
Using Drill with the Parquet File Format
Schema Evolution in Parquet
Partitioning Data Directories
Defining a Table Workspace
Working with Queries in Production
Capturing Schema Mapping in Views
Running Challenging Queries in Scripts
Conclusion
Chapter 9. Deploying Drill in Production
Installing Drill
Prerequisites
Production Installation
Configuring ZooKeeper
Configuring Memory
Configuring Logging
Testing the Installation
Distributing Drill Binaries and Configuration
Starting the Drill Cluster
Configuring Storage
Working with Apache Hadoop HDFS
Working with Amazon S3
Admission Control
Additional Configuration
User-Defined Functions and Custom Plug-ins
Security
Logging Levels
Controlling CPU Usage
Monitoring
Monitoring the Drill Process
Monitoring JMX Metrics
Monitoring Queries
Other Deployment Options
MapR Installer
Drill-on-YARN
Docker
Conclusion
Chapter 10. Setting Up Your Development Environment
Installing Maven
Creating the Drill Build Environment
Setting Up Git and Getting the Source Code
Building Drill from Source
Installing the IDE
Conclusion
Chapter 11. Writing Drill User-Defined Functions
Use Case: Finding and Filtering Valid Credit Card Numbers
How User-Defined Functions Work in Drill
Structure of a Simple Drill UDF
The pom.xml File
The Function File
The Simple Function API
Putting It All Together
Building and Installing Your UDF
Statically Installing a UDF
Dynamically Installing a UDF
Complex Functions: UDFs That Return Maps or Arrays
Example: Extracting User Agent Metadata
The ComplexWriter
Writing Aggregate User-Defined Functions
The Aggregate Function API
Example Aggregate UDF: Kendall’s Rank Correlation Coefficient
Conclusion
Chapter 12. Writing a Format Plug-in
The Example Regex Format Plug-in
Creating the “Easy” Format Plug-in
Creating the Maven pom.xml File
Creating the Plug-in Package
Drill Module Configuration
Format Plug-in Configuration
Cautions Before Getting Started
Creating the Regex Plug-in Configuration Class
Copyright Headers and Code Format
Testing the Configuration
Fixing Configuration Problems
Troubleshooting
Creating the Format Plug-in Class
Creating a Test File
Configuring RAT
Efficient Debugging
Creating the Unit Test
How Drill Finds Your Plug-in
The Record Reader
Testing the Reader Shell
Logging
Error Handling
Setup
Regex Parsing
Defining Column Names
Projection
Column Projection Accounting
Project None
Project All
Project Some
Opening the File
Record Batches
Drill’s Columnar Structure
Defining Vectors
Reading Data
Loading Data into Vectors
Releasing Resources
Testing the Reader
Testing the Wildcard Case
Testing Explicit Projection
Testing Empty Projection
Scaling Up
Additional Details
File Chunks
Default Format Configuration
Next Steps
Production Build
Contributing to Drill: The Pull Request
Maintaining Your Branch
Create a Plug-In Project
Conclusion
Chapter 13. Unique Uses of Drill
Finding Photos Taken Within a Geographic Region
Drilling Excel Files
The pom.xml File
The Excel Custom Record Reader
Using the Excel Format Plug-in
Network Packet Analysis (PCAP) with Drill
Examples of Queries Using PCAP Data Files
Analyzing Twitter Data with Drill
Using Drill in a Machine Learning Pipeline
Making Predictions Within Drill
Building and Serializing a Model
Writing the UDF Wrapper
Making Predictions Using the UDF
Conclusion
Appendix A. List of Drill Functions
Aggregate and Window Functions
Window Functions
Cryptological and Hashing Functions
Data Conversion Functions
Geospatial Functions
Math and Trigonometric Functions
Networking Functions
Null Handling Functions
String Manipulation Functions
Approximate String Matching Functions
Phonetic Functions
String Distance Functions
Appendix B. Drill Formatting Strings
Index
About the Authors
Colophon