If you’re like most R users, you have deep knowledge of and love for statistics. But as your organization continues to collect huge amounts of data, adding tools such as Apache Spark makes a lot of sense. With this practical book, data scientists and professionals working with large-scale data applications will learn how to use Spark from R to tackle big data and big compute problems.
Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to use R with Spark to solve different data analysis problems. This book covers relevant data science topics, cluster computing, and issues that should interest even the most advanced users.
- Analyze, explore, transform, and visualize data in Apache Spark with R
- Create statistical models to extract information and predict outcomes; automate the process in production-ready workflows
- Perform analysis and modeling across many machines using distributed computing techniques
- Use large-scale data from multiple sources and different formats with ease from within Spark
- Learn about alternative modeling frameworks for graph processing, geospatial analysis, and genomics at scale
- Dive into advanced topics including custom transformations, real-time data processing, and creating custom Spark extensions
Author(s): Javier Luraschi, Kevin Kuo, Edgar Ruiz
Edition: 1
Publisher: O'Reilly Media
Year: 2019
Language: English
Pages: 296
Tags: Spark; R; Analysis
Cover
Copyright
Table of Contents
Foreword
Preface
Formatting
Acknowledgments
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Chapter 1. Introduction
Overview
Hadoop
Spark
R
sparklyr
Recap
Chapter 2. Getting Started
Overview
Prerequisites
Installing sparklyr
Installing Spark
Connecting
Using Spark
Web Interface
Analysis
Modeling
Data
Extensions
Distributed R
Streaming
Logs
Disconnecting
Using RStudio
Resources
Recap
Chapter 3. Analysis
Overview
Import
Wrangle
Built-in Functions
Correlations
Visualize
Using ggplot2
Using dbplot
Model
Caching
Communicate
Recap
Chapter 4. Modeling
Overview
Exploratory Data Analysis
Feature Engineering
Supervised Learning
Generalized Linear Regression
Other Models
Unsupervised Learning
Data Preparation
Topic Modeling
Recap
Chapter 5. Pipelines
Overview
Creation
Use Cases
Hyperparameter Tuning
Operating Modes
Interoperability
Deployment
Batch Scoring
Real-Time Scoring
Recap
Chapter 6. Clusters
Overview
On-Premises
Managers
Distributions
Cloud
Amazon
Databricks
Google
IBM
Microsoft
Qubole
Kubernetes
Tools
RStudio
Jupyter
Livy
Recap
Chapter 7. Connections
Overview
Edge Nodes
Spark Home
Local
Standalone
YARN
YARN Client
YARN Cluster
Livy
Mesos
Kubernetes
Cloud
Batches
Tools
Multiple Connections
Troubleshooting
Logging
Spark Submit
Windows
Recap
Chapter 8. Data
Overview
Reading Data
Paths
Schema
Memory
Columns
Writing Data
Copying Data
File Formats
CSV
JSON
Parquet
Others
File Systems
Storage Systems
Hive
Cassandra
JDBC
Recap
Chapter 9. Tuning
Overview
Graph
Timeline
Configuring
Connect Settings
Submit Settings
Runtime Settings
sparklyr Settings
Partitioning
Implicit Partitions
Explicit Partitions
Caching
Checkpointing
Memory
Shuffling
Serialization
Configuration Files
Recap
Chapter 10. Extensions
Overview
H2O
Graphs
XGBoost
Deep Learning
Genomics
Spatial
Troubleshooting
Recap
Chapter 11. Distributed R
Overview
Use Cases
Custom Parsers
Partitioned Modeling
Grid Search
Web APIs
Simulations
Partitions
Grouping
Columns
Context
Functions
Packages
Cluster Requirements
Installing R
Apache Arrow
Troubleshooting
Worker Logs
Resolving Timeouts
Inspecting Partitions
Debugging Workers
Recap
Chapter 12. Streaming
Overview
Transformations
Analysis
Modeling
Pipelines
Distributed R
Kafka
Shiny
Recap
Chapter 13. Contributing
Overview
The Spark API
Spark Extensions
Using Scala Code
Recap
Appendix A. Supplemental Code References
Preface
Formatting
Chapter 1
The World’s Capacity to Store Information
Daily Downloads of CRAN Packages
Chapter 2
Prerequisites
Chapter 3
Hive Functions
Chapter 4
MLlib Functions
Chapter 6
Google Trends for On-Premises (Mainframes), Cloud Computing, and Kubernetes
Chapter 12
Stream Generator
Installing Kafka
Index
About the Authors
Colophon