Successfully navigating the data-driven economy presupposes an understanding of the technologies and methods used to gain insights from Big Data. This book helps data science practitioners manage the transition to Big Data. Building on familiar content from applied econometrics and business analytics, it introduces the reader to the basic concepts of Big Data Analytics. The focus is on how to productively apply econometric and machine learning techniques to large, complex data sets, as well as on the steps that precede the analysis itself (data storage, data import, data preparation). The book combines conceptual and theoretical material with the practical application of the concepts using R and SQL. The reader thus acquires the skills to analyze large data sets, both locally and in the cloud. Code examples and tutorials, focused on empirical economic and business research, illustrate practical techniques for handling and analyzing Big Data.

Key Features:
- Includes many code examples in R and SQL, with R/SQL scripts freely provided online.
- Makes extensive use of real datasets from empirical economic research and business analytics, with data files freely provided online.
- Leads students and practitioners to think critically about where the bottlenecks lie in practical data analysis tasks with large data sets, and how to address them.

The book is a valuable resource for data science practitioners, graduate students, and researchers who aim to gain insights from Big Data in the context of research questions in business, economics, and the social sciences.
Author: Matter, Ulrich
Publisher: CRC Press LLC
Year: 2023
Language: English
Pages: 328
Preface
I Setting the Scene: Analyzing Big Data
Introduction
1 What is Big in “Big Data”?
2 Approaches to Analyzing Big Data
3 The Two Domains of Big Data Analytics
3.1 A practical big P problem
3.1.1 Simple logistic regression (naive approach)
3.1.2 Regularization: the lasso estimator
3.2 A practical big N problem
3.2.1 OLS as a point of reference
3.2.2 The Uluru algorithm as an alternative to OLS
II Platform: Software and Computing Resources
Introduction
4 Software: Programming with (Big) Data
4.1 Domains of programming with (big) data
4.2 Measuring R performance
4.3 Writing efficient R code
4.3.1 Memory allocation and growing objects
4.3.2 Vectorization in basic R functions
4.3.3 apply-type functions and vectorization
4.3.4 Avoiding unnecessary copying
4.3.5 Releasing memory
4.3.6 Beyond R
4.4 SQL basics
4.4.1 First steps in SQL(ite)
4.4.2 Joins
4.5 With a little help from my friends: GPT and R/SQL coding
4.6 Wrapping up
5 Hardware: Computing Resources
5.1 Mass storage
5.1.1 Avoiding redundancies
5.1.2 Data compression
5.2 Random access memory (RAM)
5.3 Combining RAM and hard disk: Virtual memory
5.4 CPU and parallelization
5.4.1 Naive multi-session approach
5.4.2 Multi-session approach with futures
5.4.3 Multi-core and multi-node approach
5.5 GPUs for scientific computing
5.5.1 GPUs in R
5.6 The road ahead: Hardware made for machine learning
5.7 Wrapping up
5.8 Still have insufficient computing resources?
6 Distributed Systems
6.1 MapReduce
6.2 Apache Hadoop
6.2.1 Hadoop word count example
6.3 Apache Spark
6.4 Spark with R
6.4.1 Data import and summary statistics
6.5 Spark with SQL
6.6 Spark with R + SQL
6.7 Wrapping up
7 Cloud Computing
7.1 Cloud computing basics and platforms
7.2 Transitioning to the cloud
7.3 Scaling up in the cloud: Virtual servers
7.3.1 Parallelization with an EC2 instance
7.4 Scaling up with GPUs
7.4.1 GPUs on Google Colab
7.4.2 RStudio and EC2 with GPUs on AWS
7.5 Scaling out: MapReduce in the cloud
7.6 Wrapping up
III Components of Big Data Analytics
Introduction
8 Data Collection and Data Storage
8.1 Gathering and compilation of raw data
8.2 Stack/combine raw source files
8.3 Efficient local data storage
8.3.1 RDBMS basics
8.3.2 Efficient data access: Indices and joins in SQLite
8.4 Connecting R to an RDBMS
8.4.1 Creating a new database with RSQLite
8.4.2 Importing data
8.4.3 Issuing queries
8.5 Cloud solutions for (big) data storage
8.5.1 Easy-to-use RDBMS in the cloud: AWS RDS
8.6 Column-based analytics databases
8.6.1 Installation and start up
8.6.2 First steps via Druid's GUI
8.6.3 Query Druid from R
8.7 Data warehouses
8.7.1 Data warehouse for analytics: Google BigQuery example
8.8 Data lakes and simple storage service
8.8.1 AWS S3 with R: First steps
8.8.2 Uploading data to S3
8.8.3 More than just simple storage: S3 + Amazon Athena
8.9 Wrapping up
9 Big Data Cleaning and Transformation
9.1 Out-of-memory strategies and lazy evaluation: Practical basics
9.1.1 Chunking data with the ff package
9.1.2 Memory mapping with bigmemory
9.1.3 Connecting to Apache Arrow
9.2 Big Data preparation tutorial with ff
9.2.1 Set up
9.2.2 Data import
9.2.3 Inspect imported files
9.2.4 Data cleaning and transformation
9.2.5 Inspect difference in in-memory operation
9.2.6 Subsetting
9.2.7 Save/load/export ff files
9.3 Big Data preparation tutorial with arrow
9.4 Wrapping up
10 Descriptive Statistics and Aggregation
10.1 Data aggregation: The ‘split-apply-combine’ strategy
10.2 Data aggregation with chunked data files
10.3 High-speed in-memory data aggregation with arrow
10.4 High-speed in-memory data aggregation with data.table
10.5 Wrapping up
11 (Big) Data Visualization
11.1 Challenges of Big Data visualization
11.2 Data exploration with ggplot2
11.3 Visualizing time and space
11.3.1 Preparations
11.3.2 Pick-up and drop-off locations
11.4 Wrapping up
IV Application: Topics in Big Data Econometrics
Introduction
12 Bottlenecks in Everyday Data Analytics Tasks
12.1 Case study: Efficient fixed effects estimation
12.2 Case study: Loops, memory, and vectorization
12.2.1 Naïve approach (ignorant of R)
12.2.2 Improvement 1: Pre-allocation of memory
12.2.3 Improvement 2: Exploit vectorization
12.3 Case study: Bootstrapping and parallel processing
12.3.1 Parallelization with an EC2 instance
13 Econometrics with GPUs
13.1 OLS on GPUs
13.2 A word of caution
13.3 Higher-level interfaces for basic econometrics with GPUs
13.4 TensorFlow/Keras example: Predict housing prices
13.4.1 Data preparation
13.4.2 Model specification
13.4.3 Training and prediction
13.5 Wrapping up
14 Regression Analysis and Categorization with Spark and R
14.1 Simple linear regression analysis
14.2 Machine learning for classification
14.3 Building machine learning pipelines with R and Spark
14.3.1 Set up and data import
14.3.2 Building the pipeline
14.4 Wrapping up
15 Large-scale Text Analysis with sparklyr
15.1 Getting started: Import, pre-processing, and word count
15.2 Tutorial: Political slant
15.2.1 Data download and import
15.2.2 Cleaning speeches data
15.2.3 Create a bigram count per party
15.2.4 Find “partisan” phrases
15.2.5 Results: Most partisan phrases by congress
15.3 Natural Language Processing at Scale
15.3.1 Preparatory steps
15.3.2 Sentiment annotation
15.4 Aggregation and visualization
15.5 sparklyr and lazy evaluation
V Appendices
Appendix A: GitHub
Appendix B: R Basics
Appendix C: Install Hadoop
VI References and Index
Bibliography
Index