This accessible and classroom-tested textbook/reference presents an introduction to the fundamentals of the emerging and interdisciplinary field of data science. The coverage spans key concepts adopted from statistics and machine learning, useful techniques for graph analysis and parallel programming, and the practical application of data science for such tasks as building recommender systems or performing sentiment analysis.
Topics and features: provides numerous practical case studies using real-world data throughout the book; supports understanding through hands-on experience of solving data science problems using Python; describes techniques and tools for statistical analysis, machine learning, graph analysis, and parallel programming; reviews a range of applications of data science, including recommender systems and sentiment analysis of text data; provides supplementary code resources and data at an associated website.
Author(s): Laura Igual; Santi Seguí
Publisher: Springer
Year: 2017
Language: English
Pages: 220
Preface
Subject Area of the Book
Organization and Feature of the Book
Target Audiences
Previous Uses of the Materials
Suggested Uses of the Book
Supplemental Resources
Acknowledgements
Contents
Authors and Contributors
1 Introduction to Data Science
1.1 What is Data Science?
1.2 About This Book
2 Toolboxes for Data Scientists
2.1 Introduction
2.2 Why Python?
2.3 Fundamental Python Libraries for Data Scientists
2.3.1 Numeric and Scientific Computation: NumPy and SciPy
2.3.2 SCIKIT-Learn: Machine Learning in Python
2.3.3 PANDAS: Python Data Analysis Library
2.4 Data Science Ecosystem Installation
2.5 Integrated Development Environments (IDE)
2.5.1 Web Integrated Development Environment (WIDE): Jupyter
2.6 Get Started with Python for Data Scientists
2.6.1 Reading
2.6.2 Selecting Data
2.6.3 Filtering Data
2.6.4 Filtering Missing Values
2.6.5 Manipulating Data
2.6.6 Sorting
2.6.7 Grouping Data
2.6.8 Rearranging Data
2.6.9 Ranking Data
2.6.10 Plotting
2.7 Conclusions
3 Descriptive Statistics
3.1 Introduction
3.2 Data Preparation
3.2.1 The Adult Example
3.3 Exploratory Data Analysis
3.3.1 Summarizing the Data
3.3.2 Data Distributions
3.3.3 Outlier Treatment
3.3.4 Measuring Asymmetry: Skewness and Pearson's Median Skewness Coefficient
3.3.5 Continuous Distribution
3.3.6 Kernel Density
3.4 Estimation
3.4.1 Sample and Estimated Mean, Variance and Standard Scores
3.4.2 Covariance, and Pearson's and Spearman's Rank Correlation
3.5 Conclusions
4 Statistical Inference
4.1 Introduction
4.2 Statistical Inference: The Frequentist Approach
4.3 Measuring the Variability in Estimates
4.3.1 Point Estimates
4.3.2 Confidence Intervals
4.4 Hypothesis Testing
4.4.1 Testing Hypotheses Using Confidence Intervals
4.4.2 Testing Hypotheses Using p-Values
4.5 But Is the Effect E Real?
4.6 Conclusions
5 Supervised Learning
5.1 Introduction
5.2 The Problem
5.3 First Steps
5.4 What Is Learning?
5.5 Learning Curves
5.6 Training, Validation and Test
5.7 Two Learning Models
5.7.1 Generalities Concerning Learning Models
5.7.2 Support Vector Machines
5.7.3 Random Forest
5.8 Ending the Learning Process
5.9 A Toy Business Case
5.10 Conclusion
6 Regression Analysis
6.1 Introduction
6.2 Linear Regression
6.2.1 Simple Linear Regression
6.2.2 Multiple Linear Regression and Polynomial Regression
6.2.3 Sparse Model
6.3 Logistic Regression
6.4 Conclusions
7 Unsupervised Learning
7.1 Introduction
7.2 Clustering
7.2.1 Similarity and Distances
7.2.2 What Constitutes a Good Clustering? Defining Metrics to Measure Clustering Quality
7.2.3 Taxonomies of Clustering Techniques
7.3 Case Study
7.4 Conclusions
8 Network Analysis
8.1 Introduction
8.2 Basic Definitions in Graphs
8.3 Social Network Analysis
8.3.1 Basics in NetworkX
8.3.2 Practical Case: Facebook Dataset
8.4 Centrality
8.4.1 Drawing Centrality in Graphs
8.4.2 PageRank
8.5 Ego-Networks
8.6 Community Detection
8.7 Conclusions
9 Recommender Systems
9.1 Introduction
9.2 How Do Recommender Systems Work?
9.2.1 Content-Based Filtering
9.2.2 Collaborative Filtering
9.2.3 Hybrid Recommenders
9.3 Modeling User Preferences
9.4 Evaluating Recommenders
9.5 Practical Case
9.5.1 MovieLens Dataset
9.5.2 User-Based Collaborative Filtering
9.6 Conclusions
10 Statistical Natural Language Processing for Sentiment Analysis
10.1 Introduction
10.2 Data Cleaning
10.3 Text Representation
10.3.1 Bi-Grams and n-Grams
10.4 Practical Cases
10.5 Conclusions
11 Parallel Computing
11.1 Introduction
11.2 Architecture
11.2.1 Getting Started
11.2.2 Connecting to the Cluster (The Engines)
11.3 Multicore Programming
11.3.1 Direct View of Engines
11.3.2 Load-Balanced View of Engines
11.4 Distributed Computing
11.5 A Real Application: New York Taxi Trips
11.5.1 A Direct View Non-Blocking Proposal
11.5.2 Results
11.6 Conclusions
Index