Data Science Concepts and Techniques with Applications

This textbook comprehensively covers both fundamental and advanced topics related to data science. Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines.

The chapters of this book are organized into three parts. The first part (Chapters 1 to 3) is a general introduction to data science: starting from the basic concepts, it highlights the types of data, their use and importance, and the issues normally faced in data analytics, followed by a presentation of a wide range of applications and widely used techniques in data science. The second part, which has been updated and considerably extended compared to the first edition, is devoted to the various techniques and tools applied in data science; its Chapters 4 to 10 detail data pre-processing, classification, clustering, text mining, deep learning, frequent pattern mining, and regression analysis. Finally, the third part (Chapters 11 and 12) presents a brief introduction to Python and R, the two main data science programming languages, and, in a completely new chapter, shows practical data science with WEKA (Waikato Environment for Knowledge Analysis), an open-source tool for performing different machine learning and data mining tasks. An appendix explaining the basic mathematical concepts of data science completes the book.

This textbook is suitable for advanced undergraduate and graduate students as well as for industrial practitioners who carry out research in data science. Both will benefit not only from the comprehensive presentation of important topics, but also from the many application examples and the extensive lists of further readings, which point to additional publications that provide more in-depth research results or more detailed descriptions of related topics.

"This book delivers a systematic, carefully thoughtful material on Data Science." from the Foreword by Witold Pedrycz, U Alberta, Canada.

Author(s): Usman Qamar, Muhammad Summair Raza
Edition: 2
Publisher: Springer
Year: 2023

Language: English
Pages: 491
City: Cham

Foreword
Preface
Organization of the Book
To the Instructor
To the Student
To the Professional
Contents
About the Authors
Chapter 1: Introduction
1.1 Data
1.2 Analytics
1.3 Big Data vs Small Data
1.4 Role of Data Analytics
1.5 Types of Data Analytics
1.6 Challenges of Data Analytics
1.6.1 Large Volumes of Data
1.6.2 Processing Real-Time Data
1.6.3 Visual Representation of Data
1.6.4 Data from Multiple Sources
1.6.5 Inaccessible Data
1.6.6 Poor-Quality Data
1.6.7 Higher Management Pressure
1.6.8 Lack of Support
1.6.9 Budget
1.6.10 Shortage of Skills
1.7 Top Tools in Data Analytics
1.8 Business Intelligence
1.9 Data Analytics vs Data Analysis
1.10 Data Analytics vs Data Visualization
1.11 Data Analyst vs Data Scientist
1.12 Data Analytics vs Business Intelligence
1.13 Data Analysis vs Data Mining
1.14 What Is ETL?
1.14.1 Extraction
1.14.2 Transformation
1.14.3 Loading
1.15 Data Science
Chapter 2: Applications of Data Science
2.1 Data Science Applications in Healthcare
2.2 Data Science Applications in Education
2.3 Data Science Applications in Manufacturing and Production
2.4 Data Science Applications in Sports
2.5 Data Science Applications in Cybersecurity
2.6 Data Science Applications in Airlines
Chapter 3: Widely Used Techniques in Data Science Applications
3.1 Supervised Learning
3.2 Unsupervised Learning
3.3 Reinforcement Learning
3.4 AB Testing
3.4.1 AB Test Planning
3.4.2 AB Testing Can Make a Huge Difference
3.4.3 What You Can Test
3.4.4 For How Long You Should Test
3.5 Association Rules
3.5.1 Support
3.5.2 Confidence
3.5.3 Lift
3.6 Decision Tree
3.6.1 How to Draw a Decision Tree
3.6.2 Advantages and Disadvantages
3.6.3 Decision Trees in Machine Learning and Data Mining
3.6.4 Advantages and Disadvantages
3.7 Cluster Analysis
3.7.1 Different Approaches of Cluster Analysis
3.7.2 Types of Data and Measures of Distance
3.7.2.1 Euclidean Distance
3.7.2.2 Hierarchical Agglomerative Methods
3.7.2.3 Nearest Neighbor Method (Single Linkage Method)
3.7.2.4 Furthest Neighbor Method (Complete Linkage Method)
3.7.2.5 Average (Between Groups) Linkage Method (Sometimes Referred to as UPGMA)
3.7.2.6 Centroid Method
3.7.3 Selecting the Optimum Number of Clusters
3.8 Advantages and Disadvantages of Clustering
3.9 Pattern Recognition
3.9.1 Pattern Recognition Process
3.9.2 Training and Test Datasets
3.9.3 Applications of Pattern Recognition
3.10 Summary
Chapter 4: Data Preprocessing
4.1 Feature
4.1.1 Numerical
4.1.2 Categorical Features
4.2 Feature Selection
4.2.1 Supervised Feature Selection
4.2.2 Unsupervised Feature Selection
4.3 Feature Selection Methods
4.3.1 Transformation-Based Reduction
4.3.1.1 Principal Component Analysis
4.3.1.2 Classical Multidimensional Scaling
4.3.1.3 Locally Linear Embedding
4.3.1.4 Isomap
4.3.2 Selection-Based Reduction
4.3.2.1 Filter-Based Methods
4.3.2.2 Wrapper Methods
4.3.2.3 Embedded Methods
4.4 Objective of Feature Selection
4.5 Feature Selection Criteria
4.5.1 Information Gain
4.5.2 Distance
4.5.3 Dependency
4.5.4 Consistency
4.5.5 Classification Accuracy
4.6 Feature Generation Schemes
4.7 Rough Set Theory
4.7.1 Basic Concepts of Rough Set Theory
4.7.1.1 Information System
4.7.1.2 Indiscernibility
4.7.1.3 Lower and Upper Approximations
4.7.1.4 Boundary Region
4.7.1.5 Dependency
4.7.1.6 Reduct Set
4.7.1.7 Discernibility Matrix
4.7.2 Rough Set-Based Feature Selection Techniques
4.7.2.1 Quick Reduct
4.7.2.2 Rough Set-Based Genetic Algorithm
4.7.2.3 Incremental Feature Selection Algorithm
4.7.2.4 Fish Swarm Algorithm
4.7.2.5 Feature Selection Approach Using Random Feature Vector
4.7.3 Dominance-Based Rough Set Approach
4.7.3.1 Decision System
4.7.3.2 Dominance Relation
4.7.3.3 Upward and Downward Union of Classes
4.7.3.4 Lower and Upper Approximations
4.7.3.5 Applications of Dominance-Based Rough Set Theory
4.7.4 Comparison of Feature Selection Techniques
4.8 Miscellaneous Concepts
4.8.1 Feature Relevancy
4.8.2 Feature Redundancy
4.8.3 Applications of Feature Selection
4.8.3.1 Text Mining
4.8.3.2 Intrusion Detection
4.8.3.3 Information Systems
4.9 Feature Selection: Issues
4.9.1 Scalability
4.9.2 Stability
4.9.3 Linked Data
4.10 Different Types of Feature Selection Algorithms
4.11 Genetic Algorithm
4.12 Feature Engineering
4.12.1 Feature Encoding
4.12.1.1 One-Hot Encoding
4.12.1.2 Label Encoding and Ordinal Encoding
4.12.1.3 Frequency Encoding
4.12.1.4 Target Encoding
4.13 Binning
4.13.1 Equal Width Bins
4.13.2 Equal Frequency Bins
4.13.3 Smoothing by Bin Means
4.13.4 Smoothing by Bin Boundaries
4.14 Remove Missing Values
4.14.1 Remove the Objects Containing Missing Values
4.14.2 Remove Using the Default Values
4.14.3 Remove Missing Values Using Mean/Mode
4.14.4 Remove Missing Values by Using the Closest Distance
4.15 Proximity Measures
4.16 Dissimilarity Matrix
4.16.1 Manhattan Distance
4.16.2 Euclidean Distance
4.16.3 Supremum Distance
4.17 Summary
Chapter 5: Classification
5.1 Classification
5.2 Decision Tree
5.2.1 Design Issues of Decision Tree Induction
5.2.2 Model Overfitting
5.2.3 Entropy
5.2.4 Information Gain
5.3 Regression Analysis
5.4 Support Vector Machines
5.5 Naïve Bayes
5.6 Artificial Neural Networks
5.7 K Nearest Neighbors (KNN)
5.8 Ensembles
5.8.1 Assembling an Ensemble of Classifiers
5.8.2 Majority Voting Ensemble
5.8.3 Boosting/Weighted Voting Ensemble
5.8.4 Bagging
5.9 Methods for Model Evaluation
5.9.1 Holdout Method
5.9.2 Random Subsampling
5.9.3 Cross-Validation
5.10 Summary
Chapter 6: Clustering
6.1 Cluster Analysis
6.2 Types of Clusters
6.2.1 Hierarchical Clusters
6.2.2 Partitioned Clusters
6.2.3 Other Cluster Types
6.2.3.1 Well Separated
6.2.3.2 Prototype-Based
6.2.3.3 Contiguous Clusters
6.2.3.4 Density-Based
6.2.3.5 Shared Property
6.3 K-Means
6.3.1 Centroids and Object Assignment
6.3.2 Centroids and Objective Function
6.4 Reducing the SSE with Post-processing
6.4.1 Split a Cluster
6.4.2 Introduce a New Cluster Centroid
6.4.3 Disperse a Cluster
6.4.4 Merge Two Clusters
6.5 Bisecting K-Means
6.6 Agglomerative Hierarchical Clustering
6.7 DBSCAN Clustering Algorithm
6.8 Cluster Evaluation
6.9 General Characteristics of Clustering Algorithms
Chapter 7: Text Mining
7.1 Text Mining
7.2 Text Mining Applications
7.2.1 Exploratory Text Analysis
7.2.2 Information Extraction
7.2.3 Automatic Text Classification
7.3 Text Classification Types
7.3.1 Single-Label Text Classification
7.3.2 Multi-label Text Classification
7.3.3 Binary Text Classification
7.4 Text Classification Approaches
7.4.1 Rule-Based
7.4.2 Machine Learning
7.5 Text Classification Applications
7.5.1 Document Organization
7.5.2 Text Filtering
7.5.3 Computational Linguistics
7.5.4 Hierarchical Web Page Categorization
7.6 Representation of Textual Data
7.6.1 Bag of Words
7.6.2 Term Frequency
7.6.3 Inverse Document Frequency
7.6.4 Term Frequency-Inverse Document Frequency
7.6.5 N-Gram
7.7 Natural Language Processing
7.7.1 Sentence Segmentation
7.7.2 Tokenization
7.7.3 Part of Speech Tagging
7.7.4 Named Entity Recognition
7.8 Case Study of Textual Analysis Using NLP
Chapter 8: Deep Learning
8.1 Applications of Deep Learning
8.1.1 Self-Driving Cars
8.1.2 Fake News Detection
8.1.3 Natural Language Processing
8.1.4 Virtual Assistants
8.1.5 Healthcare
8.2 Artificial Neurons
8.3 Activation Functions
8.3.1 Sigmoid
8.3.2 Softmax Probabilistic Function
8.3.3 Rectified Linear Unit (ReLU)
8.3.4 Leaky ReLU
8.4 How Neural Networks Learn
8.4.1 Initialization
8.4.2 Feed Forward
8.4.3 Error Calculation
8.4.4 Propagation
8.4.5 Adjustment
8.5 A Simple Example
8.6 Deep Neural Network
8.6.1 Selection of Number of Neurons
8.6.2 Selection of Number of Layers
8.6.3 Dropping the Neurons
8.6.4 Mini Batching
8.7 Convolutional Neural Networks
8.7.1 Convolution in n-Dimensions
8.7.2 Learning Lower-Dimensional Representations
8.8 Recurrent Neural Networks (RNN)
8.8.1 Long Short-Term Memory Models
8.8.1.1 Forget Gate
8.8.1.2 Input Gate
8.8.1.3 Output Gate
8.8.2 Encoder and Decoder
8.8.2.1 Encoder
8.8.2.2 Decoder
8.9 Limitations of RNN
8.10 Difference Between CNN and RNN
8.11 Elman Neural Networks
8.12 Jordan Neural Networks
8.12.1 Wind Speed Forecasting
8.12.2 Classification of Protein-Protein Interaction
8.12.3 Classification of English Characters
8.13 Autoencoders
8.13.1 Architecture of Autoencoders
8.13.2 Applications of Autoencoders
8.14 Training a Deep Learning Neural Network
8.14.1 Training Data
8.14.2 Choose Appropriate Activation Functions
8.14.3 Number of Hidden Units and Layers
8.14.4 Weight Initialization
8.14.5 Learning Rates
8.14.6 Learning Methods
8.14.7 Keep Dimensions of Weights in the Exponential Power of 2
8.14.8 Mini-batch vs Stochastic Learning
8.14.9 Shuffling Training Examples
8.14.10 Number of Epochs/Training Iterations
8.14.11 Use Libraries with GPU and Automatic Differentiation Support
8.15 Challenges in Deep Learning
8.15.1 Lots and Lots of Data
8.15.2 Overfitting in Neural Networks
8.15.3 Hyperparameter Optimization
8.15.4 Requires High-Performance Hardware
8.15.5 Neural Networks Are Essentially a Black Box
8.16 Some Important Deep Learning Libraries
8.16.1 NumPy
8.16.2 SciPy
8.16.3 scikit-learn
8.16.4 Theano
8.16.5 TensorFlow
8.16.6 Keras
8.16.7 PyTorch
8.16.8 pandas
8.16.9 matplotlib
8.16.10 Scrapy
8.16.11 Seaborn
8.16.12 PyCaret
8.16.13 OpenCV
8.16.14 Caffe
8.17 First Practical Neural Network
8.17.1 Input Data
8.17.2 Construction of the Model
8.17.3 Configure the Model
8.17.4 Training and Testing the Model
8.17.5 Alternate Method
Chapter 9: Frequent Pattern Mining
9.1 Basic Concepts
9.1.1 Market Basket Example
9.1.2 Association Rule
9.1.3 Lift
9.1.4 Binary Representation
9.2 Association Rules
9.2.1 Support and Confidence
9.2.2 Null Transactions
9.2.3 Negative Association Rules
9.2.4 Multilevel Association Rules
9.2.5 Approaches to Multilevel Association Rule Mining
9.2.5.1 Uniform Minimum Support
9.2.5.2 Reduced Minimum Support
9.2.5.3 Checking for Redundant Multilevel Association Rules
9.2.6 Multidimensional Association Rules
9.2.7 Mining Quantitative Association Rules
9.2.8 Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes
9.3 Frequent Pattern Mining Methods
9.3.1 Apriori Algorithm
9.3.2 Fuzzy Apriori Algorithm
9.3.3 FP-Tree
9.3.4 FP-Growth: Generation of Frequent Itemsets in FP-Growth Algorithm
9.3.5 ECLAT
9.4 Pattern Evaluation Methods
9.5 Applications of Frequent Pattern Mining
9.5.1 Frequent Patterns for Consumer Analysis
9.5.2 Frequent Patterns for Clustering
9.5.3 Frequent Patterns for Classification
9.5.4 Frequent Patterns for Outlier Analysis
9.5.5 Web Mining Applications
9.5.6 Temporal Applications
9.5.7 Spatial and Spatiotemporal Applications
9.5.8 Software Bug Detection
9.5.9 Chemical and Biological Applications
9.5.10 Frequent Pattern Mining in Indexing
Chapter 10: Regression Analysis
10.1 Regression for Data Science
10.1.1 Basic Concepts
10.1.1.1 Linear Regression (Introduction)
10.1.1.2 Example of Linear Regression
10.1.2 Multiple Regression
10.1.2.1 Example of Multiple Regression
10.1.2.2 Multiple Regression Equation
10.1.2.3 Use of Multiple Regression
10.1.3 Polynomial Regression
10.1.3.1 Types of Polynomial Regression
10.1.3.2 Use Case for Polynomial Regression
10.1.3.3 Overfitting vs Under-fitting
10.1.3.4 Choice of Right Degree
10.1.3.5 Loss Function
10.2 Logistic Regression
10.2.1 Logistic Regression Importance
10.2.2 Logistic Regression Assumptions
10.2.3 Use Cases of Logistic Regression
10.2.4 Difference Between Linear and Logistic Regression
10.2.5 Probability-Based Approach
10.2.5.1 General Principle
10.2.5.2 Logistic Regression Odds Ratio
10.2.6 Multiclass Logistic Regression
10.2.6.1 Workflow of Multiclass Logistic Regression
10.3 Generalization
10.3.1 Has the Model Learned All?
10.3.2 Validation Loss
10.3.3 Bias
10.3.4 Variance
10.3.5 Overfitting
10.3.6 Regularization
10.4 Advanced Regression Methods
10.4.1 Bayesian Regression
10.4.2 Regression Trees
10.4.3 Bagging and Boosting
10.4.3.1 Bagging
10.4.3.2 Example of Bagging
10.4.3.3 Boosting
10.4.3.4 Example of Boosting
10.4.3.5 Comparison of Bagging and Boosting
10.5 Real-World Applications for Regression Models
10.5.1 Imbalanced Classification
10.5.2 Ranking Problem
10.5.3 Time Series Problem
10.5.3.1 Time Series Analysis
10.5.3.2 Examples of Time Series Analysis
10.5.3.3 Machine Learning and Time Series Analysis
10.5.3.4 Time Series Forecasting Using Machine Learning Methods
Chapter 11: Data Science Programming Language
11.1 Python
11.1.1 Python Reserved Words
11.1.2 Lines and Indentation
11.1.3 Multi-line Statements
11.1.4 Quotations in Python
11.1.5 Comments in Python
11.1.6 Multi-line Comments
11.1.7 Variables in Python
11.1.8 Standard Data Types in Python
11.1.9 Python Numbers
11.1.10 Python Strings
11.1.11 Python Lists
11.1.12 Python Tuples
11.1.13 Python Dictionary
11.1.14 If Statement
11.1.15 If-Else Statement
11.1.16 elif Statement
11.1.17 Iterations or Loops in Python
11.1.18 While Loop
11.1.19 For Loop
11.1.20 Using Else with Loop
11.1.21 Nested Loop
11.1.22 Function in Python
11.1.23 User-Defined Function
11.1.24 Pass by Reference vs Value
11.1.25 Function Arguments
11.1.26 Required Arguments
11.1.27 Keyword Arguments
11.1.28 Default Arguments
11.1.29 Variable-Length Arguments
11.1.30 The Return Statement
11.2 Python IDLE
11.3 R Programming Language
11.3.1 Our First Program
11.3.2 Comments in R
11.3.3 Data Types in R
11.3.4 Vectors in R
11.3.5 Lists in R
11.3.6 Matrices in R
11.3.7 Arrays in R
11.3.8 Factors in R
11.3.9 Data Frames in R
11.3.10 Decision-Making in R
11.3.11 If Statement in R
11.3.12 If-Else in R
11.3.13 Nested If-Else in R
11.3.14 Loops in R
11.3.15 Repeat Loop in R
11.3.16 While Loop in R
11.3.17 For Loop in R
11.3.18 Break Statement in R
11.3.19 Functions in R
11.3.20 Function Without Arguments
11.3.21 Function with Arguments
Chapter 12: Practical Data Science with WEKA
12.1 Installation
12.2 Loading the Data
12.3 Applying Filters
12.4 Classifier
12.5 Cluster
12.6 Association Rule Mining in WEKA
12.7 Attribute Selection
12.8 The Experimenter Interface
12.9 WEKA KnowledgeFlow
12.10 WEKA Workbench
12.11 WEKA Simple CLI
12.11.1 Java Command
12.11.2 Cls Command
12.11.3 Echo Command
12.11.4 Exit Command
12.11.5 History Command
12.11.6 Kill
12.11.7 Script Command
12.11.8 Set Command
12.11.9 Unset Command
12.11.10 Help Command
12.11.11 Capabilities Command
12.12 ArffViewer
12.13 WEKA SqlViewer
12.14 Bayesian Network Editor
Appendix: Mathematical Concepts for Deep Learning
A.1 Vectors
A.1.1 Vector
A.1.2 Vector Addition
A.1.3 Vector Subtraction
A.1.4 Vector Multiplication by Scalar Value
A.1.5 Zero Vector
A.1.6 Unit Vector
A.1.7 Vector Transpose
A.1.8 Inner Product of Vectors
A.1.9 Orthogonal Vectors
A.1.10 Distance Between Vectors
A.1.11 Outer Product
A.1.12 Angles Between the Vectors
A.2 Matrix
A.2.1 Matrix Addition
A.2.2 Matrix Subtraction
A.2.3 Matrix Multiplication by Scalar Value
A.2.4 Unit Matrix
A.2.5 Matrix Multiplication
A.2.6 Zero Matrix
A.2.7 Diagonal Matrix
A.2.8 Triangular Matrix
A.2.9 Vector Representation of Matrix
A.2.10 Determinant
A.3 Probability
A.3.1 Experiment
A.3.2 Event
A.3.3 Mutually Exclusive Events
A.3.4 Union of Probabilities
A.3.5 Complement of Probability
A.3.6 Conditional Probability
A.3.7 Intersection of Probabilities
A.3.8 Probability Tree
A.3.9 Probability Axioms
Glossary