This textbook provides an easy-to-understand introduction to the mathematical concepts and algorithms at the foundation of Data Science. It covers essential parts of data organization, descriptive and inferential statistics, probability theory, and Machine Learning. These topics are presented in a clear and mathematically sound way to help readers gain a deep and fundamental understanding. Numerous application examples based on real data are included. The book is well-suited for lecturers and students at technical universities, and offers a good introduction and overview for people who are new to the subject. Basic mathematical knowledge of calculus and linear algebra is required.
In this chapter, we will deal with supervised machine learning. Supervised methods are based on the statistical evaluation of a sample in which each observation comes with an already known label, the label that the algorithm is ultimately supposed to predict for yet unseen data. That sample is called the training dataset. Taking image classification as an example, a training dataset would consist of a (large) number of photographs, each of which has been (manually) annotated with one of the labels: landscape, portrait, etc. Ideally, the learning algorithm is then able to recognize patterns that characterize and distinguish between landscape and portrait photographs. More concretely, these patterns are statistical variations of features; for digital photographs, the raw features are given by the color values of each pixel. From these statistical patterns, rules are generated that can categorize new photos not contained in the training dataset. These rules are not explicitly specified by the programmer but are “learned” by the machine on the basis of the training dataset.
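To make this workflow concrete, here is a minimal sketch in Python, assuming scikit-learn is available (the book itself does not prescribe a library). The two features and the toy data are purely hypothetical stand-ins for pixel statistics, and the K-nearest neighbors classifier (cf. Section 6.3.2) stands in for any supervised learning algorithm.

    # Minimal sketch of the supervised learning workflow described above.
    # Assumptions: scikit-learn is installed; each "photograph" is reduced
    # to two hypothetical summary features (e.g., mean sky-blue value,
    # mean skin-tone value) instead of raw per-pixel color values.
    from sklearn.neighbors import KNeighborsClassifier

    # Toy training dataset: feature vectors with manually assigned labels,
    # playing the role of the annotated photographs.
    X_train = [
        [0.9, 0.1],  # much sky, little skin tone -> landscape
        [0.8, 0.2],
        [0.2, 0.8],  # little sky, much skin tone -> portrait
        [0.1, 0.9],
    ]
    y_train = ["landscape", "landscape", "portrait", "portrait"]

    # "Learning": the classifier extracts statistical patterns from the
    # labeled sample; no classification rules are written by hand.
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)

    # Prediction for a new observation not contained in the training data.
    print(model.predict([[0.85, 0.15]]))  # expected output: ['landscape']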
Author(s): Matthias Plaue
Publisher: Springer
Year: 2023
Language: English
Pages: 372
Preface
Preface to the original German edition
Preface to the revised, English edition
Contents
Notation
List of Figures
List of Tables
Introduction
References
Part I Basics
1 Elements of data organization
1.1 Conceptual data models
1.1.1 Entity–relationship models
1.2 Logical data models
1.2.1 Relational data models
1.2.2 Graph-based data models
1.2.3 Hierarchical data models
1.3 Data quality
1.3.1 Data quality dimensions
1.4 Data cleaning
1.4.1 Validation
1.4.2 Standardization
1.4.3 Imputation
1.4.3.1 Imputation with measures of central tendency
1.4.3.2 Imputation via regression and classification
1.4.4 Augmentation
1.4.5 Deduplication
1.4.5.1 Distance and similarity measures for strings
References
2 Descriptive statistics
2.1 Samples
2.2 Statistical charts
2.2.1 Bar charts and histograms
2.2.2 Scatter plots
2.2.3 Pie charts, grouped and stacked bar charts, heatmaps
2.2.3.1 Pie chart
2.2.3.2 Grouped and stacked bar charts
2.2.3.3 Heatmap
2.3 Measures of central tendency
2.3.1 Arithmetic mean and sample median
2.3.2 Sample quantiles
2.3.3 Geometric and harmonic mean
2.4 Measures of variation
2.4.1 Deviation around the mean or the median
2.4.2 Shannon index
2.5 Measures of association
2.5.1 Sample covariance and Pearson’s correlation coefficient
2.5.2 Rank correlation coefficients
2.5.3 Sample mutual information and Jaccard index
References
Part II Stochastics
3 Probability theory
3.1 Probability measures
3.1.1 Conditional probability
3.1.2 Bayes’ theorem
3.2 Random variables
3.2.1 Discrete and continuous random variables
3.2.2 Probability mass and density functions
3.2.2.1 Probability mass functions of discrete random variables
3.2.2.2 Probability density functions of continuous random variables
3.2.3 Transformations of random variables
3.2.3.1 Transformations of discrete random variables
3.2.3.2 Transformations of continuous random variables
3.3 Joint distribution of random variables
3.3.1 Joint probability mass and density functions
3.3.2 Conditional probability mass and density functions
3.3.3 Independent random variables
3.4 Characteristic measures of random variables
3.4.1 Median, expected value, and variance
3.4.2 Covariance and correlation
3.4.3 Chebyshev’s inequality
3.5 Sums and products of random variables
3.5.1 Chi-squared and Student’s t-distribution
References
4 Inferential statistics
4.1 Statistical models
4.1.1 Models of discrete random variables
4.1.2 Models of continuous random variables
4.2 Laws of large numbers
4.2.1 Bernoulli’s law of large numbers
4.2.2 Chebyshev’s law of large numbers
4.2.3 Variance estimation and Bessel correction
4.2.4 Lindeberg–Lévy central limit theorem
4.3 Interval estimation and hypothesis testing
4.3.1 Interval estimation
4.3.2 Z-test
4.3.3 Student’s t-test
4.3.4 Effect size
4.4 Parameter and density estimation
4.4.1 Maximum likelihood estimation
4.4.1.1 Power transforms
4.4.2 Bayesian parameter estimation
4.4.3 Kernel density estimation
4.5 Regression analysis
4.5.1 Simple linear regression
4.5.2 Theil–Sen regression
4.5.3 Simple logistic regression
References
5 Multivariate statistics
5.1 Data matrices
5.2 Distance and similarity measures
5.2.1 Distance and similarity measures for numeric variables
5.2.2 Distance and similarity measures for categorical variables
5.2.3 Distance and similarity matrices
5.3 Multivariate measures of central tendency and variation
5.3.1 Centroid and geometric median, medoid
5.3.2 Sample covariance and correlation matrix
5.4 Random vectors and matrices
5.4.1 Expectation vector and covariance matrix
5.4.2 Multivariate normal distributions
5.4.3 Multinomial distributions
References
Part III Machine learning
6 Supervised machine learning
6.1 Elements of supervised learning
6.1.1 Loss functions and empirical risk minimization
6.1.2 Overfitting and underfitting
6.1.2.1 Regularization
6.1.3 Training, model validation, and testing
6.1.3.1 Performance measures for regression
6.1.3.2 Performance measures for binary classification
6.1.4 Numerical optimization
6.2 Regression algorithms
6.2.1 Linear regression
6.2.1.1 Moore–Penrose inverse
6.2.2 Gaussian process regression
6.3 Classification algorithms
6.3.1 Logistic regression
6.3.1.1 Kernel logistic regression
6.3.2 K-nearest neighbors classification
6.3.3 Bayesian classification algorithms
6.3.3.1 Naive Bayes classification
6.3.3.2 Multinomial event model
6.3.3.3 Bernoulli event model
6.4 Artificial neural networks
6.4.1 Regression and classification with neural networks
6.4.2 Training neural networks by backpropagation of error
6.4.2.1 Dropout
6.4.3 Convolutional neural networks
References
7 Unsupervised machine learning
7.1 Elements of unsupervised learning
7.1.1 Intrinsic dimensionality of data
7.1.2 Topological characteristics of data
7.2 Dimensionality reduction
7.2.1 Principal component analysis
7.2.2 Autoencoders
7.2.3 Multidimensional scaling
7.2.4 t-distributed stochastic neighbor embedding (t-SNE)
7.3 Cluster analysis
7.3.1 K-means algorithm
7.3.1.1 Kernel K-means algorithm
7.3.2 Hierarchical cluster analysis
References
8 Applications of machine learning
8.1 Supervised learning in practice
8.1.1 MNIST: handwritten text recognition
8.1.2 CIFAR-10: object recognition
8.1.3 Large Movie Review Dataset: sentiment analysis
8.2 Unsupervised learning in practice
8.2.1 Text mining: topic modelling
8.2.2 Network analysis: community structure
References
Appendix
A Exercises with answers
A.1 Exercises
A.2 Answers
References
B Mathematical preliminaries
B.1 Basic concepts
B.1.1 Numbers and sets
B.1.2 Maps and functions
B.1.3 Families, sequences and tuples
B.1.4 Minimum/maximum and infimum/supremum
B.2 Linear algebra
B.2.1 Vectors and points
B.2.2 Matrices
B.2.3 Subspaces and linear maps
B.2.4 Eigenvectors and eigenvalues
B.3 Multivariate calculus
B.3.1 Limits
B.3.2 Continuous functions
B.3.3 Differentiable functions
B.3.4 Integrals
References
Supplementary literature
References
Index