Discover hidden relationships among the variables in your data, and learn how to exploit these relationships. This book presents a collection of data-mining algorithms that are effective in a wide variety of prediction and classification applications. Each algorithm includes an intuitive explanation of its operation, the essential equations, references to more rigorous theory, and commented C++ source code.
Many of these techniques are recent developments not yet in widespread use. Others are standard algorithms given a fresh look. In every case, the focus is on practical applicability, with all code written so that it can easily be included in any program. The Windows-based DATAMINE program lets you experiment with the techniques before incorporating them into your own work.
What you'll learn
Monte Carlo permutation tests provide a statistically sound assessment of the relationships present in your data (a minimal C++ sketch follows this list).
Combinatorially symmetric cross validation reveals whether your model has true predictive power or has merely learned noise by overfitting the data.
Feature weighting as regularized energy-based learning ranks variables according to their predictive power when there is too little data for traditional methods.
The eigenstructure of a dataset enables clustering of variables into groups that exist only within meaningful subspaces of the data.
Plotting regions of the variable space where marginal and actual densities disagree, or where the contribution to mutual information is high, provides visual insight into anomalous relationships.
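The sketch below is not taken from the book; it is a minimal illustration of the Monte Carlo permutation test idea from the first item above, under the assumption that absolute Pearson correlation serves as the test criterion (the book also applies the same scheme to richer criteria such as mutual information). The function names, fixed seeds, and toy data are illustrative choices, not the book's own code.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Absolute Pearson correlation between two equal-length series.
static double abs_correlation(const std::vector<double> &x,
                              const std::vector<double> &y) {
    const size_t n = x.size();
    const double mx = std::accumulate(x.begin(), x.end(), 0.0) / n;
    const double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return std::fabs(sxy / std::sqrt(sxx * syy));
}

// Right-tail permutation p-value: the fraction of shuffles whose criterion
// equals or exceeds the unshuffled value.  Shuffling the target destroys
// any real x-y relationship while preserving each variable's distribution.
double permutation_p_value(const std::vector<double> &x,
                           std::vector<double> y,  // copied; we shuffle it
                           int nreps) {
    const double observed = abs_correlation(x, y);
    std::mt19937 rng(12345);  // fixed seed for reproducibility (illustrative)
    int count = 0;
    for (int rep = 0; rep < nreps; ++rep) {
        std::shuffle(y.begin(), y.end(), rng);
        if (abs_correlation(x, y) >= observed)
            ++count;
    }
    // Adding one to numerator and denominator counts the observed statistic
    // as one draw from the null distribution, so the p-value can never be
    // exactly zero and the test stays slightly conservative.
    return (count + 1.0) / (nreps + 1.0);
}

int main() {
    // Toy data: y is a noisy linear function of x, so the test should
    // report a small p-value.
    std::mt19937 rng(1);
    std::normal_distribution<double> noise(0.0, 0.5);
    std::vector<double> x, y;
    for (int i = 0; i < 200; ++i) {
        double xi = i / 200.0;
        x.push_back(xi);
        y.push_back(xi + noise(rng));
    }
    std::printf("p = %.4f\n", permutation_p_value(x, y, 1000));
    return 0;
}

With genuinely unrelated data the reported p-value would hover near uniform on (0, 1); the same skeleton extends to any criterion by replacing abs_correlation.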
Who this book is for
The techniques presented in this book and in the DATAMINE program will be useful to anyone interested in discovering and exploiting relationships among variables. Although all code examples are written in C++, the algorithms are described in sufficient detail that they can easily be programmed in any language.
Author: Timothy Masters
Publisher: Apress
Year: 2018
Language: English
Pages: 286
Table of Contents
About the Author
About the Technical Reviewers
Introduction
Chapter 1: Information and Entropy
Entropy
Entropy of a Continuous Random Variable
Partitioning a Continuous Variable for Entropy
An Example of Improving Entropy
Joint and Conditional Entropy
Code for Conditional Entropy
Mutual Information
Fano’s Bound and Selection of Predictor Variables
Confusion Matrices and Mutual Information
Extending Fano’s Bound for Upper Limits
Simple Algorithms for Mutual Information
The TEST_DIS Program
Continuous Mutual Information
The Parzen Window Method
Adaptive Partitioning
The TEST_CON Program
Asymmetric Information Measures
Uncertainty Reduction
Transfer Entropy: Schreiber’s Information Transfer
Chapter 2: Screening for Relationships
Simple Screening Methods
Univariate Screening
Bivariate Screening
Forward Stepwise Selection
Forward Selection Preserving Subsets
Backward Stepwise Selection
Criteria for a Relationship
Ordinary Correlation
Nonparametric Correlation
Accommodating Simple Nonlinearity
Chi-Square and Cramer’s V
Mutual Information and Uncertainty Reduction
Multivariate Extensions
Permutation Tests
A Modestly Rigorous Statement of the Procedure
A More Intuitive Approach
Serial Correlation Can Be Deadly
Permutation Algorithms
Outline of the Permutation Test Algorithm
Permutation Testing for Selection Bias
Combinatorially Symmetric Cross Validation
The CSCV Algorithm
An Example of CSCV OOS Testing
Univariate Screening for Relationships
Three Simple Examples
Bivariate Screening for Relationships
Stepwise Predictor Selection Using Mutual Information
Maximizing Relevance While Minimizing Redundancy
Code for the Relevance Minus Redundancy Algorithm
An Example of Relevance Minus Redundancy
A Superior Selection Algorithm for Binary Variables
FREL for High-Dimensionality, Small Size Datasets
Regularization
Interpreting Weights
Bootstrapping FREL
Monte Carlo Permutation Tests of FREL
General Statement of the FREL Algorithm
Multithreaded Code for FREL
Some FREL Examples
Chapter 3: Displaying Relationship Anomalies
Marginal Density Product
Actual Density
Marginal Inconsistency
Mutual Information Contribution
Code for Computing These Plots
Comments on Showing the Display
Chapter 4: Fun with Eigenvectors
Eigenvalues and Eigenvectors
Principal Components (If You Really Must)
The Factor Structure Is More Interesting
A Simple Example
Rotation Can Make Naming Easier
Code for Eigenvectors and Rotation
Eigenvectors of a Real Symmetric Matrix
Factor Structure of a Dataset
Varimax Rotation
Horn’s Algorithm for Determining Dimensionality
Code for the Modified Horn Algorithm
Clustering Variables in a Subspace
Code for Clustering Variables
Separating Individual from Common Variance
Log Likelihood the Slow, Definitional Way
Log Likelihood the Fast, Intelligent Way
The Basic Expectation Maximization Algorithm
Code for Basic Expectation Maximization
Accelerating the EM Algorithm
Code for Quadratic Acceleration with DECME-2s
Putting It All Together
Thoughts on My Version of the Algorithm
Measuring Coherence
Code for Tracking Coherence
Coherence in the Stock Market
Chapter 5: Using the DATAMINE Program
File/Read Data File
File/Exit
Screen/Univariate Screen
Screen/Bivariate Screen
Screen/Relevance Minus Redundancy
Screen/FREL
Analyze/Eigen Analysis
Analyze/Factor Analysis
Analyze/Rotate
Analyze/Cluster Variables
Analyze/Coherence
Plot/Series
Plot/Histogram
Plot/Density
Index