This book is designed for data scientists, machine learning practitioners, and anyone with a foundational understanding of Python 3.x. In the evolving field of data science, the ability to manipulate and understand datasets is crucial, and this book helps readers master those skills with Python 3. It provides a fast-paced introduction to a wealth of feature engineering concepts, equipping readers with the knowledge needed to transform raw data into meaningful information. Inside, you’ll find a detailed exploration of the various types of data, outlier detection with Scikit-Learn, strategies for robust data cleaning, and the intricacies of data wrangling. The book then turns to feature selection, including methods for handling imbalanced datasets, and gives a practical overview of feature engineering, covering the scaling and extraction techniques required by different machine learning algorithms. It concludes with dimensionality reduction, guiding you through concepts such as PCA and related reduction techniques, with an emphasis on the powerful Scikit-Learn framework.
Features:
- Includes numerous practical examples and partial code blocks that illuminate the path from theory to application
- Explores everything from data cleaning to the subtleties of feature selection and extraction, covering a wide spectrum of feature engineering topics
- Offers an appendix on working with the “awk” command-line utility
- Provides companion files with source code, datasets, and figures, available for download
What do I need to know for this book?
Current knowledge of Python 3.x is useful because all of the code samples are written in Python. Familiarity with basic data structures will help you progress through the related chapters more quickly. The less technical background you have, the more diligence will be required to understand the topics covered. If you want to be sure you can grasp the material in this book, glance through some of the code samples to gauge how much is familiar to you and how much is new.
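As a rough benchmark, here is a short, hypothetical snippet (not taken from the book) in the spirit of topics listed in the table of contents, such as EDA on the Titanic dataset and scaling with the StandardScaler class; it assumes the pandas, seaborn, and scikit-learn packages are installed. If code at this level reads comfortably, you have the background the book expects.

```python
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset bundled with seaborn (downloads on first use)
df = sns.load_dataset("titanic")

# Quick exploratory look: shape, column types, and missing-value counts
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Standardize two numeric columns (mean 0, variance 1) after dropping missing rows
numeric = df[["age", "fare"]].dropna()
scaled = StandardScaler().fit_transform(numeric)
print(scaled[:5])
```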
Author(s): Oswald Campesato
Publisher: Mercury Learning and Information
Year: 2023
Language: English
Pages: 229
Cover
Title Page
Copyright Page
Dedication
Contents
Preface
Chapter 1: Working With Datasets
Exploratory Data Analysis (EDA)
EDA Code Sample: Titanic
EDA and Histograms
Dealing With Data: What Can Go Wrong?
Datasets
Explanation of Data Types
Binary Data
Nominal Data
Ordinal Data
Categorical Data
Interval Data
Ratio Data
Continuous Data Versus Discrete Data
Random Variables
Qualitative and Quantitative Data
Types of Statistical Data
Data Preprocessing
Working With Data Types
Data Drift
What Is Data Leakage?
Data Leakage and Differential Privacy
Model Selection and Preparing Datasets
Model Selection
Discrete Data Versus Continuous Data
“Binning” Data Values
Programmatic Binning Techniques
Potential Issues When Binning Data Values
Handling Categorical Data
Processing Inconsistent Categorical Data
Mapping Categorical Data to Numeric Values
Types of Dependencies Among Features
Homoskedasticity and Heteroskedasticity
Collinearity
Variance Inflation Factor
Multicollinearity
Correlation
Working With Currency
Working With Dates
Splitting and Scaling Data
Why Normalize Data?
Split Before Normalizing Data
Scaling Numeric Data via Normalization
Scaling Numeric Data to the Range [a,b]
Scaling Numeric Data via Standardization
The StandardScaler Class
Scaling Numeric Data via Robust Standardization
Deciding How to Scale Data
Summary
Chapter 2: Outlier and Anomaly Detection
Working With Outliers
Outliers Versus Data Drift
Outlier Detection/Removal
Incorrectly Scaled Values Versus Outliers
Other Outlier Techniques
Finding Outliers With Numpy
Finding Outliers With Pandas
Calculating Z-Scores to Find Outliers
Finding Outliers With SkLearn (Optional)
Fraud Detection
Techniques for Anomaly Detection
Summary
Chapter 3: Data Cleaning Tasks
What Is Data Cleaning?
Data Cleaning for Personal Titles
Data Cleaning in SQL
Replace NULL With 0
Replace NULL Values With Average Value
Replace Multiple Values With a Single Value
Handle Mismatched Attribute Values
Convert Strings to Date Values
Data Cleaning From the Command Line (Optional)
Working With the sed Utility
Working With Variable Column Counts
Truncating Rows in CSV Files
Generating Rows With Fixed Columns With the awk Utility
Converting Phone Numbers
Converting Numeric Date Formats
Converting Alphabetic Date Formats
Working With Date and Time Formats
Working With Codes, Countries, and Cities
Data Cleaning on a Kaggle Dataset
Summary
Chapter 4: Data Wrangling
What Is Data Wrangling?
Data Transformation: What Does This Mean?
CSV Files With Multi-Row Records
Pandas Solution (1)
Pandas Solution (2)
CSV Solution
CSV Files, Multi-Row Records, and the awk Command
Quoted Fields Split on Two Lines (Optional)
Overview of the Events Project
Why This Project?
Project Tasks
Generate Country Codes
Prepare List of Cities in Countries
Generating City Codes From Country Codes: awk
Generating City Codes From Country Codes: Python
Generating SQL Statements for the city_codes Table
Generating a CSV File for Band Members (Java)
Generating a CSV File for Band Members (Python)
Generating a Calendar of Events (COE)
Project Automation Script
Project Follow-Up Comments
Summary
Chapter 5: Feature Selection
What Is Feature Selection?
Three Types of Feature Selection Methods
Filter Methods
Variance Threshold
Chi-Squared Test
ANOVA F-test
Mutual Information
Correlation Coefficient
Wrapper Methods
Recursive Feature Elimination (RFE)
Recursive Feature Elimination With Cross-Validation (RFECV)
Sequential Feature Selection (SFS)
Backward Feature Elimination
Boruta
Embedded Methods
L1 Regularization (Lasso)
Decision Trees (and Tree-Based Models)
Elastic Net
LightGBM
Linear Models with Recursive Feature Elimination
The Need for Feature Scaling and Transformations
Labeled, Unlabeled, and Multiclass Classification
Labeled Versus Unlabeled Data
Working With Imbalanced Datasets
Detecting Imbalanced Data
Rebalancing Datasets
Specify Stratify in Data Splits
Feature Importance
What Is SMOTE?
SMOTE Extensions
An Alternative to SMOTE
What Are Transforms?
Cube Root Transformation
Other Transformations
Summary
Chapter 6: Feature Engineering
What Is Feature Engineering?
Types of Feature Engineering
What Steps Are Required to Train a Model?
Machine Learning and Algorithm Selection
Training Large Datasets
Feature Importance
Feature Engineering and Extraction
Feature Engineering
Feature Extraction
Feature Extraction Algorithms
Feature Hashing
Feature Scaling and ML Algorithms
Selecting the Type of Scaling
Algorithms That Require Feature Scaling
Algorithms That Do Not Require Feature Scaling
Data Sampling Techniques
Undersampling
Oversampling
Resampling
Data Augmentation
Summary
Chapter 7: Dimensionality Reduction
Covariance and Correlation Matrices
Covariance Matrix
Covariance Matrix: An Example
The Correlation Matrix
Eigenvalues and Eigenvectors
Calculating Eigenvectors: A Simple Example
Gauss Jordan Elimination (Optional)
PCA (Principal Component Analysis)
The New Matrix of Eigenvectors
Dimensionality Reduction
Dimensionality Reduction Techniques
The Curse of Dimensionality
What Are Manifolds (Optional)?
SVD (Singular Value Decomposition)
LLE (Locally Linear Embedding)
UMAP
t-SNE (“tee-snee”)
PHATE
Linear Versus Nonlinear Reduction Techniques
Types of Distance Metrics
Well-Known Distance Metrics
Pearson Correlation Coefficient
Jaccard Index (or Similarity)
Locality-Sensitive Hashing (Optional)
What Is Sklearn?
Sklearn, Pandas, and the IRIS Dataset
Sklearn and Outlier Detection
What Is Bayesian Inference?
Bayes’ Theorem
Some Bayesian Terminology
What Is MAP?
Why Use Bayes’ Theorem?
What Are Vector Spaces?
Summary
Appendix: Working With awk
The awk Command
Built-In Variables That Control awk
How Does the awk Command Work?
Aligning Text With the printf Statement
Conditional Logic and Control Statements
The while Statement
A for loop in awk
A for loop with a break Statement
The next and continue Statements
Deleting Alternate Lines in Datasets
Merging Lines in Datasets
Printing File Contents as a Single Line
Joining Groups of Lines in a Text File
Joining Alternate Lines in a Text File
Matching With Meta Characters and Character Sets
Printing Lines Using Conditional Logic
Splitting Filenames With awk
Working With Postfix Arithmetic Operators
Numeric Functions in awk
One-Line awk Commands
Useful Short awk Scripts
Printing the Words in a Text String in awk
Counting Occurrences of a String in Specific Rows
Printing a String in a Fixed Number of Columns
Printing a Dataset in a Fixed Number of Columns
Aligning Columns in Datasets
Aligning Columns and Multiple Rows in Datasets
Removing a Column From a Text File
Subsets of Column-Aligned Rows in Datasets
Counting Word Frequency in Datasets
Displaying Only “Pure” Words in a Dataset
Working With Multiline Records in awk
A Simple Use Case
Another Use Case
Summary
Index