Finding Ghosts in Your Data: Anomaly Detection Techniques with Examples in Python

Discover key information buried in the noise of data by learning a variety of anomaly detection techniques and using the Python programming language to build a robust service for anomaly detection against a variety of data types. The book starts with an overview of what anomalies and outliers are and uses the Gestalt school of psychology to explain just why it is that humans are naturally great at detecting anomalies. From there, you will move into technical definitions of anomalies, moving beyond "I know it when I see it" to defining things in a way that computers can understand.
The core of the book involves building a robust, deployable anomaly detection service in Python. You will start with a simple anomaly detection service, which will expand over the course of the book to include a variety of valuable anomaly detection techniques, covering descriptive statistics, clustering, and time series scenarios. Finally, you will compare your anomaly detection service head-to-head with a publicly available cloud offering and see how they perform.
The anomaly detection techniques and examples in this book combine psychology, statistics, mathematics, and Python programming in a way that is easily accessible to software developers. They give you an understanding of what anomalies are and why you are naturally a gifted anomaly detector. Then, they help you translate your human techniques into algorithms that can be used to program computers to automate the process. You’ll develop your own anomaly detection service, extend it with clustering techniques for multivariate analysis and time series techniques for observing data over time, and compare your service head-to-head against a commercial service.

What You Will Learn
  • Understand the intuition behind anomalies
  • Convert your intuition into technical descriptions of anomalous data
  • Detect anomalies using statistical tools, such as distributions, variance and standard deviation, robust statistics, and interquartile range
  • Apply state-of-the-art anomaly detection techniques in the realms of clustering and time series analysis
  • Work with common Python packages for outlier detection and time series analysis, such as scikit-learn, PyOD, and tslearn
  • Develop a project from the ground up which finds anomalies in data, starting with simple arrays of numeric data and expanding to include multivariate inputs and even time series data
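
As a rough illustration of the statistical tools named in the list above (robust statistics and interquartile range), here is a minimal sketch that uses only the Python standard library. It is not taken from the book: the function names `iqr_outliers` and `mad_outliers`, the 1.5 × IQR fence, and the unscaled MAD cutoff of 3.0 are all assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3.

    Hypothetical helper; k=1.5 is a common convention, not a rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def mad_outliers(values, threshold=3.0):
    """Return values whose distance from the median exceeds
    threshold * MAD (median absolute deviation, left unscaled here).

    Hypothetical helper; the threshold is an assumed cutoff.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; this simple
        # sketch declines to flag anything in that case.
        return []
    return [v for v in values if abs(v - med) / mad > threshold]


readings = [10, 11, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(readings))
print(mad_outliers(readings))
```

Both checks rely on the median and quartiles rather than the mean, so a single extreme value such as 95 does not shift the cutoffs the way it would shift a mean and standard deviation.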

Who This Book Is For

This book is for software developers who have at least some familiarity with the Python programming language and who would like to understand the science and some of the statistics behind anomaly detection techniques. Readers are not required to have any formal knowledge of statistics, as the book introduces relevant concepts along the way.

Author(s): Kevin Feasel
Edition: 1
Publisher: Apress
Year: 2022

Language: English
Commentary: Publisher PDF
Pages: 373
City: New York
Tags: Outlier Analysis; Anomaly Detection; Gestalt; Robust Statistics; Interquartile Range; Mahalanobis Distance; Changepoint Detection; Exponential Smoothing; Time Series Anomaly Detection; ARMA; ARIMA; Anomaly Detection as a Service; Anomaly Detection Principles and Algorithms; Anomaly Detection: Techniques and Applications; Python; Outlier and Anomaly Detection; Multivariate Anomaly Detection; Azure Cognitive Services Anomaly Detector

Table of Contents
About the Author
About the Technical Reviewer
Introduction
Chapter 1: The Importance of Anomalies and Anomaly Detection
Defining Anomalies
Outlier
Noise vs. Anomalies
Diagnosing an Example
What If We’re Wrong?
Anomalies in the Wild
Finance
Medicine
Sports Analytics
A $23 Million Mistake
A Persistent Anomaly
Web Analytics
And Many More
Classes of Anomaly Detection
Statistical Anomaly Detection
Clustering Anomaly Detection
Model-Based Anomaly Detection
Building an Anomaly Detector
Key Goals
How Do Humans Handle Anomalies?
Known Unknowns
Conclusion
Chapter 2: Humans Are Pattern Matchers
A Primer on the Gestalt School
Key Findings of the Gestalt School
Emergence
Reification
Invariance
Multistability
Principles Implied in the Key Findings
Meaningfulness
Conciseness
Closure
Similarity
Good Continuation
Figure and Ground
Proximity
Connectedness
Common Region
Symmetry
Common Fate
Synchrony
Helping People Find Anomalies
Use Color As a Signal
Limit Nonmeaningful Information
Enable “Connecting the Dots”
Conclusion
Chapter 3: Formalizing Anomaly Detection
The Importance of Formalization
“I’ll Know It When I See It” Isn’t Enough
Human Fallibility
Marginal Outliers
The Limits of Visualization
The First Formal Tool: Univariate Analysis
Distributions and Histograms
The Normal Distribution
Mean, Variance, and Standard Deviation
Additional Distributions
Log-Normal
Uniform
Cauchy
Robustness and the Mean
The Susceptibility of Outliers
The Median and “Robust” Statistics
Beyond the Median: Calculating Percentiles
Control Charts
Conclusion
Chapter 4: Laying Out the Framework
Tools of the Trade
Choosing a Programming Language
Making Plumbing Choices
Reducing Architectural Variables
Developing an Initial Framework
Battlespace Preparation
Framing the API
Input and Output Signatures
Defining a Common Signature
Defining an Outlier
Sensitivity and Fraction of Anomalies
Single Solution
Combined Arms
Framing the Solution
Containerizing the Solution
Conclusion
Chapter 5: Building a Test Suite
Tools of the Trade
Unit Test Library
Integration Testing
Writing Testable Code
Keep Methods Separated
Emphasize Use Cases
Functional or Clean: Your Choice
Creating the Initial Tests
Unit Tests
Integration Tests
Conclusion
Chapter 6: Implementing the First Methods
A Motivating Example
Ensembling As a Technique
Sequential Ensembling
Independent Ensembling
Choosing Between Sequential and Independent Ensembling
Implementing the First Checks
Standard Deviations from the Mean
Median Absolute Deviations from the Median
Distance from the Interquartile Range
Completing the run_tests() Function
Building a Scoreboard
Weighting Results
Determining Outliers
Updating Tests
Updating Unit Tests
Updating Integration Tests
Conclusion
Chapter 7: Extending the Ensemble
Adding New Tests
Checking for Normality
Approaching Normality
A Framework for New Tests
Grubbs’ Test for Outliers
Generalized ESD Test for Outliers
Dixon’s Q Test
Calling the Tests
Updating Tests
Updating Unit Tests
Updating Integration Tests
Multi-peaked Data
A Hidden Assumption
The Solution: A Sneak Peek
Conclusion
Chapter 8: Visualize the Results
Building a Plan
What Do We Want to Show?
How Do We Want to Show It?
Developing a Visualization App
Getting Started with Streamlit
Building the Initial Screen
Displaying Results and Details
Conclusion
Chapter 9: Clustering and Anomalies
What Is Clustering?
Common Cluster Terminology
K-Means Clustering
K-Nearest Neighbors
When Clustering Makes Sense
Gaussian Mixture Modeling
Implementing a Univariate Version
Updating Tests
Common Problems with Clusters
Choosing the Correct Number of Clusters
Clustering Is Nondeterministic
Alternative Approaches
Tree-Based Approaches
The Problem with Trees
Conclusion
Chapter 10: Connectivity-Based Outlier Factor (COF)
Distance or Density?
Local Outlier Factor
Connectivity-Based Outlier Factor
Introducing Multivariate Support
Laying the Groundwork
Implementing COF
Test and Website Updates
Unit Test Updates
Integration Test Updates
Website Updates
Conclusion
Chapter 11: Local Correlation Integral (LOCI)
Local Correlation Integral
Discovering the Neighborhood
Multi-granularity Deviation Factor (MDEF)
Multivariate Algorithm Ensembles
Ensemble Types
COF Combinations
Incorporating LOCI
Test and Website Updates
Unit Test Updates
Website Updates
Conclusion
Chapter 12: Copula-Based Outlier Detection (COPOD)
Copula-Based Outlier Detection
What’s a Copula?
Intuition Behind COPOD
Implementing COPOD
Test and Website Updates
Unit Test Updates
Integration Test Updates
Website Updates
Conclusion
Chapter 13: Time and Anomalies
What Is Time Series?
Time Series Changes Our Thinking
Autocorrelation
Smooth Movement
The Nature of Change
Data Requirements
Time Series Modeling
(Weighted) Moving Average
Exponential Smoothing
Autoregressive Models
What Constitutes an Outlier?
Local Outlier
Behavioral Changes over Time
Local Non-outlier in a Global Change
Differences from Peer Groups
Common Classes of Technique
Conclusion
Chapter 14: Change Point Detection
What Is Change Point Detection?
Benefits of Change Point Detection
Change Point Detection with ruptures
Dynamic Programming
PELT
Implementing Change Point Detection
Test and Website Updates
Unit Tests
Integration Tests
Website Updates
Avenues of Further Improvement
Conclusion
Chapter 15: An Introduction to Multi-series Anomaly Detection
What Is Multi-series Time Series?
Key Aspects of Multi-series Time Series
What Needs to Change?
What’s the Difference?
Leading and Lagging Factors
Available Processes
Cross-Euclidean Distance
Cross-Correlation Coefficient
SameTrend (STREND)
Common Problems
Conclusion
Chapter 16: Standard Deviation of Differences (DIFFSTD)
What Is DIFFSTD?
Calculating DIFFSTD
Key Assumptions
Writing DIFFSTD
Series Processing
Segmentation
Comparing the Norm
Determining Outliers
Test and Website Updates
Unit Tests
Integration Tests
Website Updates
Conclusion
Chapter 17: Symbolic Aggregate Approximation (SAX)
What Is SAX?
Motifs and Discords
Subsequences and Matches
Discretizing the Data
Implementing SAX
Segmentation and Blocking
Making SAX Multi-series
Scoring Outliers
Test and Website Updates
Unit and Integration Tests
Website Updates
Conclusion
Chapter 18: Configuring Azure Cognitive Services Anomaly Detector
Gathering Market Intelligence
Amazon Web Services: SageMaker
Microsoft Azure: Cognitive Services
Google Cloud: AI Services
Configuring Azure Cognitive Services
Set Up an Account
Using the Demo Application
Conclusion
Chapter 19: Performing a Bake-Off
Preparing the Comparison
Supervised vs. Unsupervised Learning
Choosing Datasets
Scoring Results
Performing the Bake-Off
Accessing Cognitive Services via Python
Accessing Our API via Python
Dataset Comparisons
Lessons Learned
Making a Better Anomaly Detector
Increasing Robustness
Extending the Ensembles
Training Parameter Values
Conclusion
Appendix
Index