Building and testing machine learning models requires access to large and diverse datasets. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data (fake data generated from real data) so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue.
Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.
This book describes:
• Steps for generating synthetic data using multivariate normal distributions
• Methods for distribution fitting covering different goodness-of-fit metrics
• How to replicate the simple structure of original data
• An approach for modeling data structure to consider complex relationships
• Multiple approaches and metrics you can use to assess data utility
• How analysis performed on real data can be replicated with synthetic data
• Privacy implications of synthetic data and methods to assess identity disclosure
Author(s): Khaled El Emam, Lucy Mosquera, Richard Hoptroff
Edition: 1
Publisher: O'Reilly Media
Year: 2020
Language: English
Commentary: Vector PDF
Pages: 166
City: Sebastopol, CA
Tags: Data Science; Privacy; Availability; Synthetic Data
Cover
Copyright
Table of Contents
Preface
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Chapter 1. Introducing Synthetic Data Generation
Defining Synthetic Data
Synthesis from Real Data
Synthesis Without Real Data
Synthesis and Utility
The Benefits of Synthetic Data
Efficient Access to Data
Enabling Better Analytics
Synthetic Data as a Proxy
Learning to Trust Synthetic Data
Synthetic Data Case Studies
Manufacturing and Distribution
Healthcare
Financial Services
Transportation
Summary
Chapter 2. Implementing Data Synthesis
When to Synthesize
Identifiability Spectrum
Trade-Offs in Selecting PETs to Enable Data Access
Decision Criteria
PETs Considered
Decision Framework
Examples of Applying the Decision Framework
Data Synthesis Projects
Data Synthesis Steps
Data Preparation
The Data Synthesis Pipeline
Synthesis Program Management
Summary
Chapter 3. Getting Started: Distribution Fitting
Framing Data
How Data Is Distributed
Fitting Distributions to Real Data
Generating Synthetic Data from a Distribution
Measuring How Well Synthetic Data Fits a Distribution
The Overfitting Dilemma
A Little Light Weeding
Summary
Chapter 4. Evaluating Synthetic Data Utility
Synthetic Data Utility Framework: Replication of Analysis
Synthetic Data Utility Framework: Utility Metrics
Comparing Univariate Distributions
Comparing Bivariate Statistics
Comparing Multivariate Prediction Models
Distinguishability
Summary
Chapter 5. Methods for Synthesizing Data
Generating Synthetic Data from Theory
Sampling from a Multivariate Normal Distribution
Inducing Correlations with Specified Marginal Distributions
Copulas with Known Marginal Distributions
Generating Realistic Synthetic Data
Fitting Real Data to Known Distributions
Using Machine Learning to Fit the Distributions
Hybrid Synthetic Data
Machine Learning Methods
Deep Learning Methods
Synthesizing Sequences
Summary
Chapter 6. Identity Disclosure in Synthetic Data
Types of Disclosure
Identity Disclosure
Learning Something New
Attribute Disclosure
Inferential Disclosure
Meaningful Identity Disclosure
Defining Information Gain
Bringing It All Together
Unique Matches
How Privacy Law Impacts the Creation and Use of Synthetic Data
Issues Under the GDPR
Issues Under the CCPA
Issues Under HIPAA
Article 29 Working Party Opinion
Summary
Chapter 7. Practical Data Synthesis
Managing Data Complexity
For Every Pre-Processing Step There Is a Post-Processing Step
Field Types
The Need for Rules
Not All Fields Have to Be Synthesized
Synthesizing Dates
Synthesizing Geography
Lookup Fields and Tables
Missing Data and Other Data Characteristics
Partial Synthesis
Organizing Data Synthesis
Computing Capacity
A Toolbox of Techniques
Synthesizing Cohorts Versus Full Datasets
Continuous Data Feeds
Privacy Assurance as Certification
Performing Validation Studies to Get Buy-In
Motivated Intruder Tests
Who Owns Synthetic Data?
Conclusions
Index
About the Authors
Colophon