Data-Centric Machine Learning with Python

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

Join the data-centric revolution and master the concepts, techniques, and algorithms shaping the future of AI and ML development, using Python Key Features Grasp the principles of data centricity and apply them to real-world scenarios Gain experience with quality data collection, labeling, and synthetic data creation using Python Develop essential skills for building reliable, responsible, and ethical machine learning solutions Book Description In the rapidly advancing data-driven world where data quality is pivotal to the success of machine learning and artificial intelligence projects, this critically timed guide provides a rare, end-to-end overview of data-centric machine learning (DCML), along with hands-on applications of technical and non-technical approaches to generating deeper and more accurate datasets. This book will help you understand what data-centric ML/AI is and how it can help you to realize the potential of ‘small data’. Delving into the building blocks of data-centric ML/AI, you’ll explore the human aspects of data labeling, tackle ambiguity in labeling, and understand the role of synthetic data. From strategies to improve data collection to techniques for refining and augmenting datasets, you’ll learn everything you need to elevate your data-centric practices. Through applied examples and insights for overcoming challenges, you’ll get a roadmap for implementing data-centric ML/AI in diverse applications in Python. By the end of this book, you’ll have developed a profound understanding of data-centric ML/AI and the proficiency to seamlessly integrate common data-centric approaches in the model development lifecycle to unlock the full potential of your machine learning projects by prioritizing data quality and reliability. What you will learn Understand the impact of input data quality compared to model selection and tuning Recognize the crucial role of subject-matter experts in effective model development Implement data cleaning, labeling, and augmentation best practices Explore common synthetic data generation techniques and their applications Apply synthetic data generation techniques using common Python packages Detect and mitigate bias in a dataset using best-practice techniques Understand the importance of reliability, responsibility, and ethical considerations in ML/AI Who this book is for This book is for data science professionals and machine learning enthusiasts looking to understand the concept of data-centricity, its benefits over a model-centric approach, and the practical application of a best-practice data-centric approach in their work. This book is also for other data professionals and senior leaders who want to explore the tools and techniques to improve data quality and create opportunities for small data ML/AI in their organizations.

Author(s): Jonas Christensen
, Nakul Bajaj
, Manmohan Gosada
Publisher: Packt Publishing Pvt Ltd
Year: 2024

Language: English
Pages: 469

Data-Centric Machine Learning with Python
Foreword
Contributors
About the authors
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1: What Data-Centric Machine Learning Is and Why We Need It
1
Exploring Data-Centric Machine Learning
Understanding data-centric ML
The origins of data centricity
The components of ML systems
Data is the foundational ingredient
Data-centric versus model-centric ML
Data centricity is a team sport
The importance of quality data in ML
Identifying high-value legal cases with natural language processing
Predicting cardiac arrests in emergency calls
Summary
References
2
From Model-Centric to Data-Centric – ML’s Evolution
Exploring why ML development ended up being mostly model-centric
The 1940s to 1970s – the early days
The 1980s to 1990s – the rise of personal computing and the internet
The 2000s – the rise of tech giants
2010–now – big data drives AI innovation
Model-centricity was the logical evolutionary outcome
Unlocking the opportunity for small data ML
Why we need data-centric AI more than ever
The cascading effects of data quality
Avoiding data cascades and technical debt
Summary
References
Part 2: The Building Blocks of Data-Centric ML
3
Principles of Data-Centric ML
Sometimes, all you need is the right data
Principle 1 – data should be the center of ML development
A checklist for data-centricity
Principle 2 – leverage annotators and SMEs effectively
Direct labeling with human annotators
Verifying output quality with human annotators
Codifying labeling rules with programmatic labeling
Principle 3 – use ML to improve your data
Principle 4 – follow ethical, responsible, and well-governed ML practices
Summary
References
4
Data Labeling Is a Collaborative Process
Understanding the benefits of diverse human labeling
Understanding common challenges arising from human labelers
Designing a framework for high-quality labels
Designing clear instructions
Aligning motivations and using SMEs
Collaborating iteratively
Dealing with ambiguity and reflecting diversity
Understanding approaches for dealing with ambiguity in labeling
Measuring labeling consistency
Summary
References
Part 3: Technical Approaches to Better Data
5
Techniques for Data Cleaning
The six key dimensions of data quality
Installing the required packages
Introducing the dataset
Ensuring the data is consistent
Checking that the data is unique
Ensuring that the data is complete and not missing
Ensuring that the data is valid
Ensuring that the data is accurate
Ensuring that the data is fresh
Summary
6
Techniques for Programmatic Labeling in Machine Learning
Technical requirements
Python version
Library requirements
Pattern matching
Database lookup
Boolean flags
Weak supervision
Semi-weak supervision
Slicing functions
Active learning
Uncertainty sampling
Query by Committee (QBC)
Diversity sampling
Transfer learning
Feature extraction
Fine-tuning pre-trained models
Semi-supervised learning
Summary
7
Using Synthetic Data in Data-Centric Machine Learning
Understanding synthetic data
The use case for synthetic data
Synthetic data for computer vision and image and video processing
Generating synthetic data using generative adversarial networks (GANs)
Exploring image augmentation with a practical example
Natural language processing
Privacy preservation
Generating synthetic data for privacy preservation
Using synthetic data to improve model performance
When should you use synthetic data?
Summary
References
8
Techniques for Identifying and Removing Bias
The bias conundrum
Types of bias
Easy to identify bias
Difficult to identify bias
The data-centric imperative
Sampling methods
Other data-centric techniques
Case study
Loading the libraries
AllKNN undersampling method
Instance hardness undersampling method
Oversampling methods
Shapley values to detect bias, oversample, and undersample data
Summary
9
Dealing with Edge Cases and Rare Events in Machine Learning
Importance of detecting rare events and edge cases in machine learning
Statistical methods
Z-scores
Interquartile Range (IQR)
Box plots
Scatter plots
Anomaly detection
Unsupervised method using Isolation Forest
Semi-supervised methods using autoencoders
Supervised methods using SVMs
Data augmentation and resampling techniques
Oversampling using SMOTE
Undersampling using RandomUnderSampler
Cost-sensitive learning
Choosing evaluation metrics
Ensemble techniques
Bagging
Boosting
Stacking
Summary
Part 4: Getting Started with Data-Centric ML
10
Kick-Starting Your Journey in Data-Centric Machine Learning
Solving six common ML challenges
Being a champion for data quality
Bringing people together
Taking accountability for AI ethics and fairness
Making data everyone’s business – our own experience
Summary
References
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book