Building Data Science Solutions with Anaconda: A comprehensive starter guide to building robust and complete models

The missing manual to becoming a successful data scientist: develop the skills to use key tools and the knowledge to thrive in the AI/ML landscape

Key Features

  • Learn from an AI patent-holding engineering manager with deep experience in Anaconda tools and OSS
  • Get to grips with critical aspects of data science such as bias in datasets and interpretability of models
  • Gain a deeper understanding of the AI/ML landscape through real-world examples and practical analogies

Book Description

You might already know that there's a wealth of data science and machine learning resources available on the market, but what you might not know is how much most of them leave out. This book covers the main algorithm families and also the topics those resources tend to skip, from avoiding bias in data to model interpretability, which have now become must-have skills.

In this book, you'll learn how using Anaconda as the easy button gives you a complete view of the capabilities of tools such as conda, including how to specify new channels to pull in any package you want and how to discover new open source tools at your disposal. You'll also get a clear picture of how to evaluate which model to train and how to identify when a model has become unusable due to drift. Finally, you'll learn about the powerful yet simple techniques that you can use to explain how your model works.

By the end of this book, you'll feel confident using conda and Anaconda Navigator to manage dependencies and gain a thorough understanding of the end-to-end data science workflow.

What you will learn

  • Install packages and create virtual environments using conda
  • Understand the landscape of open source software and assess new tools
  • Use scikit-learn to train and evaluate model approaches
  • Detect types of bias in your data and learn what you can do to prevent them
  • Grow your skillset with tools such as NumPy, pandas, and Jupyter Notebooks
  • Solve common dataset issues, such as imbalanced and missing data
  • Use LIME and SHAP to interpret and explain black-box models
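
As a rough illustration of the "imbalanced and missing data" item above, here is a minimal sketch that uses only the Python standard library so it runs without any conda packages. It is not taken from the book, which works with pandas and scikit-learn. The function names `impute_mean` and `balanced_class_weights` and the sample values are hypothetical; the weighting formula, n_samples / (n_classes * class_count), is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get a larger weight, so a learner that accepts
    per-class weights pays more attention to them.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical toy data: one numeric column with gaps, one imbalanced label.
ages = [34, None, 29, 41, None, 36]
labels = ["ok", "ok", "ok", "ok", "fraud", "ok"]

filled_ages = impute_mean(ages)
weights = balanced_class_weights(labels)

print(filled_ages)
print(weights)
```

Mean imputation is only one option. Whether a missing value should be filled, dropped, or kept as a signal depends on why it is missing, which is the kind of judgment the chapters on data problems address.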

Who this book is for

If you're a data analyst or data science professional looking to make the most of Anaconda's capabilities and deepen your understanding of data science workflows, then this book is for you. You don't need any prior experience with Anaconda, but a working knowledge of Python and data science basics is a must.

Table of Contents

  1. Understanding the AI/ML Landscape
  2. Analyzing Open Source Software
  3. Using Anaconda Distribution to Manage Packages
  4. Working with Jupyter Notebooks and NumPy
  5. Cleaning and Visualizing Data
  6. Overcoming Bias in AI/ML
  7. Choosing the Best AI Algorithm
  8. Dealing with Common Data Problems
  9. Building a Regression Model with scikit-learn
  10. Explainable AI - Using LIME and SHAP
  11. Tuning Hyperparameters and Versioning Your Model

Author(s): Dan Meador
Publisher: Packt Publishing
Year: 2022

Language: English
Pages: 330

Cover
Title page
Copyright and Credits
Foreword
Contributors
Table of Contents
Preface
Part 1: The Data Science Landscape – Open Source to the Rescue
Chapter 1: Understanding the AI/ML Landscape
Introducing Artificial Intelligence (AI)
Defining AI
Defining a data scientist
Understanding the current state of AI and ML
Knowing the difference between AI and ML
Understanding the massive generation of new data
Evaluating how AI delivers business value
Understanding the main types of ML models
Supervised learning
Unsupervised learning
Reinforcement learning
Evaluating the problem type
Dealing with out-of-date models
Difference between online and batch learning
How models become stale: model drift
Installing packages with Anaconda
How to use Anaconda Individual Edition to download packages
How to handle dependencies with conda
Creating separate work areas with Anaconda environments
Summary
Chapter 2: Analyzing Open Source Software
Technical requirements
Understanding open source
Forking an OSS repository with Git and GitHub
Defining open source software
Advantages of OSS
Understanding the top four OSS licenses
Copyleft versus permissive licenses
How to find out what license a library uses
Evaluating a new tool or library
GitHub stars
Age
How long since it's been updated
Number of maintainers
Age of open issues/PRs
Number of external dependencies
Importing packages with Anaconda and conda-forge
Updating to the latest conda version
Creating a conda virtual environment
The differences between modules, packages, and libraries
Evaluating and using scikit-learn
Evaluation metrics
Getting up and running with scikit-learn
Summary
Chapter 3: Using the Anaconda Distribution to Manage Packages
Technical requirements
Learning how dependency resolution works
How pip and conda are different
Discovering what conda environments are and how to use them
Creating environments in conda
Creating environments in Navigator
Installing packages via Navigator
Installing packages via conda
Exporting environments to Anaconda.org
Managing channels with Anaconda Navigator and conda
Understanding what a channel is
Setting channel priority
Using advanced conda info and settings
Using conda info to see configuration information
Setting up your conda settings file
Conda cheat sheet
Conda general commands
Conda environment commands
Summary
Chapter 4: Working with Jupyter Notebooks and NumPy
Technical requirements
Working with Jupyter notebooks
Creating a new Jupyter notebook
Working with Jupyter notebook cells
Line and cell magic in Jupyter cells
Accessing the system command line
Using NumPy to perform calculations quickly
Creating and manipulating NumPy arrays
Understanding why NumPy's ndarrays are fast
Summary
Part 2: Data Is the New Oil, Models Are the New Refineries
Chapter 5: Cleaning and Visualizing Data
Technical requirements
Cleaning data with pandas
Installing pandas in your conda environment
Working with CSVs
Analyzing and cleaning data
Dealing with missing data
Creating a deep copy of a DataFrame
Visualization with Matplotlib
Preparing data for plotting
Plotting data
Customizing the plot
Showing the plot
Plotting a scatter plot and polynomial regression line
Summary
Chapter 6: Overcoming Bias in AI/ML
Technical requirements
Defining bias versus discrimination
Bias in AI/ML
Discrimination in AI/ML
Overcoming proxy bias
Examples of proxy bias
How to prevent proxy bias
Overcoming sample bias
Examples of sample bias
Racial/gender bias
How to prevent sample bias
Overcoming exclusion bias
Examples of exclusion bias
How to prevent exclusion bias
Overcoming measurement bias
Examples of measurement bias
How to prevent measurement bias
Overcoming societal AI bias
Examples of societal bias
Finding bias in an example
Summary
Chapter 7: Choosing the Best AI Algorithm
Technical requirements
Defining your problem
Model problem types
Algorithms by problem type
Understanding regression problems with examples
Linear regression
Random forest
Support vector machines
Artificial neural networks
Classification
Classification algorithms
Classification example
Logistic regression
Decision trees/random forest
K-nearest neighbors
Anomaly detection
One-class SVM
Isolation forests
Clustering problems
DBSCAN
K-means clustering
Summary
Chapter 8: Dealing with Common Data Problems
Technical requirements
Dealing with too much data
Checking feature correlation
Detecting NaN values
Dealing with valid NaN values
Dealing with invalid NaN values
Finding and correcting data entries
Retrieving specific pandas items by condition
Working with categorical values with one-hot encoding
One-hot encoding with pandas
Ordinal encoding
Feature scaling
Creating a histogram with pandas
Using the R2 score to evaluate a model
Using the MSE score to evaluate a model
Using the MAE score to evaluate a model
Overcoming the limits of capped values
Recovering the raw dataset
Working with date formats
Summary
Part 3: Practical Examples and Applications
Chapter 9: Building a Regression Model with scikit-learn
Technical requirements
Walking through the data science workflow
Setting up and understanding the problem space
Setting up your workspace
Combining two CSV files
Exploring and cleaning the data
Checking for missing values
Checking for redundant features
Focusing on the key features
Creating and evaluating regression algorithms
Comparing regression and classification
Preparing the data for training
Evaluating potential models using MSE and R2 scores
Training your models
Analyzing model results with MSE and R2 score
R2 score
Training a KNN model
Linear regression
Making use of our results
Summary
Chapter 10: Explainable AI - Using LIME and SHAP
Technical requirements
Understanding the value of interpretation
Knowing the difference between interpreting and explaining
Looking at legal reasons for interpretability
Looking at moral reasons for interpretability
Looking at business reasons for interpretability
Looking at model improvement reasons for interpretability
Understanding models that are interpretable by design
Interpreting decision trees
Graphing a decision tree
Explaining a model's outcome with LIME
Creating a LIME example
Weighing the drawbacks of LIME
Explaining a model's outcome with SHAP
Avoid confusion with Shapley values
Creating a SHAP example
Looking at the SHAP result
Weighing the drawbacks of SHAP
Thinking through shortcomings of interpretation and XAI
Summary
Chapter 11: Tuning Hyperparameters and Versioning Your Model
Technical requirements
Creating a scikit-learn pipeline
scikit-learn estimators and transformers
Creating a scikit-learn pipeline
Testing out various algorithm methods
Feeding live production data into pipelines
Finding optimal hyperparameters with GridSearchCV
Defining the difference between hyperparameters and parameters
Using a grid search on a random forest pipeline
Versioning and storing your model
Pickling a model
Loading your pickled model
Storing your model with joblib
Summary
Close
Index
Other Books You May Enjoy