The Pragmatic Programmer for Machine Learning: Engineering Analytics and Data Science Solutions


Machine learning has redefined the way we work with data and is increasingly becoming an indispensable part of everyday life. The Pragmatic Programmer for Machine Learning: Engineering Analytics and Data Science Solutions discusses how modern software engineering practices are part of this revolution, both conceptually and in practical applications. Comprising a broad overview of how to design machine learning pipelines as well as the state-of-the-art tools we use to build them, this book provides a multi-disciplinary view of how traditional software engineering can be adapted to and integrated with the workflows of domain experts and probabilistic models. From choosing the right hardware to designing effective pipeline architectures and adopting software development best practices, this guide will appeal to machine learning and data science specialists, whilst also laying out key high-level principles in a way that is approachable for students of computer science and aspiring programmers.

Author(s): Marco Scutari, Mauro Malvestio
Series: Chapman & Hall/CRC Machine Learning & Pattern Recognition
Publisher: CRC Press/Chapman & Hall
Year: 2023

Language: English
Pages: 356
City: Boca Raton

Cover
Half Title
Series Page
Title Page
Copyright Page
Dedication
Contents
Preface
1. What Is This Book About?
1.1. Machine Learning
1.2. Data Science
1.3. Software Engineering
1.4. How Do They Go Together?
I. Foundations of Scientific Computing
2. Hardware Architectures
2.1. Types of Hardware
2.1.1. Compute
2.1.2. Memory
2.1.3. Connections
2.2. Making Hardware Live Up to Expectations
2.3. Local and Remote Hardware
2.4. Choosing the Right Hardware for the Job
3. Variable Types and Data Structures
3.1. Variable Types
3.1.1. Integers
3.1.2. Floating Point
3.1.3. Strings
3.2. Data Structures
3.2.1. Vectors and Lists
3.2.2. Representing Data with Data Frames
3.2.3. Dense and Sparse Matrices
3.3. Choosing the Right Variable Types for the Job
3.4. Choosing the Right Data Structures for the Job
4. Analysis of Algorithms
4.1. Writing Pseudocode
4.2. Computational Complexity and Big-O Notation
4.3. Big-O Notation and Benchmarking
4.4. Algorithm Analysis for Machine Learning
4.5. Some Examples of Algorithm Analysis
4.5.1. Estimating Linear Regression Models
4.5.2. Sparse Matrices Representation
4.5.3. Uniform Simulations of Directed Acyclic Graphs
4.6. Big-O Notation and Real-World Performance
II. Best Practices for Machine Learning Pipelines
5. Designing and Structuring Pipelines
5.1. Data as Code
5.2. Technical Debt
5.2.1. At the Data Level
5.2.2. At the Model Level
5.2.3. At the Architecture (Design) Level
5.2.4. At the Code Level
5.3. Machine Learning Pipeline
5.3.1. Project Scoping
5.3.2. Producing a Baseline Implementation
5.3.3. Data Ingestion and Preparation
5.3.4. Model Training, Evaluation and Validation
5.3.5. Deployment, Serving and Inference
5.3.6. Monitoring, Logging and Reporting
6. Writing Machine Learning Code
6.1. Choosing Languages and Libraries
6.2. Naming Things
6.3. Coding Styles and Coding Standards
6.4. Filesystem Structure
6.5. Effective Versioning
6.6. Code Review
6.7. Refactoring
6.8. Reworking Academic Code: An Example
7. Packaging and Deploying Pipelines
7.1. Model Packaging
7.1.1. Standalone Packaging
7.1.2. Programming Language Package Managers
7.1.3. Virtual Machines
7.1.4. Containers
7.2. Model Deployment: Strategies
7.3. Model Deployment: Infrastructure
7.4. Model Deployment: Monitoring and Logging
7.5. What Can Possibly Go Wrong?
7.6. Rolling Back
8. Documenting Pipelines
8.1. Comments
8.2. Documenting Public Interfaces
8.3. Documenting Architecture and Design
8.4. Documenting Algorithms and Business Cases
8.5. Illustrating Practical Use Cases
9. Troubleshooting and Testing Pipelines
9.1. Data Are the Problem
9.1.1. Large Data
9.1.2. Heterogeneous Data
9.1.3. Dynamic Data
9.2. Models Are the Problem
9.2.1. Large Models
9.2.2. Black-Box Models
9.2.3. Costly Models
9.2.4. Many Models
9.3. Common Signs That Something Is Up
9.4. Tests Are the Solution
9.4.1. What Do We Want to Achieve?
9.4.2. What Should We Test?
9.4.3. Offline and Online Data
9.4.4. Testing Local and Testing Global
9.4.5. Conceptual and Implementation Errors
9.4.6. Code Coverage and Test Prioritisation
III. Tools and Technologies
10. Tools for Developing Pipelines
10.1. Data Exploration and Experiment Tracking
10.2. Code Development
10.2.1. Code Editors and IDEs
10.2.2. Notebooks
10.2.3. Accessing Data and Documentation
10.3. Build, Test and Documentation Tools
11. Tools to Manage Pipelines in Production
11.1. Infrastructure Management
11.2. Machine Learning Software Management
11.3. Dashboards, Visualisation and Reporting
IV. A Case Study
12. Recommending Recommendations: A Recommender System Using Natural Language Understanding
12.1. The Domain Problem
12.2. The Machine Learning Model
12.3. The Infrastructure
12.4. The Architecture of the Pipeline
12.4.1. Data Ingestion and Data Preparation
12.4.2. Data Tracking and Versioning
12.4.3. Training and Experiment Tracking
12.4.4. Model Packaging
12.4.5. Deployment and Inference
Bibliography
Index