Effective Data Science Infrastructure: How to make data scientists productive

Simplify data science infrastructure to give data scientists an efficient path from prototype to production.

In Effective Data Science Infrastructure you will learn how to:
• Design data science infrastructure that boosts productivity
• Handle compute and orchestration in the cloud
• Deploy machine learning to production
• Monitor and manage performance and results
• Combine cloud-based tools into a cohesive data science environment
• Develop reproducible data science projects using Metaflow, Conda, and Docker
• Architect complex applications for multiple teams and large datasets
• Customize and grow data science infrastructure

Effective Data Science Infrastructure: How to make data scientists more productive is a hands-on guide to assembling infrastructure for data science and machine learning applications. It reveals the processes used at Netflix and other data-driven companies to manage their cutting-edge data infrastructure. In it, you’ll master scalable techniques for data storage, computation, experiment tracking, and orchestration that are relevant to companies of all shapes and sizes. You’ll learn how to make data scientists more productive with your existing cloud infrastructure, a stack of open source software, and idiomatic Python. The author is donating proceeds from this book to charities that support women and underrepresented groups in data science.

About the technology
Growing data science projects from prototype to production requires reliable infrastructure. Using the powerful new techniques and tooling in this book, you can stand up an infrastructure stack that will scale with any organization, from startups to the largest enterprises.

About the book
Effective Data Science Infrastructure teaches you to build data pipelines and project workflows that will supercharge data scientists and their projects. Based on the state-of-the-art tools and concepts that power the data operations of Netflix, this book introduces a customizable cloud-based approach to model development and MLOps that you can easily adapt to your company’s specific needs. As you roll out these practical processes, your teams will produce better and faster results when applying data science and machine learning to a wide array of business problems.

What’s inside
• Handle compute and orchestration in the cloud
• Combine cloud-based tools into a cohesive data science environment
• Develop reproducible data science projects using Metaflow, AWS, and the Python data ecosystem
• Architect complex applications that require large datasets and models, and a team of data scientists

About the reader
For infrastructure engineers and engineering-minded data scientists who are familiar with Python.

About the author
At Netflix, Ville Tuulos designed and built Metaflow, a full-stack framework for data science. Currently, he is the CEO of a startup focused on data science infrastructure.

Author(s): Ville Tuulos
Edition: 1
Publisher: Manning Publications
Year: 2022

Language: English
Commentary: Vector PDF
Pages: 352
City: Shelter Island, NY
Tags: Machine Learning; Data Science; Python; Scalability; Anaconda; Production Models; Productivity; Directed Acyclic Graphs; MLOps; Full-Stack Development; Metaflow

Effective Data Science Infrastructure
contents
foreword
preface
acknowledgments
about this book
Who should read this book?
How this book is organized: A road map
About the code
liveBook discussion forum
Other online resources
about the author
about the cover illustration
Chapter 1: Introducing data science infrastructure
1.1 Why data science infrastructure?
1.1.1 The life cycle of a data science project
1.2 What is data science infrastructure?
1.2.1 The infrastructure stack for data science
1.2.2 Supporting the full life cycle of a data science project
1.2.3 One size doesn’t fit all
1.3 Why good infrastructure matters
1.3.1 Managing complexity
1.3.2 Leveraging existing platforms
1.4 Human-centric infrastructure
1.4.1 Freedom and responsibility
1.4.2 Data scientist autonomy
Chapter 2: The toolchain of data science
2.1 Setting up a development environment
2.1.1 Cloud account
2.1.2 Data science workstation
2.1.3 Notebooks
2.1.4 Putting everything together
2.2 Introducing workflows
2.2.1 The basics of workflows
2.2.2 Executing workflows
2.2.3 The world of workflow frameworks
Chapter 3: Introducing Metaflow
3.1 The basics of Metaflow
3.1.1 Installing Metaflow
3.1.2 Writing a basic workflow
3.1.3 Managing data flow in workflows
3.1.4 Parameters
3.2 Branching and merging
3.2.1 Valid DAG structures
3.2.2 Static branches
3.2.3 Dynamic branches
3.2.4 Controlling concurrency
3.3 Metaflow in action
3.3.1 Starting a new project
3.3.2 Accessing results with the Client API
3.3.3 Debugging failures
3.3.4 Finishing touches
Chapter 4: Scaling with the compute layer
4.1 What is scalability?
4.1.1 Scalability across the stack
4.1.2 Culture of experimentation
4.2 The compute layer
4.2.1 Batch processing with containers
4.2.2 Examples of compute layers
4.3 The compute layer in Metaflow
4.3.1 Configuring AWS Batch for Metaflow
4.3.2 @batch and @resources decorators
4.4 Handling failures
4.4.1 Recovering from transient errors with @retry
4.4.2 Killing zombies with @timeout
4.4.3 The decorator of last resort: @catch
Chapter 5: Practicing scalability and performance
5.1 Starting simple: Vertical scalability
5.1.1 Example: Clustering Yelp reviews
5.1.2 Practicing vertical scalability
5.1.3 Why vertical scalability?
5.2 Practicing horizontal scalability
5.2.1 Why horizontal scalability?
5.2.2 Example: Hyperparameter search
5.3 Practicing performance optimization
5.3.1 Example: Computing a co-occurrence matrix
5.3.2 Recipe for fast-enough workflows
Chapter 6: Going to production
6.1 Stable workflow scheduling
6.1.1 Centralized metadata
6.1.2 Using AWS Step Functions with Metaflow
6.1.3 Scheduling runs with @schedule
6.2 Stable execution environments
6.2.1 How Metaflow packages flows
6.2.2 Why dependency management matters
6.2.3 Using the @conda decorator
6.3 Stable operations
6.3.1 Namespaces during prototyping
6.3.2 Production namespaces
6.3.3 Parallel deployments with @project
Chapter 7: Processing data
7.1 Foundations of fast data
7.1.1 Loading data from S3
7.1.2 Working with tabular data
7.1.3 The in-memory data stack
7.2 Interfacing with data infrastructure
7.2.1 Modern data infrastructure
7.2.2 Preparing datasets in SQL
7.2.3 Distributed data processing
7.3 From data to features
7.3.1 Distinguishing facts and features
7.3.2 Encoding features
Chapter 8: Using and operating models
8.1 Producing predictions
8.1.1 Batch, streaming, and real-time predictions
8.1.2 Example: Recommendation system
8.1.3 Batch predictions
8.1.4 Real-time predictions
Chapter 9: Machine learning with the full stack
9.1 Pluggable feature encoders and models
9.1.1 Developing a framework for pluggable components
9.1.2 Executing feature encoders
9.1.3 Benchmarking models
9.2 Deep regression model
9.2.1 Encoding input tensors
9.2.2 Defining a deep regression model
9.2.3 Training a deep regression model
9.3 Summarizing lessons learned
appendix: Installing Conda
index