Software Engineering for Data Scientists (MEAP v2)

These easy-to-learn software engineering techniques will radically improve collaboration, scaling, and deployment in your data science projects. In Software Engineering for Data Scientists you’ll learn to improve performance and efficiency by:

- Using source control
- Handling exceptions and errors in your code
- Improving the design of your tools and applications
- Scaling code to handle large data efficiently
- Testing model and data processing code before deployment
- Scheduling a model to run automatically
- Packaging Python code into reusable libraries
- Generating automated reports for monitoring a model in production

Software Engineering for Data Scientists presents important software engineering principles that will radically improve the performance and efficiency of data science projects. Author and Meta data scientist Andrew Treadway has spent over a decade guiding models and pipelines to production. This practical handbook is full of his sage advice that will change the way you structure your code, monitor model performance, and work effectively with software engineering teams.

Jupyter Notebook is a popular tool for data scientists because it integrates coding and visualizing the results of that code, such as plots or tables, in one seamless environment. While you can commit Jupyter Notebook files just like most other files, merge conflicts can be harder to handle when two users modify the same notebook. This is because Jupyter Notebook files are more complex than plain Python or R files (which are little different from plain text files). A Jupyter Notebook file is composed of HTML, markdown, source code, and potentially images, all embedded inside JSON, so programmatically identifying the differences between two notebooks using Git alone is quite challenging. However, there are a few alternatives that make it easy to see the differences between two Jupyter Notebook files. One alternative is a Python package called nbdime, which we’ll dive into next.

about the technology

Many basic software engineering skills apply directly to data science! As a data scientist, learning the right software engineering techniques can save you a world of time and frustration. Source control simplifies sharing, tracking, and backing up code. Testing helps reduce future errors in your models or pipelines. Exception handling automatically responds to unexpected events as they crop up. Using established engineering conventions makes it easy to collaborate with software developers. This book teaches you to handle these situations and more in your data science projects.

Author(s): Andrew Treadway
Publisher: Manning Publications
Year: 2023

Language: English
Commentary: MEAP v2
Pages: 213

Software Engineering for Data Scientists MEAP V02
Copyright
welcome
brief contents
Chapter 1: Introducing engineering principles
1.1 What do data scientists need to know about software engineering?
1.2 When do we need software engineering principles?
1.2.1 Sample data science workflow
1.2.2 How does software engineering come into the picture?
1.3 What are the components of a data pipeline?
1.3.1 Real-world example: Building a model to predict customer churn
1.4 Deploying models with machine learning pipelines
1.4.1 Data ingestion
1.4.2 Pre-processing
1.4.3 Model training
1.4.4 Model evaluation
1.4.5 Model prediction / deployment
1.4.6 Model monitoring
1.5 Summary
Chapter 2: Source control for data scientists
2.1 What is source control?
2.2 Why do data scientists need to know source control?
2.3 Introducing git
2.3.1 Basic git commands
2.4 Git workflow from scratch
2.4.1 Uploading local repository changes to a remote repository
2.4.2 How to see who made commits
2.5 Getting the latest changes from a remote repository
2.6 Conflicts and merging changes from different users
2.7 How to work with branches in Git
2.7.1 Git commands for branches
2.7.2 Summarizing Git commands
2.7.3 Best practices for using source control
2.8 Comparing Jupyter Notebook files with nbdime
2.8.1 Using the nbdime package
2.9 Summary
2.10 Practice on your own
Chapter 3: How to write robust code
3.1 Improving the structure of your code
3.1.1 PEP8 standards
3.1.2 Using pylint to automatically check the formatting and style of your code
3.1.3 Auto-formatting your code to meet styling guidelines
3.2 Avoiding repetitive code
3.2.1 Modularizing your code
3.3 Restricting inputs to functions
3.4 Clean code summary
3.5 How to implement exception handling in Python
3.5.1 What types of errors can we get in Python?
3.5.2 Using try/except to bypass errors
3.5.3 How to raise errors
3.6 Documentation
3.6.1 Using docstrings
3.6.2 Using pdoc to auto-create documentation
3.7 Summary
3.8 Practice on your own
Chapter 4: Object-oriented programming for data scientists
4.1 Introducing classes and objects
4.1.1 You’re using OOP without knowing it…
4.2 Creating a class in Python
4.2.1 Creating your own methods
4.2.2 Creating a new ML model class
4.3 Summary
4.4 Practice on your own
Chapter 5: Creating progress bars and time-outs in Python
5.1 Creating progress bars and timing Python processes
5.2 Monitoring the progress of training ML models
5.2.1 Monitoring hyperparameter tuning
5.3 How to auto-stop long-running code
5.4 Summary
5.5 Practice on your own
Chapter 6: Making your code faster and more efficient
6.1 Slow code walk through
6.1.1 Don’t repeat yourself (DRY)
6.1.2 Line profiler
6.1.3 Reducing loops
6.2 Parallelization
6.2.1 Vectorization
6.2.2 Multiprocessing
6.2.3 Training ML models with parallelization
6.3 Caching
6.4 Data structures at scale
6.4.1 Sets
6.4.2 Priority queues
6.4.3 NumPy arrays
6.5 What’s next for computational efficiency?
6.6 Summary
6.7 Practice on your own
Chapter 7: Memory management with Python
7.1 Memory profiler
7.1.1 High-level memory summaries with guppy
7.1.2 Analyzing your memory consumption line by line with memory-profiler
7.2 Sampling and chunking large datasets
7.2.1 Reading from a large CSV file using chunks
7.2.2 Random selection
7.2.3 Chunking when reading from a database
7.3 Optimizing data types for memory
7.3.1 Checking data types
7.3.2 How to check the memory usage of a data frame
7.3.3 How to check the memory usage of a column
7.3.4 Converting numeric data types to be more memory-efficient
7.3.5 Category data type
7.3.6 Sparse data type
7.3.7 Specifying data types when reading in a dataset
7.3.8 Summary of data types and memory
7.3.9 Limiting number of columns
7.4 Processing workflow for individual chunks of data
7.4.1 Cleaning and feature engineering for big datasets
7.4.2 Training a logistic regression model when the data doesn’t fit in memory
7.5 Additional tips for saving memory with Pandas and Python
7.5.1 Avoid creating extraneous variables
7.5.2 Avoiding global variables
7.5.3 NumPy arrays vs. Lists
7.5.4 Alternatives to pandas
7.6 Summary
7.7 Practice on your own