Data Processing with Optimus: Supercharge big data preparation tasks for analytics and machine learning with Optimus using Dask and PySpark

Written by the core Optimus team, this comprehensive guide will help you understand how Optimus improves the whole data processing landscape.

Key Features

  • Load, merge, and save small and big data efficiently with Optimus
  • Learn Optimus functions for data analytics, feature engineering, machine learning, cross-validation, and NLP
  • Discover how Optimus improves on other DataFrame technologies and helps you speed up your data processing tasks

Book Description

Optimus is a Python library that works as a unified API for cleaning, processing, and merging data. It can handle both small and big data, on your local laptop or on remote clusters, using CPUs or GPUs.
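To make the engine-swapping idea concrete, here is a minimal sketch of a local Optimus session; the engine names shown follow the options the book covers, but verify them against your installed version:

```python
# Minimal sketch: one Optimus API, several interchangeable engines.
# Engine names follow the options described in the book ("pandas", "dask",
# "cudf", "dask_cudf", "spark"); check them against your installed version.
from optimus import Optimus

op = Optimus("pandas")    # single-machine CPU engine
# op = Optimus("dask")    # parallel/out-of-core processing on CPUs
# op = Optimus("cudf")    # single-GPU processing via RAPIDS

df = op.create.dataframe({"name": ["optimus", "bumblebee"]})
df = df.cols.upper("name")   # same call regardless of the engine underneath
print(df)
```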

The book begins by covering the internals of Optimus and how it works in tandem with existing technologies to serve your data processing needs. You'll then learn how to use Optimus for loading and saving data, from text formats such as CSV and JSON, through binary files such as Excel, to columnar formats such as Parquet, Avro, and ORC. Next, you'll get to grips with the profiler and its data types, a unique feature of the Optimus DataFrame that assists with data quality. You'll see how to use the plots available in Optimus, such as histograms, frequency charts, and scatter and box plots, and understand how Optimus lets you connect to libraries such as Plotly and Altair.

You'll also delve into advanced applications such as feature engineering, machine learning, cross-validation, and natural language processing functions, and explore the advancements in Optimus. Finally, you'll learn how to create data cleaning and transformation functions and add a hypothetical new data processing engine with Optimus.
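As a hedged illustration of the loading and profiling workflow described above (the file name sales.csv and its columns are invented for the example):

```python
# Sketch of loading a CSV and profiling it; "sales.csv" and the "product"
# column are hypothetical. Method names (op.load.csv, df.profile,
# df.cols.frequency) follow the Optimus documentation, but verify them
# against your installed version.
from optimus import Optimus

op = Optimus("pandas")
df = op.load.csv("sales.csv")        # also: op.load.json, op.load.parquet, ...

print(df.profile())                  # inferred data types, mismatches, stats
print(df.cols.frequency("product"))  # frequency table for an assumed column
```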

By the end of this book, you'll be able to improve your data science workflow with Optimus easily.

What you will learn

  • Apply over 100 data processing functions to columns and other string-like values (see the sketch after this list)
  • Reshape and pivot data to get the output in the required format
  • Find out how to plot histograms, frequency charts, scatter plots, box plots, and more
  • Connect Optimus with popular Python visualization libraries such as Plotly and Altair
  • Apply string clustering techniques to normalize strings
  • Discover functions to explore, fix, and remove poor quality data
  • Use advanced techniques to remove outliers from your data
  • Add engines and custom functions to clean, process, and merge data
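
For a flavor of the column-function style referenced in the first item above, here is a short, hedged sketch; the df.cols accessor matches the Optimus documentation, while the sample data is invented:

```python
# Hedged sketch of chained column clean-up in the df.cols namespace.
# The dataframe contents are invented; verify method names against your
# installed Optimus version.
from optimus import Optimus

op = Optimus("pandas")
df = op.create.dataframe({"city": ["  New York ", "NEW YORK", "new york"]})

df = df.cols.trim("city")    # strip surrounding whitespace
df = df.cols.lower("city")   # normalize case before clustering/deduplication
print(df)
```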

Who this book is for

This book is for Python developers who want to explore, transform, and prepare big data for machine learning, analytics, and reporting using Optimus, a unified API to work with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and Spark. Although not strictly necessary, beginner-level knowledge of Python will be helpful. Basic knowledge of the CLI is required to install Optimus and its requirements. To use GPU technologies, you'll need an NVIDIA graphics card compatible with NVIDIA's RAPIDS library, which supports Windows 10 and Linux.

Table of Contents

  1. Hi Optimus!
  2. Data Loading, Saving, and File Formats
  3. Data Wrangling
  4. Combining, Reshaping, and Aggregating Data
  5. Data Visualization and Profiling
  6. String Clustering
  7. Feature Engineering
  8. Machine Learning
  9. Natural Language Processing
  10. Hacking Optimus
  11. Optimus as a Web Service

Authors: Dr. Argenis Leon, Luis Aguirre
Publisher: Packt Publishing
Year: 2021

Language: English
Pages: 300

Detailed Contents

Section 1: Getting Started with Optimus

  Chapter 1: Hi Optimus!
    Technical requirements
    Introducing Optimus
    Exploring the DataFrame technologies
    Examining Optimus design principles
    Installing everything you need to run Optimus
    Installing Anaconda
    Installing Optimus
    Installing JupyterLab
    Installing RAPIDS
    Using Coiled
    Using a Docker container
    Using Optimus
    The Optimus instance
    The Optimus DataFrame
    Technical details
    Discovering Optimus internals
    Engines
    The DataFrame behind the DataFrame
    Meta
    Dummy functions
    Diagnostics
    Summary

  Chapter 2: Data Loading, Saving, and File Formats
    Technical requirements
    How data moves internally
    File to RAM
    File to GPU memory
    Database to RAM
    Database to GPU memory
    Loading a file
    Loading a local CSV file
    Wildcards
    Loading large files
    Loading a file from a remote connection
    Loading data from a database
    Special dependencies for every technology
    Creating a dataframe from scratch
    Connecting to remote data sources
    Connection credentials
    Connecting to databases
    Saving a dataframe
    Saving to a local file
    Saving a file using a remote connection
    Saving a dataframe to a database table
    Loading and saving data in parallel
    Summary

Section 2: Optimus – Transform and Rollout

  Chapter 3: Data Wrangling
    Technical requirements
    Exploring Optimus data types
    Converting data types
    Operating columns
    Selecting columns
    Moving columns
    Renaming columns
    Removing columns
    Input and output columns
    Managing functions
    String functions
    Numeric functions
    Date and time functions
    URL functions
    Email functions
    Experimenting with user-defined functions
    Using apply
    Supporting multiple engines
    Summary
    Further reading

  Chapter 4: Combining, Reshaping, and Aggregating Data
    Technical requirements
    Concatenating data
    Mapping
    Concatenating columns
    Joining data
    Reshaping and pivoting
    Pivoting
    Stacking
    Unstacking
    Melting
    Aggregations
    Aggregating and grouping
    Summary

  Chapter 5: Data Visualization and Profiling
    Technical requirements
    Data quality
    Handling matches, mismatches, and nulls
    Exploratory data analysis
    Single variable non-graphical methods
    Single variable graphical methods
    Multi-variable non-graphical methods
    Multi-variable graphical methods
    Data profiling
    Cache flushing
    Summary

  Chapter 6: String Clustering
    Technical requirements
    Exploring string clustering
    Key collision methods
    Fingerprinting
    N-gram fingerprinting
    Phonetic encoding
    Nearest-neighbor methods
    Levenshtein distance
    Applying suggestions
    Summary

  Chapter 7: Feature Engineering
    Technical requirements
    Handling missing values
    Removing data
    Imputation
    Handling outliers
    Tukey
    Z-score
    Modified Z-score
    Binning
    Variable transformation
    Logarithmic transformation
    Square root transformation
    Reciprocal transformation
    Exponential or power transformation
    String to index
    One-hot encoding
    Feature splitting
    Scaling
    Normalization
    Standardization
    Max abs scaler
    Summary

Section 3: Advanced Features of Optimus

  Chapter 8: Machine Learning
    Technical requirements
    Optimus as a cohesive API
    How Optimus can help
    Implementing a train-test split procedure
    When to use a train-test split procedure
    Test size
    Repeatable train-test splits
    Using k-fold cross-validation
    Training models in Optimus
    Linear regression
    Logistic regression
    Model performance
    K-means
    PCA
    Loading and saving models
    Summary

  Chapter 9: Natural Language Processing
    Technical requirements
    Natural language processing
    Removing unwanted strings
    Stripping the HTML
    Removing stopwords
    Removing URLs
    Removing special characters
    Expanding contracted words
    Stemming and lemmatization
    Stemming
    Lemmatization
    word_tokenizer
    Part-of-speech tagging
    Applying the transformation
    Feature extraction from text
    Bag of words
    Summary

  Chapter 10: Hacking Optimus
    Technical requirements
    Installing Git
    Adding a new engine
    Cloning the repository from GitHub
    How the project is organized
    The entry point
    Base class functions
    Applying functions
    I/O operations
    Plots
    Profiler data types
    Bumblebee
    Joining the community
    The future
    Limitations
    Summary

  Chapter 11: Optimus as a Web Service
    Technical requirements
    Introducing Blurr
    Setting up the environment
    Pre-requisites for Blurr
    Installing the package
    Importing the package
    Creating a Blurr session
    Multiple engines in one session
    Quickest setup
    Making requests
    Loading a dataframe
    Saving a dataframe
    Getting information from the dataset
    Transforming a dataset
    Passing arguments
    Getting the content of the dataset
    Multiple operations in one request
    Using other types of data
    Summary