Author(s): Chris Albon
Publisher: O'Reilly
Year: 2017
Chapter 1.

1.0 Introduction

The first step in any machine learning endeavor is to get the raw data into our system. The raw data can be held in a log file, dataset file, or database. Furthermore, we will often want to get data from multiple sources.

The recipes in this chapter look at methods of loading data from a variety of sources, including CSV files and SQL databases. We also cover methods of generating simulated data with desirable properties for experimentation. Finally, while there are many ways to load data in the Python ecosystem, we will focus on using the pandas library's extensive set of methods for loading external data and scikit-learn, an open source machine learning library in Python, for generating simulated data.

1.1 Loading A Sample Dataset

Problem

You need to load a pre-existing sample dataset.

Solution

scikit-learn comes with a number of popular datasets for you to use.

    # Load scikit-learn's datasets
    from sklearn import datasets

    # Load the digits dataset
    digits =
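The introduction above names CSV files as one of the sources pandas can load. The following is only a minimal sketch of that idea, not a recipe from the chapter: the CSV text, its column names (`integer`, `category`), and its values are hypothetical and are kept in memory so the example runs without an external file.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so no external file is needed
csv_text = "integer,category\n5,0\n8,1\n9,0\n"

# read_csv accepts a path, a URL, or any file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# Inspect what was loaded: number of rows and columns, then the column names
n_rows, n_cols = df.shape
column_names = list(df.columns)
print(n_rows, n_cols, column_names)
```

Passing a file path or URL string in place of the `StringIO` object would load the same kind of table from disk or over the network.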
Chapter 2.

2.0 Introduction

Data wrangling is a broad term used, often informally, to describe the process of transforming raw data into a clean and organized format ready for further preprocessing or final use. For us, data wrangling is only one step in preprocessing our data, but it is an important step.

The most common data structure used to "wrangle" data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet. Here is a data frame created from data about passengers on the Titanic:

    # Load library
    import pandas as pd

    # Create URL
    url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

    # Load data
    df = pd.read_csv(url)

    # Show the first 5 rows
    df.head(5)

       Name                          PClass  Age    Sex     Survived  SexCode
    0  Allen, Miss Elisabeth Walton  1st     29.00  female  1         1
    1  Allison, Miss Helen Loraine   1st     2.00   female  0         1
    2  Allison, Mr Hudson Joshua Creighton
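To make the row-and-column structure concrete without downloading the full file, here is a small sketch that rebuilds only the two complete rows visible in the excerpt above as an in-memory data frame. The variable names (`rows`, `survivors`, `first_age`) are illustrative choices, not names from the book.

```python
import pandas as pd

# The two complete rows shown in the excerpt, typed in by hand
rows = [
    {"Name": "Allen, Miss Elisabeth Walton", "PClass": "1st",
     "Age": 29.0, "Sex": "female", "Survived": 1, "SexCode": 1},
    {"Name": "Allison, Miss Helen Loraine", "PClass": "1st",
     "Age": 2.0, "Sex": "female", "Survived": 0, "SexCode": 1},
]

# Each dict becomes a row; the dict keys become the column names
df = pd.DataFrame(rows)

# Dimensions of the table: (number of rows, number of columns)
n_rows, n_cols = df.shape

# Select rows with a boolean condition on a column
survivors = df[df["Survived"] == 1]

# Select a single cell by row label and column name
first_age = df.loc[0, "Age"]

print(n_rows, n_cols, len(survivors), first_age)
```

Row selection by condition and cell lookup by label are the two access patterns most wrangling steps build on.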
Chapter 3.

3.0 Introduction

Quantitative data is the measurement of something, whether class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 29 students or $529,392 in sales). In this chapter, we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.

3.1 Rescaling A Feature

Problem

You need to rescale the values of a numerical feature to be between two values.

Solution

Use scikit-learn's MinMaxScaler to rescale a feature array:

    # Load libraries
    from sklearn import preprocessing
    import numpy as np

    # Create feature
    x = np.array([[-500.5],
                  [-100.1],
                  [0],
                  [100.1],
                  [900.9]])

    # Create scaler
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

    # Scale feature
    x_scale = minmax_scale.fit_transform(x)

    # Show feature
    x_scale

    array([[ 0.        ],
           [ 0.28571429],
           [ 0.35714286],
           [ 0.42857143],
           [ 1.        ]])

Discussion

Rescaling is a common preprocessing task in ma
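The arithmetic behind the output above can be checked by hand. The sketch below is a plain-Python illustration of min-max rescaling, not scikit-learn's implementation; the function name `min_max_rescale` is hypothetical. It maps each value to (value - min) / (max - min) and then stretches the result to the requested range.

```python
def min_max_rescale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to the low end of
    feature_range and the maximum maps to the high end."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # All values are identical, so there is no range to rescale
        raise ValueError("cannot rescale a constant feature")
    return [low + (v - v_min) / span * (high - low) for v in values]


# Same feature values as in the recipe, as a flat list
x = [-500.5, -100.1, 0.0, 100.1, 900.9]
x_scaled = min_max_rescale(x)

for original, scaled in zip(x, x_scaled):
    print(f"{original:8.1f} -> {scaled:.8f}")
```

For example, -100.1 becomes (-100.1 + 500.5) / (900.9 + 500.5) = 400.4 / 1401.4, which is about 0.28571429 and agrees with the second entry of the array shown in the recipe.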