Let Python do the heavy lifting for you as you analyze large datasets
Python for Data Science For Dummies lets you get your hands dirty with data using one of the top programming languages. This beginner’s guide takes you step by step through getting started, performing data analysis, understanding datasets and example code, working with Google Colab, sampling data, and beyond. Coding your data analysis tasks will make your life easier, make you more in demand as an employee, and open the door to valuable knowledge and insights. This new edition is updated for the latest version of Python and includes current, relevant data examples.
Get a firm background in the basics of Python coding for data analysis
Learn about data science careers you can pursue with Python coding skills
Integrate data analysis with multimedia and graphics
Manage and organize data with cloud-based relational databases
Python careers are on the rise. Grab this user-friendly Dummies guide and gain the programming skills you need to become a data pro.
Author(s): John Paul Mueller, Luca Massaron
Edition: 3
Publisher: Wiley-Scrivener
Year: 2023
Language: English
Pages: 467
Title Page
Copyright Page
Table of Contents
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part 1 Getting Started with Data Science and Python
Chapter 1 Discovering the Match between Data Science and Python
Understanding Python as a Language
Viewing Python’s various uses as a general-purpose language
Interpreting Python
Compiling Python
Defining Data Science
Considering the emergence of data science
Outlining the core competencies of a data scientist
Linking data science, big data, and AI
Creating the Data Science Pipeline
Understanding Python’s Role in Data Science
Considering the shifting profile of data scientists
Working with a multipurpose, simple, and efficient language
Learning to Use Python Fast
Loading data
Training a model
Viewing a result
Chapter 2 Introducing Python’s Capabilities and Wonders
Working with Python
Contributing to data science
Getting a taste of the language
Understanding the need for indentation
Working with Jupyter Notebook and Google Colab
Performing Rapid Prototyping and Experimentation
Considering Speed of Execution
Visualizing Power
Using the Python Ecosystem for Data Science
Accessing scientific tools using SciPy
Performing fundamental scientific computing using NumPy
Performing data analysis using pandas
Implementing machine learning using Scikit-learn
Going for deep learning with Keras and TensorFlow
Performing analysis efficiently using XGBoost
Plotting the data using Matplotlib
Creating graphs with NetworkX
Chapter 3 Setting Up Python for Data Science
Working with Anaconda
Using Jupyter Notebook
Accessing the Anaconda Prompt
Installing Anaconda on Windows
Installing Anaconda on Linux
Installing Anaconda on Mac OS X
Downloading the Datasets and Example Code
Using Jupyter Notebook
Starting Jupyter Notebook
Stopping the Jupyter Notebook server
Defining the code repository
Defining a new folder
Creating a new notebook
Adding notebook content
Exporting a notebook
Removing a notebook
Importing a notebook
Understanding the datasets used in this book
Chapter 4 Working with Google Colab
Defining Google Colab
Understanding what Google Colab does
Considering the online coding difference
Using local runtime support
Working with Notebooks
Creating a new notebook
Opening existing notebooks
Using Google Drive for existing notebooks
Using GitHub for existing notebooks
Using local storage for existing notebooks
Saving notebooks
Using Drive to save notebooks
Using GitHub to save notebooks
Using GitHub gists to save notebooks
Downloading notebooks
Performing Common Tasks
Creating code cells
Creating text cells
Creating special cells
Editing cells
Moving cells
Using Hardware Acceleration
Executing the Code
Viewing Your Notebook
Displaying the table of contents
Getting notebook information
Checking code execution
Sharing Your Notebook
Getting Help
Part 2 Getting Your Hands Dirty with Data
Chapter 5 Working with Jupyter Notebook
Using Jupyter Notebook
Working with styles
Getting Python help
Using magic functions
Obtaining the magic functions list
Working with magic functions
Discovering objects
Getting object help
Obtaining object specifics
Using extended Python object help
Restarting the kernel
Restoring a checkpoint
Performing Multimedia and Graphic Integration
Embedding plots and other images
Loading examples from online sites
Obtaining online graphics and multimedia
Chapter 6 Working with Real Data
Uploading, Streaming, and Sampling Data
Uploading small amounts of data into memory
Streaming large amounts of data into memory
Generating variations on image data
Sampling data in different ways
Accessing Data in Structured Flat-File Form
Reading from a text file
Reading CSV delimited format
Reading Excel and other Microsoft Office files
Sending Data in Unstructured File Form
Managing Data from Relational Databases
Interacting with Data from NoSQL Databases
Accessing Data from the Web
Chapter 7 Processing Your Data
Juggling between NumPy and pandas
Knowing when to use NumPy
Knowing when to use pandas
Validating Your Data
Figuring out what’s in your data
Removing duplicates
Creating a data map and data plan
Manipulating Categorical Variables
Creating categorical variables
Renaming levels
Combining levels
Dealing with Dates in Your Data
Formatting date and time values
Using the right time transformation
Dealing with Missing Data
Finding the missing data
Encoding missingness
Imputing missing data
Slicing and Dicing: Filtering and Selecting Data
Slicing rows
Slicing columns
Dicing
Concatenating and Transforming
Adding new cases and variables
Removing data
Sorting and shuffling
Aggregating Data at Any Level
Chapter 8 Reshaping Data
Using the Bag of Words Model to Tokenize Data
Understanding the bag of words model
Sequencing text items with n-grams
Implementing TF-IDF transformations
Working with Graph Data
Understanding the adjacency matrix
Using NetworkX basics
Chapter 9 Putting What You Know into Action
Contextualizing Problems and Data
Evaluating a data science problem
Researching solutions
Formulating a hypothesis
Preparing your data
Considering the Art of Feature Creation
Defining feature creation
Combining variables
Understanding binning and discretization
Using indicator variables
Transforming distributions
Performing Operations on Arrays
Using vectorization
Performing simple arithmetic on vectors and matrices
Performing matrix vector multiplication
Performing matrix multiplication
Part 3 Visualizing Information
Chapter 10 Getting a Crash Course in Matplotlib
Starting with a Graph
Defining the plot
Drawing multiple lines and plots
Saving your work to disk
Setting the Axis, Ticks, and Grids
Getting the axes
Formatting the axes
Adding grids
Defining the Line Appearance
Working with line styles
Using colors
Adding markers
Using Labels, Annotations, and Legends
Adding labels
Annotating the chart
Creating a legend
Chapter 11 Visualizing the Data
Choosing the Right Graph
Creating comparisons with bar charts
Showing distributions using histograms
Depicting groups using boxplots
Seeing data patterns using scatterplots
Creating Advanced Scatterplots
Depicting groups
Showing correlations
Plotting Time Series
Representing time on axes
Plotting trends over time
Plotting Geographical Data
Using an environment in Notebook
Using Cartopy to plot geographic data
Avoiding outdated libraries: The Basemap Toolkit
Visualizing Graphs
Developing undirected graphs
Developing directed graphs
Part 4 Wrangling Data
Chapter 12 Stretching Python’s Capabilities
Playing with Scikit-learn
Understanding classes in Scikit-learn
Defining applications for data science
Using Transformative Functions
Chaining estimators
Transforming targets
Composing features
Handling heterogeneous data
Considering Timing and Performance
Benchmarking with timeit
Working with the memory profiler
Running in Parallel on Multiple Cores
Performing multicore parallelism
Demonstrating multiprocessing
Chapter 13 Exploring Data Analysis
The EDA Approach
Defining Descriptive Statistics for Numeric Data
Measuring central tendency
Measuring variance and range
Working with percentiles
Defining measures of normality
Counting for Categorical Data
Understanding frequencies
Creating contingency tables
Creating Applied Visualization for EDA
Inspecting boxplots
Performing t-tests after boxplots
Observing parallel coordinates
Graphing distributions
Plotting scatterplots
Understanding Correlation
Using covariance and correlation
Using nonparametric correlation
Considering chi-square for tables
Working with Cramér’s V
Modifying Data Distributions
Using different statistical distributions
Creating a Z-score standardization
Transforming other notable distributions
Chapter 14 Reducing Dimensionality
Understanding SVD
Looking for dimensionality reduction
Using SVD to measure the invisible
Performing Factor Analysis and PCA
Considering the psychometric model
Looking for hidden factors
Using components, not factors
Achieving dimensionality reduction
Squeezing information with t-SNE
Understanding Some Applications
Recognizing faces with PCA
Extracting topics with NMF
Recommending movies
Chapter 15 Clustering
Clustering with K-means
Understanding centroid-based algorithms
Creating an example with image data
Looking for optimal solutions
Clustering big data
Performing Hierarchical Clustering
Using a hierarchical cluster solution
Visualizing aggregative clustering solutions
Discovering New Groups with DBScan
Chapter 16 Detecting Outliers in Data
Considering Outlier Detection
Finding more things that can go wrong
Understanding anomalies and novel data
Examining a Simple Univariate Method
Leveraging on the Gaussian distribution
Remediating outliers
Developing a Multivariate Approach
Using principal component analysis
Using cluster analysis for spotting outliers
Automating detection with Isolation Forests
Part 5 Learning from Data
Chapter 17 Exploring Four Simple and Effective Algorithms
Guessing the Number: Linear Regression
Defining the family of linear models
Using more variables
Understanding limitations and problems
Moving to Logistic Regression
Applying logistic regression
Considering the case when there are more classes
Making Things as Simple as Naïve Bayes
Finding out that Naïve Bayes isn’t so naïve
Predicting text classifications
Learning Lazily with Nearest Neighbors
Predicting after observing neighbors
Choosing your k parameter wisely
Chapter 18 Performing Cross-Validation, Selection, and Optimization
Pondering the Problem of Fitting a Model
Understanding bias and variance
Defining a strategy for picking models
Dividing between training and test sets
Cross-Validating
Using cross-validation on k folds
Sampling stratifications for complex data
Selecting Variables Like a Pro
Selecting by univariate measures
Employing forward and backward selection
Pumping Up Your Hyperparameters
Implementing a grid search
Trying a randomized search
Chapter 19 Increasing Complexity with Linear and Nonlinear Tricks
Using Nonlinear Transformations
Doing variable transformations
Creating interactions between variables
Regularizing Linear Models
Relying on Ridge regression (L2)
Using the Lasso (L1)
Leveraging regularization
Combining L1 & L2: Elasticnet
Fighting with Big Data Chunk by Chunk
Determining when there is too much data
Implementing Stochastic Gradient Descent
Understanding Support Vector Machines
Relying on a computational method
Fixing many new parameters
Classifying with SVC
Going nonlinear is easy
Performing regression with SVR
Creating a stochastic solution with SVM
Playing with Neural Networks
Understanding neural networks
Classifying and regressing with neurons
Chapter 20 Understanding the Power of the Many
Starting with a Plain Decision Tree
Understanding a decision tree
Creating classification trees
Creating regression trees
Getting Lost in a Random Forest
Making machine learning accessible
Working with a Random Forest classifier
Working with a Random Forest regressor
Optimizing a Random Forest
Boosting Predictions
Knowing that many weak predictors win
Setting a gradient boosting classifier
Running a gradient boosting regressor
Using GBM hyperparameters
Using XGBoost
Part 6 The Part of Tens
Chapter 21 Ten Essential Data Resources
Discovering the News with Reddit
Getting a Good Start with KDnuggets
Locating Free Learning Resources with Quora
Gaining Insights with Oracle’s AI & Data Science Blog
Accessing the Huge List of Resources on Data Science Central
Discovering New Beginner Data Science Methodologies at Data Science 101
Obtaining the Most Authoritative Sources at Udacity
Receiving Help with Advanced Topics at Conductrics
Obtaining the Facts of Open Source Data Science from Springboard
Zeroing In on Developer Resources with Jonathan Bower
Chapter 22 Ten Data Challenges You Should Take
Removing Personally Identifiable Information
Creating a Secure Data Environment
Working with a Multiple-Data-Source Problem
Honing Your Overfit Strategies
Trudging Through the MovieLens Dataset
Locating the Correct Data Source
Working with Handwritten Information
Working with Pictures
Identifying Data Lineage
Interacting with a Huge Graph
Index
EULA