Whether you’re a mathematician, seasoned data scientist, or marketing professional, you’ll find The Shape of Data to be the perfect introduction to the critical interplay between the geometry of data structures and machine learning.
This book’s extensive collection of case studies (drawn from medicine, education, sociology, linguistics, and more) and gentle explanations of the math behind dozens of algorithms provide a comprehensive yet accessible look at how geometry shapes the algorithms that drive data analysis.
Throughout the book’s mathematical data analytics tour, we encounter the origins of data analysis on structured data and the many seemingly unstructured data scenarios that can be turned into structured data, which enables standard machine learning algorithms to deliver predictive and prescriptive insights. As we ride through the valleys and peaks of our data, we learn to collect features along the way that become key inputs into other data layers, forming geometrical interpretations of varying unstructured data sources, including network data, images, and text-based data. In addition, Farrelly and Gaba are masterful in detailing foundational and advanced concepts, supported by well-defined examples in both R and Python that are available for download from the book’s web page.
This book will be relevant and captivating to beginners and devoted experts alike. First-time travelers will find it easy to dive into algorithm examples designed for analyzing network data, including social and geographic networks, as well as the local and global metrics used to understand network structure and the role of individuals in the network. The discussion covers clustering methods developed for use on network data, link prediction algorithms that suggest new edges in a network, and tools for understanding how, for example, processes or epidemics spread through networks.
Advanced readers will find it intriguing to dive into emerging topics such as replacing linear algebra with nonlinear algebra in machine learning algorithms and using exterior calculus to quantify needs in disaster planning. The Shape of Data has made me want to roll up my sleeves and dive into many new challenges, because I feel as well equipped as Lara Croft in Tomb Raider thanks to Farrelly’s tremendous treasure map and deeply insightful exploration work. Could there be a hidden bond or “hidden layer” between them?
In addition to gaining a deeper understanding of how to implement geometry-based algorithms with code, you’ll explore:
• Supervised and unsupervised learning algorithms and their application to network data analysis
• The way distance metrics and dimensionality reduction impact machine learning
• How to visualize, embed, and analyze survey and text data with topology-based algorithms
• New approaches to computational solutions, including distributed computing and quantum algorithms
Author(s): Colleen M. Farrelly; Yaé Ulrich Gaba
Publisher: No Starch Press
Year: 2023
Language: English
Pages: 264
Cover
Praise for The Shape of Data
Title Page
Copyright
Dedication
About the Authors
Foreword
Acknowledgments
Introduction
Who Is This Book For?
About This Book
Downloading and Installing R
Installing R Packages
Getting Help with R
Support for Python Users
Summary
Chapter 1: The Geometric Structure of Data
Machine Learning Categories
Supervised Learning
Unsupervised Learning
Matching Algorithms and Other Machine Learning
Structured Data
The Geometry of Dummy Variables
The Geometry of Numerical Spreadsheets
The Geometry of Supervised Learning
Unstructured Data
Network Data
Image Data
Text Data
Summary
Chapter 2: The Geometric Structure of Networks
The Basics of Network Theory
Directed Networks
Networks in R
Paths and Distance in a Network
Network Centrality Metrics
The Degree of a Vertex
The Closeness of a Vertex
The Betweenness of a Vertex
Eigenvector Centrality
PageRank Centrality
Katz Centrality
Hub and Authority
Measuring Centrality in an Example Social Network
Additional Quantities of a Network
The Diversity of a Vertex
Triadic Closure
The Efficiency and Eccentricity of a Vertex
Forman–Ricci Curvature
Global Network Metrics
The Interconnectivity of a Network
Spreading Processes on a Network
Spectral Measures of a Network
Network Models for Real-World Behavior
Erdős–Rényi Graphs
Scale-Free Graphs
Watts–Strogatz Graphs
Summary
Chapter 3: Network Analysis
Using Network Data for Supervised Learning
Making Predictions with Social Media Network Metrics
Predicting Network Links in Social Media
Using Network Data for Unsupervised Learning
Applying Clustering to the Social Media Dataset
Community Mining in a Network
Comparing Networks
Analyzing Spread Through Networks
Tracking Disease Spread Between Towns
Tracking Disease Spread Between Windsurfers
Disrupting Communication and Disease Spread
Summary
Chapter 4: Network Filtration
Graph Filtration
From Graphs to Simplicial Complexes
Examples of Betti Numbers
The Euler Characteristic
Persistent Homology
Comparison of Networks with Persistent Homology
Summary
Chapter 5: Geometry in Data Science
Common Distance Metrics
Simulating a Small Dataset
Using Norm-Based Distance Metrics
Comparing Diagrams, Shapes, and Probability Distributions
K-Nearest Neighbors with Metric Geometry
Manifold Learning
Using Multidimensional Scaling
Extending Multidimensional Scaling with Isomap
Capturing Local Properties with Locally Linear Embedding
Visualizing with t-Distributed Stochastic Neighbor Embedding
Fractals
Summary
Chapter 6: Newer Applications of Geometry in Machine Learning
Working with Nonlinear Spaces
Introducing dgLARS
Predicting Depression with dgLARS
Predicting Credit Default with dgLARS
Applying Discrete Exterior Derivatives
Nonlinear Algebra in Machine Learning Algorithms
Comparing Choice Rankings with HodgeRank
Summary
Chapter 7: Tools for Topological Data Analysis
Finding Distinctive Groups with Unique Behavior
Validating Measurement Tools
Using the Mapper Algorithm for Subgroup Mining
Stepping Through the Mapper Algorithm
Using TDAmapper to Find Cluster Structures in Data
Summary
Chapter 8: Homotopy Algorithms
Introducing Homotopy
Introducing Homotopy-Based Regression
Comparing Results on a Sample Dataset
Summary
Chapter 9: Final Project: Analyzing Text Data
Building a Natural Language Processing Pipeline
The Project: Analyzing Language in Poetry
Tokenizing Text Data
Tagging Parts of Speech
Normalizing Vectors
Analyzing the Poem Dataset in R
Using Topology-Based NLP Tools
Summary
Chapter 10: Multicore and Quantum Computing
Multicore Approaches to Topological Data Analysis
Quantum Computing Approaches
Using the Qubit-Based Model
Using the Qumodes-Based Model
Using Quantum Network Algorithms
Speeding Up Algorithms with Quantum Computing
Using Image Classifiers on Quantum Computers
Summary
References
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 6 Datasets
Chapter 9 Dataset Poems
Index