For small businesses, analyzing the information contained in their data using open source technology could be game-changing. All you need is some basic programming and mathematical skills to do just that.

Overview
Explore how to analyze your data in various innovative ways and turn it into insight
Learn to use the D3.js visualization tool for exploratory data analysis
Understand how to work with graphs and social data analysis
Discover how to perform advanced query techniques and run MapReduce on MongoDB

In Detail
Plenty of small businesses face large amounts of data but lack the internal skills to support quantitative analysis. Understanding how to harness the power of data analysis using the latest open source technology can help them provide better customer service, visualize customer needs, or obtain fresh insights about the performance of previous products. Practical Data Analysis is ideal for home and small business users who want to slice and dice the data they have on hand with minimum hassle.

Practical Data Analysis is a hands-on guide to understanding the nature of your data and turning it into insight. It introduces machine learning techniques, social network analytics, and econometrics to help your clients get insights from the pool of data they have at hand. It also covers data preparation and processing for several kinds of data, such as text, images, graphs, documents, and time series.

Practical Data Analysis presents a detailed exploration of current work in data analysis through self-contained projects. First you will explore the basics of data preparation and transformation with OpenRefine. Then you will get started with exploratory data analysis using the D3.js visualization framework. You will also be introduced to machine learning techniques such as classification, regression, and clustering through practical projects such as spam classification, predicting gold prices, and finding clusters in your Facebook friends' network. You will learn how to solve problems in text classification, simulation, time series forecasting, social media, and MapReduce through detailed projects. Finally, you will work with large amounts of Twitter data, using MapReduce to perform a sentiment analysis implemented in Python and MongoDB.

Practical Data Analysis combines carefully selected algorithms with data scrubbing techniques that enable you to turn your data into insight.

What you will learn from this book
Work with data to get meaningful results from your data analysis projects
Visualize your data to find trends and correlations
Build your own image similarity search engine
Learn how to forecast numerical values from time series data
Create an interactive visualization for your social media graph
Explore the MapReduce framework in MongoDB
Create interactive simulations with D3.js

Approach
Practical Data Analysis is a practical, step-by-step guide that empowers small businesses to manage and analyze their data and extract valuable information from it.

Who this book is written for
This book is for developers, small business users, and analysts who want to implement data analysis and visualization for their company in a practical way. You need no prior experience with data analysis or data processing; however, basic knowledge of programming, statistics, and linear algebra is assumed.
Author(s): Hector Cuesta; Sampath Kumar
Edition: 2
Publisher: Packt Publishing
Year: 2016
Cover
Copyright
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Table of Contents
Preface
Chapter 1: Getting Started
Computer science
Artificial intelligence
Machine learning
Statistics
Mathematics
Knowledge domain
Data, information, and knowledge
Inter-relationship between data, information, and knowledge
The nature of data
The data analysis process
The problem
Data preparation
Data exploration
Predictive modeling
Visualization of results
Quantitative versus qualitative data analysis
Importance of data visualization
What about big data?
Quantified self
Sensors and cameras
Social network analysis
Tools and toys for this book
Why Python?
Why mlpy?
Why D3.js?
Why MongoDB?
Summary
Chapter 2: Preprocessing Data
Data sources
Open data
Text files
Excel files
SQL databases
NoSQL databases
Multimedia
Web scraping
Data scrubbing
Statistical methods
Text parsing
Data transformation
Data formats
Parsing a CSV file with the CSV module
Parsing CSV file using NumPy
JSON
Parsing JSON file using the JSON module
XML
Parsing XML in Python using the XML module
YAML
Data reduction methods
Filtering and sampling
Binned algorithm
Dimensionality reduction
Getting started with OpenRefine
Text facet
Clustering
Text filters
Numeric facets
Transforming data
Exporting data
Operation history
Summary
Chapter 3: Getting to Grips with Visualization
What is visualization?
Working with web-based visualization
Exploring scientific visualization
Visualization in art
The visualization life cycle
Visualizing different types of data
HTML
DOM
CSS
JavaScript
SVG
Getting started with D3.js
Bar chart
Pie chart
Scatter plots
Single line chart
Multiple line chart
Interaction and animation
Data from social networks
An overview of visual analytics
Summary
Chapter 4: Text Classification
Learning and classification
Bayesian classification
Naïve Bayes
E-mail subject line tester
The data
The algorithm
Classifier accuracy
Summary
Chapter 5: Similarity-Based Image Retrieval
Image similarity search
Dynamic time warping
Processing the image dataset
Implementing DTW
Analyzing the results
Summary
Chapter 6: Simulation of Stock Prices
Financial time series
Random Walk simulation
Monte Carlo methods
Generating random numbers
Implementation in D3.js
Quantitative analyst
Summary
Chapter 7: Predicting Gold Prices
Working with time series data
Components of a time series
Smoothing time series
Linear regression
The data – historical gold prices
Nonlinear regressions
Kernel Ridge Regressions
Smoothing the gold prices time series
Predicting in the smoothed time series
Contrasting the predicted value
Summary
Chapter 8: Working with Support Vector Machines
Understanding the multivariate dataset
Dimensionality reduction
Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)
Getting started with SVM
Kernel functions
The double spiral problem
SVM implemented on mlpy
Summary
Chapter 9: Modeling Infectious Diseases with Cellular Automata
Introduction to epidemiology
The epidemiology triangle
The epidemic models
The SIR model
Solving the ordinary differential equation for the SIR model with SciPy
The SIRS model
Modeling with Cellular Automaton
Cell, state, grid, neighborhood
Global stochastic contact model
Simulation of the SIRS model in CA with D3.js
Summary
Chapter 10: Working with Social Graphs
Structure of a graph
Undirected graph
Directed graph
Social networks analysis
Acquiring the Facebook graph
Working with graphs using Gephi
Statistical analysis
Male to female ratio
Degree distribution
Histogram of a graph
Centrality
Transforming GDF to JSON
Graph visualization with D3.js
Summary
Chapter 11: Working with Twitter Data
The anatomy of Twitter data
Tweet
Followers
Trending topics
Using OAuth to access Twitter API
Getting started with Twython
Simple search using Twython
Working with timelines
Working with followers
Working with places and trends
Working with user data
Streaming API
Summary
Chapter 12: Data Processing and Aggregation with MongoDB
Getting started with MongoDB
Database
Collection
Document
Mongo shell
Insert/Update/Delete
Queries
Data preparation
Data transformation with OpenRefine
Inserting documents with PyMongo
Group
Aggregation framework
Pipelines
Expressions
Summary
Chapter 13: Working with MapReduce
An overview of MapReduce
Programming model
Using MapReduce with MongoDB
Map function
Reduce function
Using mongo shell
Using Jupyter
Using PyMongo
Filtering the input collection
Grouping and aggregation
Counting the most common words in tweets
Summary
Chapter 14: Online Data Analysis with Jupyter and Wakari
Getting started with Wakari
Creating an account in Wakari
Getting started with IPython notebook
Data visualization
Introduction to image processing with PIL
Opening an image
Working with an image histogram
Filtering
Operations
Transformations
Getting started with pandas
Working with Time Series
Working with multivariate datasets with DataFrame
Grouping, Aggregation, and Correlation
Sharing your Notebook
The data
Summary
Chapter 15: Understanding Data Processing using Apache Spark
Platform for data processing
The Cloudera platform
Installing Cloudera VM
An introduction to the distributed file system
First steps with Hadoop Distributed File System – HDFS
File management with HUE – web interface
An introduction to Apache Spark
The Spark ecosystem
The Spark programming model
An introductory working example of Apache Spark
Summary
Index