Introducing Data Science: Big Data, Machine Learning, and More, using Python Tools

Author(s): Davy Cielen; Arno D.B. Meysman; Mohamed Ali
Publisher: Manning Publications
Year: 2016

Language: English
Pages: 320

Front cover
brief contents
contents
preface
acknowledgments
about this book
Roadmap
Whom this book is for
Code conventions and downloads
about the authors
Author Online
about the cover illustration
1 Data science in a big data world
1.1 Benefits and uses of data science and big data
1.2 Facets of data
1.2.1 Structured data
1.2.2 Unstructured data
1.2.3 Natural language
1.2.4 Machine-generated data
1.2.5 Graph-based or network data
1.2.6 Audio, image, and video
1.2.7 Streaming data
1.3 The data science process
1.3.1 Setting the research goal
1.3.2 Retrieving data
1.3.3 Data preparation
1.3.4 Data exploration
1.3.5 Data modeling or model building
1.3.6 Presentation and automation
1.4 The big data ecosystem and data science
1.4.1 Distributed file systems
1.4.2 Distributed programming framework
1.4.3 Data integration framework
1.4.4 Machine learning frameworks
1.4.5 NoSQL databases
1.4.6 Scheduling tools
1.4.7 Benchmarking tools
1.4.8 System deployment
1.4.9 Service programming
1.4.10 Security
1.5 An introductory working example of Hadoop
1.6 Summary
2 The data science process
2.1 Overview of the data science process
2.1.1 Don’t be a slave to the process
2.2 Step 1: Defining research goals and creating a project charter
2.2.1 Spend time understanding the goals and context of your research
2.2.2 Create a project charter
2.3 Step 2: Retrieving data
2.3.1 Start with data stored within the company
2.3.2 Don’t be afraid to shop around
2.3.3 Do data quality checks now to prevent problems later
2.4 Step 3: Cleansing, integrating, and transforming data
2.4.1 Cleansing data
2.4.2 Correct errors as early as possible
2.4.3 Combining data from different data sources
2.4.4 Transforming data
2.5 Step 4: Exploratory data analysis
2.6 Step 5: Build the models
2.6.1 Model and variable selection
2.6.2 Model execution
2.6.3 Model diagnostics and model comparison
2.7 Step 6: Presenting findings and building applications on top of them
2.8 Summary
3 Machine learning
3.1 What is machine learning and why should you care about it?
3.1.1 Applications for machine learning in data science
3.1.2 Where machine learning is used in the data science process
3.1.3 Python tools used in machine learning
3.2 The modeling process
3.2.1 Engineering features and selecting a model
3.2.2 Training your model
3.2.3 Validating a model
3.2.4 Predicting new observations
3.3 Types of machine learning
3.3.1 Supervised learning
3.3.2 Unsupervised learning
3.4 Semi-supervised learning
3.5 Summary
4 Handling large data on a single computer
4.1 The problems you face when handling large data
4.2 General techniques for handling large volumes of data
4.2.1 Choosing the right algorithm
4.2.2 Choosing the right data structure
4.2.3 Selecting the right tools
4.3 General programming tips for dealing with large data sets
4.3.1 Don’t reinvent the wheel
4.3.2 Get the most out of your hardware
4.3.3 Reduce your computing needs
4.4 Case study 1: Predicting malicious URLs
4.4.1 Step 1: Defining the research goal
4.4.2 Step 2: Acquiring the URL data
4.4.3 Step 4: Data exploration
4.4.4 Step 5: Model building
4.5 Case study 2: Building a recommender system inside a database
4.5.1 Tools and techniques needed
4.5.2 Step 1: Research question
4.5.3 Step 3: Data preparation
4.5.4 Step 5: Model building
4.5.5 Step 6: Presentation and automation
4.6 Summary
5 First steps in big data
5.1 Distributing data storage and processing with frameworks
5.1.1 Hadoop: a framework for storing and processing large data sets
5.1.2 Spark: replacing MapReduce for better performance
5.2 Case study: Assessing risk when loaning money
5.2.1 Step 1: The research goal
5.2.2 Step 2: Data retrieval
5.2.3 Step 3: Data preparation
5.2.4 Step 4: Data exploration & Step 6: Report building
5.3 Summary
6 Join the NoSQL movement
6.1 Introduction to NoSQL
6.1.1 ACID: the core principle of relational databases
6.1.2 CAP Theorem: the problem with DBs on many nodes
6.1.3 The BASE principles of NoSQL databases
6.1.4 NoSQL database types
6.2 Case study: What disease is that?
6.2.1 Step 1: Setting the research goal
6.2.2 Steps 2 and 3: Data retrieval and preparation
6.2.3 Step 4: Data exploration
6.2.4 Step 3 revisited: Data preparation for disease profiling
6.2.5 Step 4 revisited: Data exploration for disease profiling
6.2.6 Step 6: Presentation and automation
6.3 Summary
7 The rise of graph databases
7.1 Introducing connected data and graph databases
7.1.1 Why and when should I use a graph database?
7.2 Introducing Neo4j: a graph database
7.2.1 Cypher: a graph query language
7.3 Connected data example: a recipe recommendation engine
7.3.1 Step 1: Setting the research goal
7.3.2 Step 2: Data retrieval
7.3.3 Step 3: Data preparation
7.3.4 Step 4: Data exploration
7.3.5 Step 5: Data modeling
7.3.6 Step 6: Presentation
7.4 Summary
8 Text mining and text analytics
8.1 Text mining in the real world
8.2 Text mining techniques
8.2.1 Bag of words
8.2.2 Stemming and lemmatization
8.2.3 Decision tree classifier
8.3 Case study: Classifying Reddit posts
8.3.1 Meet the Natural Language Toolkit
8.3.2 Data science process overview and step 1: The research goal
8.3.3 Step 2: Data retrieval
8.3.4 Step 3: Data preparation
8.3.5 Step 4: Data exploration
8.3.6 Step 3 revisited: Data preparation adapted
8.3.7 Step 5: Data analysis
8.3.8 Step 6: Presentation and automation
8.4 Summary
9 Data visualization to the end user
9.1 Data visualization options
9.2 Crossfilter, the JavaScript MapReduce library
9.2.1 Setting up everything
9.2.2 Unleashing Crossfilter to filter the medicine data set
9.3 Creating an interactive dashboard with dc.js
9.4 Dashboard development tools
9.5 Summary
Appendix A—Setting up Elasticsearch
A.1 Linux installation
A.2 Windows installation
Appendix B—Setting up Neo4j
B.1 Linux installation
B.2 Windows installation
Appendix C—Installing MySQL server
C.1 Windows installation
C.2 Linux installation
Appendix D—Setting up Anaconda with a virtual environment
D.1 Linux installation
D.2 Windows installation
D.3 Setting up the environment
index
Symbols
Numerics
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
Back cover