Thoughtful Data Science

Thoughtful Data Science brings new strategies and a carefully crafted programmer's toolset to modern data analysis. The approach is designed specifically to give developers more efficiency and power when producing data analysis and artificial intelligence insights. Industry expert David Taieb bridges the gap between developers and data scientists by creating a modern, open-source, Python-based toolset that works with Jupyter Notebook and PixieDust. You'll find a balance of strategic thinking and practical projects throughout the book, with extensive code files and Jupyter projects that you can integrate with your own data analysis.

Taieb introduces four projects designed to connect developers to important industry use cases in data science. The first is an image recognition application built with TensorFlow, reflecting the growing importance of AI in data analysis. The second analyses social media trends to explore big data issues and natural language processing. The third is a financial portfolio analysis application that uses time series analysis, a technique pivotal to many data science applications today. The fourth applies graph algorithms to solve data problems. Taieb wraps up with a deep look into the future of data science for developers and his views on AI for data science.

Author(s): David Taieb
Publisher: Packt Publishing
Year: 2018

Language: English
Pages: 491
Tags: Data Science

Cover......Page 1
Copyright......Page 3
Packt upsell......Page 5
Contributors......Page 6
Table of Contents......Page 8
Preface......Page 12
What is data science......Page 24
Is data science here to stay?......Page 25
Why is data science on the rise?......Page 26
What does that have to do with developers?......Page 27
Putting these concepts into practice......Page 29
Deep diving into a concrete example......Page 30
Data pipeline blueprint......Page 31
What kind of skills are required to become a data scientist?......Page 33
IBM Watson DeepQA......Page 35
Back to our sentiment analysis of Twitter hashtags project......Page 38
Lessons learned from building our first enterprise-ready data pipeline......Page 42
Data science strategy......Page 43
Jupyter Notebooks at the center of our strategy......Page 45
Why are Notebooks so popular?......Page 46
Summary......Page 48
Chapter 2 - Data Science at Scale with Jupyter Notebooks and PixieDust......Page 50
Why choose Python?......Page 51
Introducing PixieDust......Page 55
SampleData – a simple API for loading data......Page 59
Wrangling data with pixiedust_rosie......Page 65
Display – a simple interactive API for data visualization......Page 72
Filtering......Page 83
Bridging the gap between developers and data scientists with PixieApps......Page 86
Architecture for operationalizing data science analytics......Page 90
Summary......Page 95
Chapter 3 - PixieApp under the Hood......Page 96
Anatomy of a PixieApp......Page 97
Routes......Page 99
Generating requests to routes......Page 102
A GitHub project tracking sample application......Page 103
Displaying the search results in a table......Page 107
Invoking the PixieDust display() API using pd_entity attribute......Page 115
Invoking arbitrary Python code with pd_script......Page 123
Making the application more responsive with pd_refresh......Page 128
Creating reusable widgets......Page 130
Summary......Page 131
Chapter 4 - Deploying PixieApps to the Web with the PixieGateway Server......Page 132
Overview of Kubernetes......Page 133
Installing and configuring the PixieGateway server......Page 135
PixieGateway server configuration......Page 139
PixieGateway architecture......Page 143
Publishing an application......Page 147
Encoding state in the PixieApp URL......Page 151
Sharing charts by publishing them as web pages......Page 152
PixieGateway admin console......Page 157
Python Console......Page 160
Displaying warmup and run code for a PixieApp......Page 161
Summary......Page 162
Chapter 5 - Best Practices and Advanced PixieDust Concepts......Page 164
Create a word cloud image with @captureOutput......Page 165
Increase modularity and code reuse......Page 168
Creating a widget with pd_widget......Page 171
PixieDust support of streaming data......Page 173
Adding streaming capabilities to your PixieApp......Page 176
Adding dashboard drill-downs with PixieApp events......Page 179
Extending PixieDust visualizations......Page 184
Debugging on the Jupyter Notebook using pdb......Page 192
Visual debugging with PixieDebugger......Page 196
Debugging PixieApp routes with PixieDebugger......Page 199
Troubleshooting issues using PixieDust logging......Page 201
Client-side debugging......Page 204
Run Node.js inside a Python Notebook......Page 206
Summary......Page 211
Chapter 6 - Image Recognition with TensorFlow......Page 212
What is machine learning?......Page 213
What is deep learning?......Page 215
Getting started with TensorFlow......Page 218
Simple classification with DNNClassifier......Page 222
Image recognition sample application......Page 234
Part 1 – Load the pretrained MobileNet model......Page 235
Part 2 – Create a PixieApp for our image recognition sample application......Page 243
Part 3 – Integrate the TensorBoard graph visualization......Page 247
Part 4 – Retrain the model with custom training data......Page 253
Summary......Page 265
Chapter 7 - Big Data Twitter Sentiment Analysis......Page 266
Apache Spark architecture......Page 267
Configuring Notebooks to work with Spark......Page 269
Twitter sentiment analysis application......Page 271
Architecture diagram for the data pipeline......Page 272
Authentication with Twitter......Page 273
Creating the Twitter stream......Page 274
Creating a Spark Streaming DataFrame......Page 278
Creating and running a structured query......Page 281
Monitoring active streaming queries......Page 283
Creating a batch DataFrame from the Parquet files......Page 285
Getting started with the IBM Watson Natural Language Understanding service......Page 288
Part 3 – Creating a real-time dashboard PixieApp......Page 296
Refactoring the analytics into their own methods......Page 297
Creating the PixieApp......Page 299
Part 4 – Adding scalability with Apache Kafka and IBM Streams Designer......Page 309
Streaming the raw tweets to Kafka......Page 311
Enriching the tweets data with the Streaming Analytics service......Page 314
Creating a Spark Streaming DataFrame with a Kafka input source......Page 321
Summary......Page 325
Chapter 8 - Financial Time Series Analysis and Forecasting......Page 326
Getting started with NumPy......Page 327
Creating a NumPy array......Page 330
Operations on ndarray......Page 333
Selections on NumPy arrays......Page 335
Broadcasting......Page 336
Statistical exploration of time series......Page 338
Hypothetical investment......Page 346
Autocorrelation function (ACF) and partial autocorrelation function (PACF)......Page 347
Putting it all together with the StockExplorer PixieApp......Page 351
BaseSubApp – base class for all the child PixieApps......Page 356
StockExploreSubApp – first child PixieApp......Page 358
MovingAverageSubApp – second child PixieApp......Page 360
AutoCorrelationSubApp – third child PixieApp......Page 364
Time series forecasting using the ARIMA model......Page 366
Build an ARIMA model for the MSFT stock time series......Page 369
StockExplorer PixieApp Part 2 – add time series forecasting using the ARIMA model......Page 378
Summary......Page 394
Chapter 9 - US Domestic Flight Data Analysis Using Graphs......Page 396
Introduction to graphs......Page 397
Graph representations......Page 398
Graph algorithms......Page 400
Graph and big data......Page 403
Getting started with the networkx graph library......Page 404
Creating a graph......Page 405
Visualizing a graph......Page 407
Part 1 – Loading the US domestic flight data into a graph......Page 408
Graph centrality......Page 417
Part 2 – Creating the USFlightsAnalysis PixieApp......Page 427
Part 3 – Adding data exploration to the USFlightsAnalysis PixieApp......Page 438
Part 4 – Creating an ARIMA model for predicting flight delays......Page 448
Summary......Page 463
Chapter 10 - Final Thoughts......Page 464
Forward thinking – what to expect for AI and data science......Page 465
References......Page 468
Annotations......Page 470
Custom HTML attributes......Page 473
Methods......Page 478
Other Books You May Enjoy......Page 480
Leave a review – let other readers know what you think......Page 482
Index......Page 484