The array of tools for collecting, storing, and gaining insight from data is huge and
getting bigger every day. For people entering the field, that means digging through
hundreds of Web sites and dozens of books to get the basics of working with data at
scale. That’s why this book is a great addition to the Addison-Wesley Data & Analytics
series; it provides a broad overview of tools, techniques, and helpful tips for building
large data analysis systems.
Michael is the perfect author to provide this introduction to Big Data analytics. He
worked on the Cloud Platform Developer Relations team at Google, helping developers with BigQuery, Google’s hosted platform for analyzing terabytes of data quickly.
He brings his breadth of experience to this book, providing practical guidance for
anyone looking to start working with Big Data or anyone looking for additional tips,
tricks, and tools.
The introductory chapters start with guidelines for success with Big Data systems
and introductions to NoSQL, distributed computing, and the CAP theorem. An introduction to analytics at scale using Hadoop and Hive is followed by coverage of real-time analytics with BigQuery. More advanced topics include MapReduce pipelines,
Pig and Cascading, and machine learning with Mahout. Finally, you’ll see examples
of how to blend Python and R into a working Big Data tool chain. Throughout all
of this material are examples that help you work with and learn the tools. All of this
combines to create a perfect book to read for picking up a broad understanding of Big
Data analytics.
—Paul Dix, Series Editor
Author(s): Michael Manoochehri
Series: Data & Analytics Series
Edition: 1st
Publisher: Addison-Wesley
Language: English
Pages: 256
Tags: Computer Science and Engineering; Artificial Intelligence; Data Mining
Contents
Foreword
Preface
Acknowledgments
About the Author
I: Directives in the Big Data Era
When Data Became a BIG Deal
Data and the Single Server
The Big Data Trade-Off
Anatomy of a Big Data Pipeline
Summary
II: Collecting and Sharing a Lot of Data
2 Hosting and Sharing Terabytes of Raw Data
Suffering from Files
Storage: Infrastructure as a Service
Choosing the Right Data Format
Character Encoding
Data in Motion: Data Serialization Formats
Summary
Relational Databases: Command and Control
Relational Databases versus the Internet
Nonrelational Database Models
Leaning toward Write Performance: Redis
Sharding across Many Redis Instances
NewSQL: The Return of Codd
Summary
A Warehouse Full of Jargon
Hadoop: The Elephant in the Warehouse
Data Silos Can Be Good
Convergence: The End of the Data Silo
Summary
III: Asking Questions about Your Data
What Is a Data Warehouse?
Apache Hive: Interactive Querying for Hadoop
Shark: Queries at the Speed of RAM
Data Warehousing in the Cloud
Summary
Analytical Databases
Dremel: Spreading the Wealth
BigQuery: Data Analytics as a Service
Building a Custom Big Data Dashboard
The Future of Analytical Query Engines
Summary
7 Visualization Strategies for Exploring Large Datasets
Cautionary Tales: Translating Data into Narrative
Human Scale versus Machine Scale
Building Applications for Data Interactivity
Summary
IV: Building Data Pipelines
What Is a Data Pipeline?
Data Pipelines with Hadoop Streaming
A One-Step MapReduce Transformation
Managing Complexity: Python MapReduce Frameworks for Hadoop
Summary
9 Building Data Transformation Workflows with Pig and Cascading
It’s Complicated: Multistep MapReduce Transformations
Cascading: Building Robust Data-Workflow Applications
Summary
V: Machine Learning for Large Datasets
10 Building a Data Classification System with Mahout
Challenges of Machine Learning
Apache Mahout: Scalable Machine Learning
MLBase: Distributed Machine Learning Framework
Summary
VI: Statistical Analysis for Massive Datasets
11 Using R with Large Datasets
Why Statistics Are Sexy
Strategies for Dealing with Large Datasets
Summary
The Snakes Are Loose in the Data Zoo
Python Libraries for Data Processing
Building More Complex Workflows
iPython: Completing the Scientific Computing Tool Chain
Summary
VII: Looking Ahead
Overlapping Solutions
Understanding Your Data Problem
A Playbook for the Build versus Buy Problem
My Own Private Data Center
Understand the Costs of Open-Source
Summary
14 The Future: Trends in Data Technology
Hadoop: The Disruptor and the Disrupted
Everything in the Cloud
The Rise and Fall of the Data Scientist
Convergence: The Ultimate Database
Convergence of Cultures
Summary
Index