Algorithms and Data Structures for Massive Datasets

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

The unprecedented growth of data in recent years is putting the spotlight on the data structures and algorithms that can efficiently handle large datasets. In this book, we present you with a basic suite of data structures and algorithms designed to index, query, and analyze massive data. What prompted us to write this book is that many of the novel data structures and algorithms that run underneath Google, Facebook, Dropbox and many others, are making their way into the mainstream algorithms curricula very slowly. Often the main resources on this subject are research papers filled with sophisticated and enlightening theory, but with little instruction on how to configure the data structures in a practical setting, or when to use them. Our goal was to present these exciting and cutting-edge topics in one place, in a practical and friendly tone. Mathematical intuition is important for understanding the subject, and we try to cultivate it without including a single proof. Plentiful illustrations are used to illuminate some of the more challenging material. Large datasets arise in a variety of disciplines, from bioinformatics and finance, to sensor data and social networks, and our use cases are designed to reflect that.

Author(s): Dzejla Medjedovic, Emin Tahirovic
Series: Manning Early Access Program
Publisher: Manning Publications
Year: 2021

Language: English

Algorithms and Data Structures for Massive Datasets MEAP V03
Copyright
Welcome
Brief contents
1: Introduction
1.1 An example
1.1.1 An example: how to solve it
1.1.2 An example: how to solve it, take two
1.2 The structure of this book
1.3 What makes this book different and whom it is for
1.4 Why is massive data so challenging for today’s systems?
1.4.1 The CPU-memory performance gap
1.4.2 Memory hierarchy
1.4.3 What about distributed systems?
1.5 Summary
2: Review of Hash Tables and Modern Hashing
2.1 Ubiquitous hashing
2.2 A crash course on data structures
2.3 Usage scenarios in modern systems
2.3.1 Deduplication in backup/storage solutions
2.3.2 Plagiarism detection with MOSS and Rabin-Karp fingerprinting
2.4 O(1) --- what’s the big deal?
2.5 Collision Resolution: theory vs. practice
2.6 Usage scenario: How Python’s dict does it
2.7 MurmurHash
2.8 Hash Tables for Distributed Systems: Consistent Hashing
2.8.1 A typical hashing problem?
2.8.2 Hashring
2.8.3 Lookup
2.8.4 Adding a new node/resource
2.8.5 Removing a node
2.8.6 Consistent hashing scenario: Chord
2.9 Summary
3: Approximate Membership and Bloom Filter
3.1 How It Works
3.1.1 Insert
3.1.2 Lookup
3.2 Use Cases
3.2.1 Bloom Filters in Networks: Squid
3.2.2 Bitcoin mobile app
3.3 Configuring a Bloom filter for your application
3.3.1 Examples
3.4 A bit of theory
3.4.1 Can we do better?
3.5 Further reading: Bloom filter adaptations and alternatives
3.6 Quotient filter
3.6.1 Quotienting
3.6.2 Resizing
3.7 Summary
4: Frequency Estimation and Count-Min Sketch
4.1 Streaming data
4.2 Count-min sketch: how it works
4.2.1 Update
4.2.2 Estimate
4.2.3 Space and error in count-min sketch
4.3 Use cases
4.3.1 Top-k restless sleepers
4.3.2 Scaling distributional similarity of words
4.4 Range queries with count-min sketch
4.5 Approximate heavy hitters
4.5.1 Majority element
4.5.2 General heavy hitters
4.6 Summary
5: Cardinality Estimation and HyperLogLog
5.1 Counting distinct items in databases
5.2 HyperLogLog incremental design
5.2.1 The first cut --- probabilistic counting
5.2.2 Stochastic averaging or, when life gives you lemons…
5.2.3 LogLog
5.2.4 HyperLogLog --- Stochastic averaging with harmonic mean
5.3 Use case: catching worms with HLL
5.4 But how does it actually work? A mini experiment
5.4.1 The effect of the number of buckets ()
5.5 Use case: Aggregation using HyperLogLog
5.6 Summary
6: Streaming Data: Bringing Everything Together
6.1 Streaming Data System – a meta-example
6.1.1 Bloom-join
6.1.2 De-duplication
6.1.3 Load balancing and tracking the network traffic
6.2 The Future is coming: in discrete batches or as a continuous stream ?
6.3 Practical constraints and concepts in data streams
6.3.1 Time
6.3.2 Small time and small space
6.3.3 Concept shifts and concept drifts
6.3.4 Sliding window model
6.4 Summary