Massive Graph Analytics

This document was uploaded by one of our users. The uploader already confirmed that they had the permission to publish it. If you are author/publisher or own the copyright of this documents, please report to us by using this DMCA report form.

Simply click on the Download Book button.

Yes, Book downloads on Ebookily are 100% Free.

Sometimes the book is free on Amazon As well, so go ahead and hit "Search on Amazon"

"Graphs. Such a simple idea. Map a problem onto a graph then solve it by searching over the graph or by exploring the structure of the graph. What could be easier? Turns out, however, that working with graphs is a vast and complex field. Keeping up is challenging. To help keep up, you just need an editor who knows most people working with graphs, and have that editor gather nearly 70 researchers to summarize their work with graphs. The result is the book Massive Graph Analytics." — Timothy G. Mattson, Senior Principal Engineer, Intel Corp Expertise in massive-scale graph analytics is key for solving real-world grand challenges from healthcare to sustainability to detecting insider threats, cyber defense, and more. This book provides a comprehensive introduction to massive graph analytics, featuring contributions from thought leaders across academia, industry, and government. Massive Graph Analytics will be beneficial to students, researchers, and practitioners in academia, national laboratories, and industry who wish to learn about the state-of-the-art algorithms, models, frameworks, and software in massive-scale graph analytics.

Author(s): BARDERS,DAVID. A.
Publisher: CRC PRESS
Year: 2022

Language: English
Commentary: Massive Graph Analytics, algorithms, models, frameworks, and software in massive-scale graph analytics.
Pages: 632
Tags: Massive Graph Analytics, algorithms, models, frameworks, and software in massive-scale graph analytics.

Cover
Half Title
Series Page
Title Page
Copyright Page
Table of Contents
Editor
Contributors
Introduction
SECTION I: Algorithms: Search and Paths
Chapter 1 A Work-Efficient Parallel Breadth-First Search Algorithm (or How To Cope With the Nondeterminism of Reducers)
1.1 Introduction
1.1.1 Outline
1.2 Background on Dynamic Multithreading
1.2.1 Multithreaded Pseudocode
1.2.2 Reducers
1.3 The PBFS Algorithm
1.4 The Bag Data Structure
1.4.1 Pennants
1.4.2 Bags
1.4.3 Optimization
1.5 Experimental Results
1.5.1 Implementation and Testing
1.5.2 Results
1.6 Background on the Dag Model
1.6.1 The Dag Model
1.6.2 Determinacy
1.6.3 Work and Span
1.6.4 Scheduling
1.7 Modeling Reducers
1.8 Analysis of Programs with Nonconstant-Time Reducers
1.8.1 Scheduling with Reducers
1.8.2 Analyzing a Performance Dag
1.8.3 Relating the Performance and User Dags
1.9 Analyzing PBFS
1.10 Conclusion
Acknowledgments
References
Chapter 2 Multi-Objective Shortest Paths
2.1 Introduction
2.2 Preliminaries
2.3 More Related Work
2.4 The Basic Algorithm
2.4.1 Example
2.4.2 Algorithm Details
2.5 A Bicriteria Instantiation – Theory
2.5.1 Node Local Operations
2.5.2 The Pareto Queue Q
2.5.3 Additional Improvements
2.6 A Practical Bicriteria Implementation
2.6.1 A Pareto Queue Based on Weight-Balanced B-trees
2.6.2 Further Practical Improvements
2.7 Experimental Evaluation
2.7.1 Pareto Queue Evaluation
2.7.2 Parallel Bi-Objective Search
2.7.3 Evaluation of Parallel Bi-Objective Search
2.8 Single Target Search
2.8.1 Number of Steps of Goal Directed Search
2.9 More Objectives
2.10 Conclusion
Acknowledgments
References
SECTION II: Algorithms: Structure
Chapter 3 Multicore Algorithms for Graph Connectivity Problems
3.1 Introduction
3.1.1 Contributions
3.2 Background
3.2.1 Strongly Connected Components
3.2.1.1 Serial Algorithms
3.2.1.2 Forward-Backward
3.2.1.3 Coloring
3.2.1.4 Other Parallel SCC Approaches
3.2.2 Connected and Weakly Connected Components
3.2.3 Biconnected Components
3.3 Multistep Method For SCC Decomposition
3.3.1 Observations
3.3.2 Description of Method
3.3.3 Work Complexity
3.4 Applying the Multistep Method
3.4.1 Trim Step
3.4.2 Breadth-First Search
3.4.3 Coloring
3.4.4 Serial Step
3.4.5 Connected Components and Weakly Connected Components
3.4.6 Biconnected Components
3.5 Experimental Setup
3.6 Experimental Results
3.6.1 Strongly Connected Component Decomposition
3.6.2 Connected Components Decomposition
3.6.3 Biconnected Component Decomposition
3.7 Conclusions
References
Chapter 4 Distributed Memory Parallel Algorithms for Massive Graphs
4.1 Introduction
4.2 Generating Random Graphs
4.2.1 Erdös–Rényi Model
4.2.1.1 Efficient Sequential Algorithm
4.2.1.2 Distributed Memory Parallel Algorithm
4.2.2 Preferential Attachment Model
4.2.3 Switching Edges of a Graph and Generating Random Graphs with Given Degree Sequences
4.3 Counting Triangles
4.3.1 An Algorithm using MPI
4.3.2 Algorithms using MapReduce
4.4 Community Detection
4.5 Giraph: A Distributed Graph Processing Framework
4.5.1 Parallel Algorithm for the Single Source Shortest Path Problem using Giraph
4.6 Breadth-First Search
References
Chapter 5 Efficient Multi-core Algorithms for Computing Spanning Forests and Connected Components
5.1 Introduction
5.2 Previous Work
5.3 Union-Find Algorithms
5.4 Parallelizing Rem’s Algorithm
5.4.1 Using Locks
5.4.2 A New Lock-Free Approach
5.4.3 A Local Algorithm
5.5 Experiments
5.6 Concluding Remarks
Acknowledgments
References
Chapter 6 Massive-Scale Distributed Triangle Computation and Applications
6.1 Introduction
6.1.1 Preliminary Definitions and Terminology
6.1.2 Motivation for Enumerating Triangles
6.1.3 System and Algorithm Design Criteria
6.2 Algorithmic Primer for Exact Computation
6.2.1 IEEE HPEC Graph Challenge
6.2.2 Classical Graph Algorithms
6.2.3 Linear-Algebraic Techniques
6.2.3.1 Sparse Linear Algebra
6.2.3.2 Dense Linear Algebra
6.3 Distributed Asynchronous Triangle Enumeration
6.3.1 An Asynchronous Distributed Graph Framework
6.3.1.1 Communication
6.3.1.2 Effects of Aggregation and Routing
6.3.2 Triangle Counting at Massive Scale
6.3.2.1 1.5M CPUs on IBM BG/Q Sequoia
6.3.2.2 Scaling on a Linux Cluster
6.3.3 Asynchronous Distributed κ-Truss Decomposition
6.3.4 Avoiding Communication
6.4 Approximate Triangle Counting
6.4.1 Streaming Graph Algorithms
6.4.2 A Historical Survey of Approximate Triangle Counting
6.4.2.1 Sparsification Algorithms
6.4.2.2 Global Approximation Algorithms
6.4.2.3 Vertex-Local Approximation Algorithms
6.4.3 Graph Sketching
6.4.3.1 Triangle Count Estimation via Sketches
6.4.4 Discussion of Sampling Versus Sketching
References
SECTION III: Algorithms and Applications
Chapter 7 Computing Top-k Closeness Centrality in Fully Dynamic Graphs
7.1 Introduction
7.2 Preliminaries
7.2.1 Notation and Problem Definition
7.2.2 Related Work
7.2.3 Static Top-k Closeness
7.3 Dynamic Top-k Closeness Centrality
7.3.1 Updating the Number of Reachable Vertices
7.3.2 Finding Affected Vertices
7.3.3 Update After an Edge Insertion
7.3.4 Update After an Edge Removal
7.3.5 Running Times and Memory Requirements
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Speedups on Recomputation
7.5 Conclusions
Acknowledgments
7.6 Overview of Networks
7.7 Additional Experimental Results
7.7.1 Speedups for Complex Networks
7.7.2 Speedups for Road Networks
7.7.3 Running Times for Complex Networks
7.7.4 Running Times for Road Networks
References
Chapter 8 Ordering Heuristics for Parallel Graph Coloring
8.1 Introduction
8.1.1 Ordering Heuristics
8.1.2 Parallel Greedy Coloring
8.1.3 Outline
8.2 The Jones-Plassmann Algorithm
8.2.1 The Dag Model of Dynamic Multithreading
8.2.2 Analysis of JP
8.3 JP with Random Ordering
8.4 The LF and SL Heuristics
8.4.1 The LF Ordering Heuristic
8.4.2 The SL Ordering Heuristic
8.5 Log Ordering Heuristics
8.5.1 The LLF Ordering Heuristic
8.5.2 The SLL Ordering Heuristic
8.6 Empirical Evaluation
8.6.1 Experimental Setup
8.6.2 Coloring Quality of R, LLF, and SLL
8.6.3 Scalability of JP-R, JP-LLF, and JP-SLL
8.7 Implementation Techniques
8.7.1 Join Trees for Reducing Memory Contention
8.7.2 Bit Vectors for Assigning Colors
8.7.3 Software Prefetching
8.8 The SD Heuristic
8.9 Related Work
8.10 Conclusion
Acknowledgments
8.11 Appendix: Performance of Serial Ordering Heuristics
References
Chapter 9 Partitioning Trillion-Edge Graphs
9.1 Introduction
9.2 Background
9.2.1 Graph Partitioning
9.2.2 Related Work
9.3 XTRAPULP
9.3.1 XTRAPULP Overview
9.3.1.1 Graph Representation
9.3.1.2 XTRAPULP Algorithm
9.3.2 XTRAPULP Initialization
9.3.2.1 ExchangeUpdates
9.3.3 XTRAPULP Vertex-Balancing Phase
9.3.4 XTRAPULP Refinement Phase
9.3.5 XTRAPULP Edge-Balancing Phase
9.4 Experimental Setup
9.5 Results
9.5.1 Performance and Scalability
9.5.1.1 Scaling on Blue Waters
9.5.1.2 Trillion-Edge Runs
9.5.1.3 Scaling on Compton
9.5.2 Partitioning Quality
9.6 Conclusions
References
Chapter 10 New Phenomena in Large-Scale Internet Traffic
10.1 Introduction
10.2 Methodology
10.2.1 MAWI and CAIDA Internet Traffic Collection
10.2.2 Network Quantities from Matrices
10.2.3 Memory and Computation Requirements
10.3 Internet Traffic Modeling
10.3.1 Logarithmic Pooling
10.3.2 Modified Zipf–Mandelbrot Model
10.3.3 Nonlinear Model Fitting
10.4 Results
10.5 Discussion
10.6 Conclusions
Acknowledgments
References
Chapter 11 Parallel Algorithms for Butterfly Computations
11.1 Introduction
11.2 Preliminaries
11.3 PARBUTTERFLY Framework
11.3.1 Counting Framework
11.3.1.1 Ranking
11.3.1.2 Wedge Aggregation
11.3.1.3 Butterfly Aggregation
11.3.1.4 Other Options
11.3.2 Peeling Framework
11.4 PARBUTTERFLY Algorithms
11.4.1 Preprocessing
11.4.2 Counting Algorithms
11.4.2.1 Wedge Retrieval
11.4.2.2 Per Vertex
11.4.2.3 Per Edge
11.4.3 Peeling Algorithms
11.4.3.1 Per Vertex
11.4.3.2 Per Edge
11.4.3.3 Per Vertex/Edge Storing All Wedges
11.4.4 Approximate Counting
11.4.5 Approximate Degree Ordering
11.4.6 Complement Degeneracy Ordering
11.5 Parallel Fibonacci Heap
11.5.1 Batch Insertion
11.5.2 Delete-min
11.5.3 Batch Decrease-Key
11.5.4 Application to Bucketing
11.5.4.1 Retrieving the Minimum Bucket
11.5.4.2 Updating the Bucketing Structure
11.6 Experiments
11.6.1 Environment
11.6.2 Results
11.6.2.1 Butterfly Counting
11.6.2.2 Ranking
11.6.2.3 Approximate Counting
11.6.2.4 Butterfly Peeling
11.6.3 Cache Optimization for Butterfly Counting
11.6.4 Butterfly Counting
11.6.4.1 Ranking
11.6.4.2 Approximate Counting
11.7 Related Work
11.8 Conclusion
Acknowledgments
References
SECTION IV: Models
Chapter 12 Recent Advances in Scalable Network Generation
12.1 Introduction
12.2 Graph Properties and Uses of Generators
12.2.1 Graph Properties
12.2.2 Use Cases
12.3 General Algorithmic Models and Techniques
12.3.1 Models of Computation
12.3.2 Random Permutations
12.3.3 Basic Sampling Techniques
12.3.3.1 Bernoulli Sampling
12.3.3.2 Sampling k Elements from [N]
12.3.3.3 Rejection Sampling
12.3.3.4 Weighted Sampling
12.3.4 Sampling from Huge Sets
12.4 Basic Models
12.4.1 Erdös-Rényi’s G(n,m) and Gilbert’s G(n, p) models
12.4.2 Preferential Attachment
12.4.2.1 Barabási-Albert Model
12.4.2.2 Node Copy Model
12.5 Random Spatial Graphs
12.5.1 Random Geometric Graphs
12.5.2 Random Hyperbolic Graphs
12.5.2.1 Threshold Model
12.5.2.2 Binomial Model
12.5.3 Geometric Inhomogenous Random Graphs
12.5.4 Random Planar Graphs
12.5.4.1 Random Delaunay Triangulations
12.5.4.2 Planar Graphs for Infrastructures
12.6 Random Graphs with Prescribed Degree Sequences
12.6.1 Chung-Lu
12.6.2 Configuration Model
12.6.3 Edge Switching
12.6.4 Curveball and Global Curveball
12.7 Block Models
12.7.1 Stochastic Block Model
12.7.2 R-MAT / Kronecker Graphs
12.7.3 LFR
12.7.4 CKB
12.8 Graph Replication
12.8.1 BTER
12.8.2 Darwini
12.8.3 Multilevel Generators
12.8.4 dK-Graphs
12.9 Additional Graph Types
12.9.1 Directed Graphs
12.9.2 Weighted Graphs
12.9.3 Connected Graphs
12.9.4 Regular Graphs
12.9.5 Threshold Graphs
12.10 Software Packages
12.11 Future Challenges
Acknowledgments
References
Chapter 13 Computational Models for Cascades in Massive Graphs: How to Spread a Rumor in Parallel
13.1 Introduction
13.2 Models for Rumor Diffusion and Epidemics
13.3 Cascade Dynamics in a Network
13.3.1 Independent Cascade model
13.3.2 Threshold Models
13.3.3 Complex Contagion Model
13.4 A Unified Model for Rumor Cascades
13.4.1 Analytical Model of Influence
13.4.2 Unified Model of Influence
13.4.3 Infection Probability Formula Under the Unified Model
13.4.3.1 Complexity Analysis
13.4.4 Reduction to Other Models
13.4.4.1 Independent Cascade Model
13.4.4.2 Threshold Models
13.4.4.3 Complex Contagion Model
13.4.5 Empirical Evaluation of the Unified Model
13.5 Parallel Temporal Rumor Cascade Analysis in Massive Graphs
13.6 Conclusions
Acknowledgment
References
Chapter 14 Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling
14.1 Introduction
14.1.1 A Serial Reference Implementation
14.1.2 Parallelizing Dynamic Data-Graph Computations
14.1.3 Contributions
14.1.4 Outline
14.2 Background
14.2.1 The Dag Model of Multithreading
14.2.2 Work-Span Analysis
14.2.3 Work-Stealing Runtime Systems
14.2.4 Worker-Local Storage
14.3 The PRISM Algorithm
14.3.1 Design Considerations for the Implementation of Multibags
14.4 The Multibag Data Structure
14.4.1 Implementation of MB-INSERT and MB-COLLECT
14.4.2 Analysis of Multibags
14.5 Analysis of PRISM
14.5.1 Theoretical Scalability of PRISM
14.6 Empirical evaluation
14.6.1 Experimental Setup
14.6.2 Overheads for Locking and for Chromatic Scheduling
14.6.3 Dynamic Scheduling Overhead
14.6.4 Scalability of PRISM
14.7 The Prism-R Algorithm
14.7.1 Data-Graph Computations with Global Reductions
14.7.2 PRISM-R
14.8 The multivector data structure
14.8.1 The Log-Tree Reducer
14.8.2 Analysis of Multivector Operations
14.9 Analysis and evaluation of Prism-R
14.9.1 Work-Span Analysis of Prism-R
14.9.2 Empirical Evaluation of PRISM-R
14.10 Conclusion
14.11 Acknowledgments
References
SECTION V: Frameworks and Software
Chapter 15 Graph Data Science Using Neo4j
15.1 Introduction
15.1.1 What is Graph Data Science?
15.1.2 Types of Questions for GDS
15.1.3 The Rise of GDS
15.2 Neo4j for GDS
15.2.1 Neo4j Database and Labeled Property Graph
15.2.2 Cypher Declarative Query Language
15.2.3 Neo4j GDS Library
15.2.4 Neo4j Browser and Neo4j Bloom
15.3 The GDS Journey
15.3.1 Knowledge Graphs
15.3.2 Graph Analytics
15.3.3 Graph Feature Engineering
15.3.4 Graph Embedding
15.3.5 Graph Networks
15.4 Real-World Uses of GDS
15.4.1 Prominent Commercial Uses
15.4.1.1 Drug Discovery and Patient Journeys
15.4.1.2 Recommendations and Targeted Marketing
15.4.1.3 Fraud Detection
15.5 Fraud Example
15.5.1 Query Dataset
15.5.2 Outlier Removal
15.5.3 Finding Suspicious Clusters
15.5.4 Visually Exploring a Suspicious Cluster
15.5.5 Predicting Fraudsters Using Graph Features
15.6 Conclusion
References
Chapter 16 The Parallel Boost Graph Library 2.0: Active Messages as a Spanning Model for Parallel Graph Computation
16.1 Introduction
16.2 Related
16.2.1 Active Messages
16.2.2 Parallel Graph Libraries
16.2.3 Active Messages for Graph Algorithms
16.3 Programming Model
16.3.1 Background: Active Pebbles
16.4 Design of a Graph Application Stack
16.4.1 Distributed Data Model
16.4.2 Runtime Communication Transformations
16.4.3 Separating Expression from Execution
16.4.3.1 Message Configuration
16.5 Experimental Evaluation
16.5.1 Implementation Details
16.5.2 Results
16.5.2.1 Breadth-First Search
16.5.2.2 Δ-Stepping Shortest Paths
16.5.2.3 Shiloach-Vishkin Connected Components
16.5.2.4 PageRank
16.6 Conclusion
References
Chapter 17 RAPIDS cuGraph
17.1 Introduction
17.2 Data Science and Workflow Integration
17.2.1 Ecosystem
17.2.2 Workflow Integration
17.3 Technology Stack and Software
17.3.1 Transformations and Renumbering
17.3.2 Stack
17.4 Performance results
17.4.1 Single GPU Performance
17.4.2 Multiple GPU Performance
17.5 Conclusion
Acknowledgment
References
Chapter 18 A Cloud-Based Approach to Big Graphs
18.1 Introduction
18.2 Background
18.2.1 Graph definition
18.2.2 Memory Latency Challenges
18.2.3 Scalability Challenges
18.2.4 Fault-Tolerance
18.3 Cloud-Based Approach
18.4 Related Work
18.5 A Few Words on MapReduce
18.6 Graph Representation
18.6.1 Edge list
18.6.2 Accumulo Edge Table
18.7 Breadth-First Search
18.7.1 Cloud-Based BFS
18.7.2 Accumulo Adjacency
18.7.3 Key-Space Partitioning
18.7.4 Reduce Task Count
18.8 Results and Discussion
18.8.1 Graph500 Benchmark
18.8.2 Experiments
18.8.3 Results
18.9 Conclusion
References
Chapter 19 Introduction to GraphBLAS
19.1 Adjacency Matrix
19.2 Incidence Matrix
19.3 Matrix Values
19.4 Scalar Operations
19.5 Matrix Properties
19.6 0-Element: No Graph Edge
19.7 Matrix Graph Operations
19.7.1 Building a Matrix: Edge List to Graph
19.7.2 Extracting Tuples: Graph to Vertex List
19.7.3 Transpose: Swap Out-Vertices and In-Vertices
19.7.4 Matrix Multiplication: Breadth-First-Search, and Adjacency Matrix Construction
19.7.5 Extract: Selecting Sub-graphs
19.7.6 Assign: Modifying Sub-Graphs
19.7.7 Element-Wise Addition and Element-Wise Multiplication: Combining Graphs, Intersecting Graphs, and Scaling Graphs
19.8 Performance
19.9 Conclusions
References
Chapter 20 Graphulo: Linear Algebra Graph Kernels
20.1 Introduction
20.1.1 The Graphulo Initiative
20.1.2 Outline
20.2 Associative Arrays and Graph Schemas
20.2.1 Associative Arrays
20.2.2 Graph Schemas
20.2.2.1 Adjacency Matrix
20.2.2.2 Incidence Matrix
20.2.2.3 D4M Schema
20.3 Algorithms
20.3.1 Centrality
20.3.2 Subgraph Detection and Vertex Nomination
20.3.3 Similarity
20.3.4 Community Detection
20.3.5 Graph Algorithms and Neural Networks
20.4 Graphulo Implementation
20.4.1 Graphulo
20.4.2 D4M
20.5 Discussion
20.6 Conclusions
Acknowledgments
References
Chapter 21 Interactive Graph Analytics at Scale in Arkouda
21.1 Introduction
21.2 Arkouda Framework for Data Science
21.3 Succinct Double-Index Data Structure
21.3.1 Edge Index and Vertex Index
21.3.2 Time and Space Complexity Analysis
21.3.3 Comparison with CSR
21.4 Multilocale Breadth-First Search and Triangle Counting Algorithms
21.4.1 Parallel BFS Algorithm
21.4.1.1 High-Level Multilocale BFS Algorithm
21.4.1.2 Low-level Multilocale BFS Algorithm
21.4.2 Parallel Triangle Counting Algorithm
21.5 Graph Analytics Workflow
21.5.1 Edge Oriented Sparse Graph Partitioning and Building
21.5.2 Sliding Window Stream based Sketch Building
21.5.3 Real-World Graph Distributions based Regression Model
21.6 Integration with Arkouda
21.6.1 Graph Classes Definition in Python and Chapel
21.6.2 BFS and Triangle Counting Benchmark
21.7 Experiments
21.7.1 Experimental Setup for BFS Analysis
21.7.2 Experimental Results of BFS Algorithm
21.7.2.1 Graph Building
21.7.2.2 BFS Performance
21.7.3 Experimental Setup for Triangle Counting
21.7.4 Experimental Results of Triangle Counting
21.7.4.1 Accuracy
21.7.4.2 Performance
21.8 Related Work
21.8.1 BFS Algorithm
21.8.2 Triangle Counting Algorithm
21.8.3 Graph Stream Sketch
21.8.4 Complete Graph Stream Processing Method
21.9 Conclusion
Acknowledgments
References
Index
Blank Page