High Performance Computing in Clouds: Moving HPC Applications to a Scalable and Cost-Effective Environment

This book provides a thorough explanation of the path to running High-Performance Computing (HPC) applications with cloud computing technologies. Besides presenting the motivation for moving HPC applications to the cloud, it covers both essential and advanced issues on this topic, such as deploying HPC applications and infrastructure, designing cloud-friendly HPC applications, and optimizing a provisioned cloud infrastructure to run this family of applications. It also describes best practices for keeping HPC applications running in the cloud by employing fault tolerance techniques and avoiding resource wastage.

To give practical meaning to the topics covered, the book presents case studies in which HPC applications from relevant scientific areas, such as Bioinformatics and the Oil and Gas industry, were moved to the cloud. It also discusses how to train deep learning models in the cloud, elucidating the key components and aspects necessary to train these models using the different types of services offered by cloud providers.

Despite the vast bibliography on cloud computing and HPC, to the best of our knowledge no existing manuscript has comprehensively covered these topics and discussed the steps, methods, and strategies for executing HPC applications in clouds. We therefore believe this title will be useful for IT professionals, students, and researchers interested in cutting-edge technologies, concepts, and insights on using cloud technologies to run HPC applications.

Author(s): Edson Borin, Lúcia Maria A. Drummond, Jean-Luc Gaudiot, Alba Melo, Maicon Melo Alves, Philippe Olivier Alexandre Navaux
Publisher: Springer
Year: 2023

Language: English
Pages: 336
City: Cham

Foreword
Preface
Contents
1 Why Move HPC Applications to the Cloud?
1.1 Book Organization
References
Part I Foundations
2 What Is Cloud Computing?
2.1 First Look at the Cloud
2.1.1 Origin
2.1.2 Definition
2.2 Benefits and Drawbacks
2.2.1 Cost Savings
2.2.2 Elasticity
2.2.3 Drawbacks
2.3 Service and Delivery Models
2.3.1 Service Models
2.3.2 Delivery Models
2.4 Virtualization and Containers Technologies
2.4.1 Virtualization
2.4.2 Containers
2.5 Final Remarks
References
3 What Do HPC Applications Look Like?
3.1 About High-Performance Computing and Its Way So Far
3.1.1 Concept and Motivations
3.1.2 Evolution of HPC Systems
3.1.3 Graphics Processing Unit as the Main HPC Accelerator
3.1.4 Overview of Current HPC Systems and Associated Concerns
3.2 Design and Performance
3.2.1 Methodology for the Design of HPC Applications
3.2.2 Synopsis of HPC Programming
3.2.3 Critical Numerical and Performance Challenges
3.2.4 About Parallel Efficiency
3.3 Two Examples of HPC Applications
3.3.1 Lattice Quantum ChromoDynamics (LQCD)
3.3.2 High-Resolution Seismic Imaging
3.4 HPC and Cloud Computing
References
Part II Running HPC Applications in Cloud
4 Deploying and Configuring Infrastructure
4.1 Introduction
4.2 Key Infrastructure Elements
4.2.1 Virtual Machines
4.2.1.1 Virtual Machine Images
4.2.2 Regions, Availability Zones, and Placement Strategies
4.2.3 Tenancy
4.2.4 Storage Services
4.2.5 Virtual Private Cloud Networks
4.3 Overview of a Cloud-Based HPC Cluster
4.3.1 Cost and Performance of Cloud-Based HPC Clusters
4.4 Deploying Infrastructure on the IaaS Model
4.4.1 GUI and Command-Line Interface Tools
4.4.2 Infrastructure as Code
4.4.3 IaC Tools for Cloud HPC-Cluster-Like Environments
4.5 Considerations About Selecting Resources and Tools to Deploy HPC Systems on the Cloud
References
5 Executing Traditional HPC Application Code in Cloud with Containerized Job Schedulers
5.1 Introduction
5.1.1 Foreword
5.1.2 Chapter Organization
5.2 Change Nothing at the Application Level but a Little at the Cloud Orchestrator Level
5.2.1 Introduction
5.2.2 Elements of Vocabulary and Essential Definitions
5.2.2.1 Basic Vocabulary Regarding the Notion of HPC Jobs and HPC Job Schedulers
5.2.2.2 Overview of Containers and Cloud Orchestrator
5.2.2.3 Overview of Kubernetes, Slurm, OAR and OpenPBS
5.2.3 Related Works
5.2.4 Challenges, Issues, and Solutions
5.2.4.1 Motivation
5.2.4.2 Propositions
5.2.4.3 Containerized HPC Schedulers
5.2.4.4 Dynamic Containerization of HPC Clusters
5.2.4.5 Impact on Pending Jobs
5.2.4.6 Impact on Running Jobs
5.2.4.7 Towards a General Methodology to Containerize HPC Job Schedulers
5.2.5 Summary of the Discussion
5.3 Adding a Mechanism for Autoscaling for Containerized HPC Schedulers
5.3.1 Introduction
5.3.2 Related Works and Positioning
5.3.3 Challenge and Issues for Auto Scaling Mechanisms with OAR
5.3.4 Summary of the Discussion
5.4 Conclusion
References
6 Designing Cloud-Friendly HPC Applications
6.1 Introduction
6.2 Exploring Cloud Features and Capabilities Through the Lens of HPC Demands
6.3 Analyzing HPC Models to Write Cloud-Friendly Applications
6.4 Loosely-Coupled HPC Applications for Cloud
6.4.1 Bag-of-Tasks
6.4.2 Master-Slave
6.4.3 Pipeline
6.4.4 Divide-and-Conquer
6.5 Tightly-Coupled HPC Applications for Cloud
6.5.1 Bulk-Synchronous Parallel
6.6 Discussion and Open Challenges on HPC-Oriented Cloud Applications
6.7 Conclusion
References
7 Exploiting Hardware Accelerators in Clouds
7.1 Introduction
7.2 Accelerator Optimized Instances on the Cloud
7.2.1 GPUs: Graphic Processing Units
7.2.2 TPUs: Tensor Processing Units
7.2.3 FPGAs: Field-Programmable Gate Arrays
7.2.4 Other Cloud Providers' Accelerators and AI Processors
7.3 Programming for Cloud Accelerators
7.3.1 Amazon Web Services (AWS)
7.3.2 Google Cloud Platform (GCP)
7.3.3 Microsoft Azure
7.4 Influence of Accelerators in IoT and Edge Computing
7.5 Final Remarks
References
Part III Cost and Performance Optimizations
8 Optimizing Infrastructure for MPI Applications
8.1 Fundamentals of MPI
8.2 Interconnection Networks for MPI Environments
8.3 Cloud Facilities for MPI Applications
8.4 Executing an MPI Job in the Cloud
8.5 Optimizing the Performance of MPI Applications on the Cloud
8.6 Conclusions
References
9 Harnessing Low-Cost Virtual Machines on the Spot
9.1 Introduction
9.2 Spot VMs
9.2.1 Using Hibernation-Prone Spot VMs in BoT Applications
9.3 Reducing Monetary Costs Within Markets
9.3.1 Instances Galore and the Paradox of Choice
9.3.2 Choosing the ``Right'' Instance May Not Be Enough
9.4 Burstable Virtual Machines
9.5 Conclusions and Future Directions
References
10 Ensuring Application Continuity with Fault Tolerance Techniques
10.1 Introduction
10.2 Fault Tolerance
10.2.1 Failure Detection
10.2.2 Checkpointing
10.2.3 Replication
10.2.4 Fault Tolerant MPI
10.2.5 Fault Tolerance in HPC Applications
10.3 Fault Tolerance in Clouds
10.3.1 Failure Detectors in Clouds
10.3.2 Implementing Checkpoints in Cloud
10.3.2.1 Bag-of-Tasks Applications
10.3.3 Reliable Cloud Storage Solutions
10.3.3.1 Choice of the Storage Service
10.3.4 Replication
10.3.5 Fault Tolerance and Preemptible VMs
10.4 Conclusion and Future Directions
References
11 Avoiding Resource Wastage
11.1 Introduction
11.2 HPC Workload Characteristics and Resource Wastage
11.2.1 Typical HPC Workloads
11.2.2 Sources of Resource Wastage in HPC Cloud
11.2.3 Resource Management
11.3 Strategies to Detect and Prevent Resource Wastage
11.3.1 Metrics to Detect Resource Wastage
11.3.2 Resource Optimisation Strategies
11.3.3 Research Challenges
11.4 Conclusions
References
Part IV Application Study Cases
12 Biological Sequence Comparison on Cloud-Based GPU Environment
12.1 Introduction
12.2 Amazon Web Services
12.2.1 Overview
12.2.2 GPU Instances on AWS
12.2.3 Application Execution on AWS
12.2.4 High-Performance Computing on AWS
12.2.4.1 Fault Tolerance
12.2.4.2 Application Isolation
12.3 Case Study: Biological Sequence Comparison Application
12.3.1 Overview
12.3.2 Reducing the Monetary Costs
12.3.3 Reducing the Execution Time
12.4 Experimental Results
12.4.1 Reducing the Monetary Costs
12.4.2 Reducing the Execution Time
12.4.3 Discussion
12.5 Conclusions
References
13 Reservoir Simulation in the Cloud
13.1 Introduction
13.2 Reservoir Simulation Overview
13.2.1 Reservoir Simulation Software
13.2.2 Reservoir Simulation Challenges
13.3 Cloud Advantages and Challenges for the O&G Industry
13.4 Cloud Deployment Case Study of Reservoir Simulation
13.5 Conclusions and Future Trends
References
14 Cost Effective Deep Learning on the Cloud
14.1 Introduction
14.2 Key Deep Learning Concepts
14.2.1 Training Deep Learning Models
14.2.2 Model Partitioning Strategies for Distributed Training
14.3 Training Deep Learning Models in the Cloud
14.3.1 Services for Deep Learning in the Cloud
14.3.2 Training with IaaS
14.3.3 Training with SageMaker
14.4 Optimizing Cost and Training Time
14.4.1 Study Case: Medical Image Segmentation with MONAI
14.4.2 Searching for a Cost-Efficient Infrastructure
14.4.3 Selecting Efficient VM Types on EC2 and SageMaker
14.4.4 Exploring Cost and Training Time with Distributed Training
14.4.5 Reducing the Cost with Preemptible VMs
14.5 Final Considerations
References
A Deploying an HPC Cluster on AWS
A.1 Deploying Infrastructure Using the Web Console
A.1.1 Creating the VPC Network
A.1.2 Creating a Shared File System Using the AWS Elastic File System (EFS)
A.1.3 Instantiating Virtual Machines
A.2 Deploying Infrastructure Using the AWS Command-Line Interface
A.3 Deploying Infrastructure Using Ansible
B Configuring a Cloud-Deployed HPC Cluster
B.1 Introduction
B.2 Configuring the Cluster Using the Command-Line Interface
B.2.1 Mounting the EFS File System
B.2.2 Configuring SSH for Password-Less Connections
B.2.3 Installing and Configuring MUNGE
B.2.4 Installing and Configuring SLURM
B.3 Configuring the Cluster Using Ansible
B.3.1 Creating the Playbook Inventory
B.3.2 Configuring the HPC Cluster
B.3.3 Executing the Playbook
B.4 Submitting Jobs on the HPC Cluster