Many-Core Computing: Hardware and software

Computing has moved away from performance-centric serial computation and towards energy-efficient parallel computation. This shift provides continued performance increases without raising clock frequencies, and it addresses the thermal and power limitations of the dark-silicon era. As the number of parallel cores increases, we transition into the many-core computing era. There is considerable interest in developing methods, tools, architectures and applications to support many-core computing.

The primary aim of this edited book is to provide a timely and coherent account of recent advances in many-core computing research. Starting with programming models, operating systems and their applications, the authors present runtime management techniques, followed by system modelling, verification and testing methods, and architectures and systems. The book ends with examples of innovative applications.

Author(s): Bashir M. Al-Hashimi, Geoff V. Merrett
Series: IET Professional Applications of Computing Series, 22
Publisher: The Institution of Engineering and Technology
Year: 2019

Language: English
Pages: 601
City: London

Cover
Contents
Preface
Part I Programming models, OS and applications
1 HPC with many-core processors
1.1 MPI+OmpSs interoperability
1.2 The interposition library
1.3 Implementation of the MPI+OmpSs interoperability
1.4 Solving priority inversion
1.5 Putting it all together
1.6 Machine characteristics
1.7 Evaluation of NTChem
1.7.1 Application analysis
1.7.2 Parallelization approach
1.7.3 Performance analysis
1.8 Evaluation with Linpack
1.8.1 Application analysis
1.8.2 Parallelization approach
1.8.3 Performance analysis
1.9 Conclusions and future directions
Acknowledgments
References
2 From irregular heterogeneous software to reconfigurable hardware
2.1 Outline
2.2 Background
2.2.1 OpenCL's hierarchical programming model
2.2.2 Executing OpenCL kernels
2.2.3 Work-item synchronisation
2.3 The performance implications of mapping atomic operations to reconfigurable hardware
2.4 Shared virtual memory
2.4.1 Why SVM?
2.4.2 Implementing SVM for CPU/FPGA systems
2.4.3 Evaluation
2.5 Weakly consistent atomic operations
2.5.1 OpenCL's memory consistency model
2.5.1.1 Executions
2.5.1.2 Consistent executions
2.5.1.3 Data races
2.5.2 Consistency modes
2.5.2.1 The acquire and release consistency modes
2.5.2.2 The seq-cst consistency mode
2.5.2.3 The relaxed consistency mode
2.5.3 Memory scopes
2.5.4 Further reading
2.6 Mapping weakly consistent atomic operations to reconfigurable hardware
2.6.1 Scheduling constraints
2.6.2 Evaluation
2.7 Conclusion and future directions
Acknowledgements
References
3 Operating systems for many-core systems
3.1 Introduction
3.1.1 Many-core architectures
3.1.2 Many-core programming models
3.1.3 Operating system challenges
3.2 Kernel-state synchronization bottleneck
3.3 Non-uniform memory access
3.4 Core partitioning and management
3.4.1 Single OS approaches
3.4.2 Multiple OS approaches
3.5 Integration of heterogeneous computing resources
3.6 Reliability challenges
3.6.1 OS measures against transient faults
3.6.2 OS measures against permanent faults
3.7 Energy management
3.7.1 Hardware mechanisms
3.7.2 OS-level power management
3.7.3 Reducing the algorithmic complexity
3.8 Conclusions and future directions
References
4 Decoupling the programming model from resource management in throughput processors
4.1 Introduction
4.2 Background
4.3 Motivation
4.3.1 Performance variation and cliffs
4.3.2 Portability
4.3.3 Dynamic resource underutilization
4.3.4 Our goal
4.4 Zorua: our approach
4.4.1 Challenges in virtualization
4.4.2 Key ideas of our design
4.4.2.1 Leveraging software annotations of phase characteristics
4.4.2.2 Control with an adaptive runtime system
4.4.3 Overview of Zorua
4.5 Zorua: detailed mechanism
4.5.1 Key components in hardware
4.5.2 Detailed walkthrough
4.5.3 Benefits of our design
4.5.4 Oversubscription decisions
4.5.5 Virtualizing on-chip resources
4.5.5.1 Virtualizing registers and scratchpad memory
4.5.5.2 Virtualizing thread slots
4.5.6 Handling resource spills
4.5.7 Supporting phases and phase specifiers
4.5.8 Role of the compiler and programmer
4.5.9 Implications to the programming model and software optimization
4.5.9.1 Flexible programming models for GPUs and heterogeneous systems
4.5.9.2 Virtualization-aware compilation and auto-tuning
4.5.9.3 Reduced optimization space
4.6 Methodology
4.6.1 System modeling and configuration
4.6.2 Evaluated applications and metrics
4.7 Evaluation
4.7.1 Effect on performance variation and cliffs
4.7.2 Effect on performance
4.7.3 Effect on portability
4.7.4 A deeper look: benefits and overheads
4.8 Other applications
4.8.1 Resource sharing in multi-kernel or multi-programmed environments
4.8.2 Preemptive multitasking
4.8.3 Support for other parallel programming paradigms
4.8.4 Energy efficiency and scalability
4.8.5 Error tolerance and reliability
4.8.6 Support for system-level tasks on GPUs
4.8.7 Applicability to general resource management in accelerators
4.9 Related work
4.10 Conclusion and future directions
Acknowledgments
References
5 Tools and workloads for many-core computing
5.1 Single-chip multi/many-core systems
5.1.1 Tools
5.1.2 Workloads
5.2 Multi-chip multi/many-core systems
5.2.1 Tools
5.2.2 Workloads
5.3 Discussion
5.4 Conclusion and future directions
5.4.1 Parallelization of real-world applications
5.4.2 Domain-specific unification of workloads
5.4.3 Unification of simulation tools
5.4.4 Integration of tools to real products
References
6 Hardware and software performance in deep learning
6.1 Deep neural networks
6.2 DNN convolution
6.2.1 Parallelism and data locality
6.2.2 GEMM-based convolution algorithms
6.2.3 Fast convolution algorithms
6.3 Hardware acceleration and custom precision
6.3.1 Major constraints of embedded hardware CNN accelerators
6.3.2 Reduced precision CNNs
6.3.3 Bit slicing
6.3.4 Weight sharing and quantization in CNNs
6.3.5 Weight shared with parallel accumulate shared MAC (PASM)
6.3.6 Reduced precision in software
6.4 Sparse data representations
6.4.1 L1-norm loss function
6.4.2 Network pruning
6.4.2.1 Fine pruning
6.4.2.2 Coarse pruning
6.4.2.3 Discussion
6.5 Program generation and optimization for DNNs
6.5.1 Domain-specific compilers
6.5.2 Selecting primitives
6.6 Conclusion and future directions
Acknowledgements
References
Part II Runtime management
7 Adaptive–reflective middleware for power and energy management in many-core heterogeneous systems
7.1 The adaptive–reflective middleware framework
7.2 The reflective framework
7.3 Implementation and tools
7.3.1 Offline simulator
7.4 Case studies
7.4.1 Energy-efficient task mapping on heterogeneous architectures
7.4.2 Design space exploration of novel HMPs
7.4.3 Extending the lifetime of mobile devices
7.5 Conclusion and future directions
Acknowledgments
References
8 Advances in power management of many-core processors
8.1 Parallel ultra-low power computing
8.1.1 Background
8.1.2 PULP platform
8.1.3 Compact model
8.1.4 Process and temperature compensation of ULP multi-cores
8.1.4.1 Compensation of process variation
8.1.5 Experimental results
8.2 HPC architectures and power management systems
8.2.1 Supercomputer architectures
8.2.2 Power management in HPC systems
8.2.2.1 Linux power management driver
8.2.3 Hardware power controller
8.2.4 The power capping problem in MPI applications
References
9 Runtime thermal management of many-core systems
9.1 Thermal management of many-core embedded systems
9.1.1 Uncertainty in workload estimation
9.1.2 Learning-based uncertainty characterization
9.1.2.1 Multinomial logistic regression model
9.1.2.2 Maximum likelihood estimation
9.1.2.3 Uncertainty interpretation
9.1.3 Overall design flow
9.1.4 Early evaluation of the approach
9.1.4.1 Impact of workload uncertainty: H.264 case study
9.1.4.2 Thermal improvement considering workload uncertainty
9.2 Thermal management of 3D many-core systems
9.2.1 Recent advances on 3D thermal management
9.2.2 Preliminaries
9.2.2.1 Application model
9.2.2.2 Multiprocessor platform model
9.2.2.3 3D IC model
9.2.3 Thermal-aware mapping
9.2.3.1 Thermal profiling
9.2.3.2 Runtime
9.2.3.3 Application merging
9.2.3.4 Resource allocation
9.2.3.5 Throughput computation
9.2.3.6 Utilization minimization
9.2.4 Experimental results
9.2.4.1 Benchmark applications
9.2.4.2 Target 3D many-core system
9.2.4.3 Temperature simulation
9.2.4.4 Interconnect energy computation
9.2.4.5 Thermal profiling results
9.2.4.6 Benchmark application results
9.2.4.7 Case-study for real-life applications
9.3 Conclusions and future directions
References
10 Adaptive packet processing on CPU–GPU heterogeneous platforms
10.1 Background on GPU computing
10.1.1 GPU architecture
10.1.2 Performance considerations
10.1.3 CPU–GPU heterogeneous platforms
10.2 Packet processing on the GPU
10.2.1 Related work
10.2.2 Throughput vs latency dilemma
10.2.3 An adaptive approach
10.2.4 Offline building of the batch-size table
10.2.5 Runtime batch size selection
10.2.6 Switching between batch sizes
10.3 Persistent kernel
10.3.1 Persistent kernel challenges
10.3.2 Proposed software architecture
10.4 Case study
10.4.1 The problem of packet classification
10.4.2 The tuple space search (TSS) algorithm
10.4.3 GPU-based TSS algorithm
10.4.4 TSS persistent kernel
10.4.5 Experimental results
10.5 Conclusion and future directions
References
11 From power-efficient to power-driven computing
11.1 Computing is evolving
11.2 Power-driven computing
11.2.1 Real-power computing
11.2.1.1 Hard real-power computing
11.2.1.2 Soft real-power computing
11.2.2 Performance constraints in power-driven systems
11.3 Design-time considerations
11.3.1 Power supply models and budgeting
11.3.2 Power-proportional systems design
11.3.2.1 Computation tasks
11.3.2.2 Communication tasks
11.3.3 Power scheduling and optimisation
11.4 Run-time considerations
11.4.1 Adapting to power variations
11.4.2 Dynamic retention
11.5 A case study of power-driven computing
11.6 Existing research
11.7 Research challenges and opportunities
11.7.1 Power-proportional many-core systems
11.7.2 Design flow and automation
11.7.3 On-chip sensing and controls
11.7.4 Software and programming model
11.8 Conclusion and future directions
References
Part III System modelling, verification, and testing
12 Modelling many-core architectures
12.1 Introduction
12.2 Scale-out vs. scale-up
12.3 Modelling scale-out many-core
12.3.1 CPR model
12.3.2 α Model
12.4 Modelling scale-up many-core
12.4.1 PIE model
12.4.2 β Model
12.5 The interactions between scale-out and scale-up
12.5.1 Φ Model
12.5.2 Investigating the orthogonality assumption
12.6 Power efficiency model
12.6.1 Power model
12.6.2 Model calculation
12.7 Runtime management
12.7.1 MAX-P: performance-oriented scheduling
12.7.2 MAX-E: power efficiency-oriented scheduling
12.7.3 The overview of runtime management
12.8 Conclusion and future directions
Acknowledgements
References
13 Power modelling of multicore systems
13.1 CPU power consumption
13.2 CPU power management and energy-saving techniques
13.3 Approaches and applications
13.3.1 Power measurement
13.3.2 Top-down approaches
13.3.3 Circuit, gate, and register-transfer level approaches
13.3.4 Bottom-up approaches
13.4 Developing top-down power models
13.4.1 Overview of methodology
13.4.2 Data collection
13.4.2.1 Power and voltage measurements
13.4.2.2 PMC event collection
13.4.3 Multiple linear regression basics
13.4.4 Model stability
13.4.5 PMC event selection
13.4.6 Model formulation
13.4.7 Model validation
13.4.8 Thermal compensation
13.4.9 CPU voltage regulator
13.5 Accuracy of bottom-up power simulators
13.6 Hybrid techniques
13.7 Conclusion and future directions
References
14 Developing portable embedded software for multicore systems through formal abstraction and refinement
14.1 Introduction
14.2 Motivation
14.2.1 From identical formal abstraction to specific refinements
14.2.2 From platform-independent formal model to platform-specific implementations
14.3 RTM cross-layer architecture overview
14.4 Event-B
14.4.1 Structure and notation
14.4.1.1 Context structure
14.4.1.2 Machine structure
14.4.2 Refinement
14.4.3 Proof obligations
14.4.4 Rodin: event-B tool support
14.5 From identical formal abstraction to specific refinements
14.5.1 Abstraction
14.5.2 Learning-based RTM refinements
14.5.3 Static decision-based RTM refinements
14.6 Code generation and portability support
14.7 Validation
14.8 Conclusion and future directions
References
15 Self-testing of multicore processors
15.1 General-purpose multicore systems
15.1.1 Taxonomy of on-line fault detection methods
15.1.2 Non-self-test-based methods
15.1.3 Self-test-based methods
15.1.3.1 Hardware-based self-testing
15.1.3.2 Software-based self-testing
15.1.3.3 Hybrid self-testing methods (hardware/software)
15.2 Processors-based systems-on-chip testing flows and techniques
15.2.1 On-line testing of CPUs
15.2.1.1 SBST test library generation constraints
15.2.1.2 Execution management of the SBST test program
15.2.1.3 Comparison of SBST techniques for in-field test program development
15.2.2 On-line testing of application-specific functional units
15.2.2.1 Floating-point unit
15.2.2.2 Test for FPU
15.2.2.3 Direct memory access
15.2.2.4 Error correction code
15.3 Conclusion and future directions
References
16 Advances in hardware reliability of reconfigurable many-core embedded systems
16.1 Background
16.1.1 Runtime reconfigurable processors
16.1.2 Single event upset
16.1.3 Fault model for soft errors
16.1.4 Concurrent error detection in FPGAs
16.1.5 Scrubbing of configuration memory
16.2 Reliability guarantee with adaptive modular redundancy
16.2.1 Architecture for dependable runtime reconfiguration
16.2.2 Overview of adaptive modular redundancy
16.2.3 Reliability of accelerated functions (AFs)
16.2.4 Reliability guarantee of accelerated functions
16.2.4.1 Maximum resident time
16.2.4.2 Acceleration variants selection
16.2.4.3 Non-uniform accelerator scrubbing
16.2.5 Reliability guarantee of applications
16.2.5.1 Effective critical bits of accelerators
16.2.5.2 Reliability of accelerated kernels
16.2.5.3 Effective critical bits of accelerated kernels and applications
16.2.5.4 Budgeting of effective critical bits
16.2.5.5 Budgeting for kernels
16.2.5.6 Budgeting for accelerated functions
16.2.6 Experimental evaluation
16.3 Conclusion and future directions
Acknowledgements
References
Part IV Architectures and systems
17 Manycore processor architectures
17.1 Introduction
17.2 Classification of manycore architectures
17.2.1 Homogeneous
17.2.2 Heterogeneous
17.2.3 GPU enhanced
17.2.4 Accelerators
17.2.5 Reconfigurable
17.3 Processor architecture
17.3.1 CPU architecture
17.3.1.1 Core pipeline
17.3.1.2 Branch prediction
17.3.1.3 Data parallelism
17.3.1.4 Multi-threading
17.3.2 GPU architecture
17.3.2.1 Unified shading architecture
17.3.2.2 Single instruction multiple thread (SIMT) execution model
17.3.3 DSP architecture
17.3.4 ASIC/accelerator architecture
17.3.5 Reconfigurable architecture
17.4 Integration
17.5 Conclusion and future directions
17.5.1 CPU
17.5.2 Graphics processing units
17.5.3 Accelerators
17.5.4 Field programmable gate array
17.5.5 Emerging architectures
References
18 Silicon photonics enabled rack-scale many-core systems
18.1 Introduction
18.2 Related work
18.3 RSON architecture
18.3.1 Architecture overview
18.3.2 ONoC design
18.3.3 Internode interface
18.3.4 Bidirectional and sharable optical transceiver
18.4 Communication flow and arbitration
18.4.1 Communication flow
18.4.2 Optical switch control scheme
18.4.3 Channel partition
18.4.4 ONoC control subsystem
18.5 Evaluations
18.5.1 Performance evaluation
18.5.2 Interconnection energy efficiency
18.5.3 Latency analysis
18.6 Conclusions and future directions
References
19 Cognitive I/O for 3D-integrated many-core system
19.1 Introduction
19.2 Cognitive I/O architecture for 3D memory-logic integration
19.2.1 System architecture
19.2.2 QoS-based I/O management problem formulation
19.3 I/O QoS model
19.3.1 Sparse representation theory
19.3.2 Input data dimension reduction by projection
19.3.3 I/O QoS optimization
19.3.4 I/O QoS cost function
19.4 Communication-QoS-based management
19.4.1 Cognitive I/O design
19.4.2 Simulation results
19.4.2.1 Experiment setup
19.4.2.2 Adaptive tuning by cognitive I/O
19.4.2.3 Adaptive I/O control by accelerated
19.5 Performance-QoS-based management
19.5.1 Dimension reduction
19.5.2 DRAM partition
19.5.3 Error tolerance
19.5.4 Feature preservation
19.5.5 Simulation results
19.6 Hybrid QoS-based management
19.6.1 Hybrid management via memory (DRAM) controller
19.6.2 Communication-QoS result
19.6.3 Performance-QoS result
19.7 Conclusion and future directions
References
20 Approximate computing across the hardware and software stacks
20.1 Introduction
20.2 Component-level approximations for adders and multipliers
20.2.1 Approximate adders
20.2.1.1 Low-power approximate adders
20.2.1.2 Low-latency approximate adders
20.2.2 Approximate multipliers
20.3 Probabilistic error analysis
20.3.1 Empirical vs. analytical methods
20.3.2 Accuracy metrics
20.3.3 Probabilistic analysis methodology
20.4 Accuracy configurability and adaptivity in approximate computing systems
20.4.1 Approximate accelerators with consolidated error correction
20.4.2 Adaptive datapaths
20.5 Multi-accelerator approximate computing architectures
20.5.1 Case study: an approximate accelerator architecture for High Efficiency Video Coding (HEVC)
20.6 Approximate memory systems and run-time management
20.6.1 Methodology for designing approximate memory systems
20.6.2 Case study: an approximation-aware multilevel cells cache architecture
20.7 A cross-layer methodology for designing approximate systems and the associated challenges
20.8 Conclusion
References
21 Many-core systems for big-data computing
21.1 Workload characteristics
21.2 Many-core architectures for big data
21.2.1 The need for many-core
21.2.2 Brawny vs wimpy cores
21.2.3 Scale-out processors
21.2.4 Barriers to implementation
21.3 The memory system
21.3.1 Caching and prefetching
21.3.2 Near-data processing
21.3.3 Non-volatile memories
21.3.4 Memory coherence
21.3.5 On-chip networks
21.4 Programming models
21.5 Case studies
21.5.1 Xeon Phi
21.5.2 Tilera
21.5.3 Piranha
21.5.4 Niagara
21.5.5 Adapteva
21.5.6 TOP500 and GREEN500
21.6 Other approaches to high-performance big data
21.6.1 Field-programmable gate arrays
21.6.2 Vector processing
21.6.3 Accelerators
21.6.4 Graphics processing units
21.7 Conclusion and future directions
21.7.1 Programming models
21.7.2 Reducing manual effort
21.7.3 Suitable architectures and microarchitectures
21.7.4 Memory-system advancements
21.7.5 Replacing commodity hardware
21.7.6 Latency
21.7.7 Workload heterogeneity
References
22 Biologically-inspired massively-parallel computing
22.1 In the beginning…
22.2 Where are we now?
22.3 So what is the problem?
Microchip technology
Computer architecture
Deep networks
22.4 Biology got there first
Observations of biological systems
22.5 Bioinspired computer architecture
22.6 SpiNNaker – a spiking neural network architecture
22.6.1 SpiNNaker chip
22.6.2 SpiNNaker router
22.6.3 SpiNNaker board
22.6.4 SpiNNaker machines
22.7 SpiNNaker applications
22.7.1 Biological neural networks
22.7.2 Artificial neural networks
22.7.3 Other application domains
22.8 Conclusion and future directions
Acknowledgements
References
Index