Abstract
In this research we describe the development and optimisation of a new Monte Carlo neutral particle transport mini-app, neutral. In spite of the success of previous research efforts to load balance the algorithm at scale, it is not clear how to take advantage of the diverse architectures being installed in the newest supercomputers. We explore different algorithmic approaches, and perform extensive investigations into the performance of the application on modern hardware including Intel Xeon and Xeon Phi CPUs, POWER8 CPUs, and NVIDIA GPUs.
When applied to particle transport, the Monte Carlo method is not embarrassingly parallel, as might be expected, because dependencies on the computational mesh expose random memory access patterns. The algorithm requires the use of atomic operations, and exhibits load imbalance at the node level due to the random branching of particle histories. These algorithmic characteristics make it challenging to exploit the high memory bandwidth and FLOPS of modern HPC architectures.
Both of the parallelisation schemes discussed in this paper are dominated by the atomic operation required for tallying calculations, and suffer from latency issues caused by poor data locality. We saw a significant improvement in performance through the use of hyperthreading on all CPUs and best performance on the NVIDIA P100 GPU. A key observation is that architectures that are tolerant to latencies may be able to hide the negative properties of the algorithms.
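The two characteristics named in the abstract, scattered tally updates and uneven history lengths, can be illustrated with a minimal serial sketch. This is not the neutral implementation: the function names (`run_history`, `simulate`), the absorption probability, the half-energy deposit rule, and the uniform random cell jumps are all assumptions chosen only to keep the example small.

```python
import random


def run_history(rng, tally, absorb_prob=0.2, max_events=10_000):
    """Follow one hypothetical particle until it is absorbed.

    Returns the number of events in the history, which varies randomly
    from particle to particle (the source of load imbalance).
    """
    ncells = len(tally)
    cell = rng.randrange(ncells)
    energy = 1.0
    events = 0
    while True:
        events += 1
        deposit = 0.5 * energy
        # Scatter-add into the shared mesh tally. In a threaded version this
        # read-modify-write is the update that would have to be atomic.
        tally[cell] += deposit
        energy -= deposit
        if rng.random() < absorb_prob or events >= max_events:
            # End of history: deposit whatever energy remains.
            tally[cell] += energy
            return events
        # Assumed toy physics: jump to an unrelated cell, which mimics the
        # random memory access pattern on the computational mesh.
        cell = rng.randrange(ncells)


def simulate(nparticles, ncells, seed=0):
    """Run all histories serially; return the tally and per-particle event counts."""
    rng = random.Random(seed)
    tally = [0.0] * ncells
    events = [run_history(rng, tally) for _ in range(nparticles)]
    return tally, events


if __name__ == "__main__":
    tally, events = simulate(nparticles=1000, ncells=64, seed=1)
    print("shortest history:", min(events), "longest history:", max(events))
    print("total energy deposited:", sum(tally))
```

In this sketch every particle carries unit energy, so the tally should sum to approximately the particle count, while the spread between the shortest and longest history shows why equal-sized batches of particles do not imply equal work.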
Original language | English |
---|---|
Title of host publication | 2017 IEEE International Conference on Cluster Computing (CLUSTER 2017) |
Subtitle of host publication | Proceedings of a meeting held 5-8 September 2017, Honolulu, Hawaii, USA |
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
Pages | 498-508 |
Number of pages | 11 |
ISBN (Electronic) | 9781538623268 |
ISBN (Print) | 9781538623275 |
Publication status | Published - Oct 2017 |
Event | 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 - Honolulu, United States. Duration: 5 Sept 2017 → 8 Sept 2017 |
Publication series
Name | |
---|---|
ISSN (Electronic) | 2168-9253 |
Conference
Conference | 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 |
---|---|
Country/Territory | United States |
City | Honolulu |
Period | 5/09/17 → 8/09/17 |
Keywords
- Mini-app
- Monte Carlo particle transport
- Performance optimisation