Application-Based Fault Tolerance Techniques for Fully Protecting Sparse Matrix Solvers

Grzegorz Pawelczak, Simon McIntosh-Smith, James Price, Matt Martineau

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)

2 Citations (Scopus)
210 Downloads (Pure)

Abstract

The continuous growth of high-performance computing (HPC) systems has led to Fault Tolerance (FT) being identified as one of the major challenges for exascale computing, due to the expected decrease in Mean Time Between Failures (MTBF). One source of faults is soft errors, which can corrupt bits in the data held in memory. Current solutions for protection against these errors include hardware Error Correcting Codes (ECC), which incur overheads in power, memory bandwidth and storage, while also adding complexity to the hardware. In this paper we demonstrate Application-Based Fault Tolerance (ABFT) as an alternative method of protecting sparse matrices and dense vectors from data corruption, requiring no additional dedicated memory storage. We use TeaLeaf, a heat conduction miniapp from the Mantevo Project, to demonstrate how these ABFT techniques can be adapted and applied to a sparse matrix solver-based application and its underlying data structures in order to improve reliability and performance.
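As an illustration of how data might be protected without dedicated extra storage, the sketch below packs a single even-parity bit into the top bit of a 32-bit sparse-matrix column index. This is only an assumed, minimal scheme and not necessarily the one used in the paper: the names `encode_index` and `decode_index` are hypothetical, and it assumes the matrix has fewer than 2^31 columns so that the top bit of each index is otherwise unused.

```python
# Minimal sketch (assumption, not the paper's scheme): keep a parity bit in
# the spare top bit of a 32-bit column index so no extra memory is needed.

PARITY_BIT = 1 << 31          # top bit of a 32-bit word, assumed unused
INDEX_MASK = PARITY_BIT - 1   # low 31 bits hold the real column index


def parity(word):
    """Return 1 if `word` has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def encode_index(col):
    """Pack a column index and an even-parity bit into one 32-bit word."""
    if not 0 <= col <= INDEX_MASK:
        raise ValueError("column index must fit in 31 bits")
    # Set the spare bit only when needed to make the total bit count even.
    return col | (PARITY_BIT if parity(col) else 0)


def decode_index(word):
    """Return (column index, ok); ok is False when parity is violated."""
    return word & INDEX_MASK, parity(word) == 0


if __name__ == "__main__":
    stored = encode_index(7)
    print(decode_index(stored))             # intact word
    print(decode_index(stored ^ (1 << 4)))  # same word with one bit flipped
```

A single parity bit can only detect an odd number of flipped bits and cannot correct them; stronger codes would need more spare bits per element or would have to spread the code across several elements.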
Original language: English
Title of host publication: 2017 IEEE International Conference on Cluster Computing (CLUSTER 2017)
Subtitle of host publication: Proceedings of a meeting held 5-8 September 2017, Honolulu, Hawaii, USA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 733-740
Number of pages: 8
ISBN (Electronic): 9781538623268
ISBN (Print): 9781538623275
DOIs
Publication status: Published - Oct 2017
Event: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 - Honolulu, United States
Duration: 5 Sep 2017 to 8 Sep 2017

Publication series

Name
ISSN (Electronic): 2168-9253

Conference

Conference: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Country: United States
City: Honolulu
Period: 5/09/17 to 8/09/17

Keywords

  • Exascale
  • Fault Tolerance
  • Linear Sparse Matrix Solvers
  • Resilience
