Application-Based Fault Tolerance Techniques for Fully Protecting Sparse Matrix Solvers

Grzegorz Pawelczak, Simon McIntosh-Smith, James Price, Matt Martineau

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)



The continuous growth of high-performance computing (HPC) systems has led to Fault Tolerance (FT) being identified as one of the major challenges for exascale computing, due to the expected decrease in Mean Time Between Failures (MTBF). One source of faults is soft errors, which can cause bit corruptions in the data held in memory. Current solutions for protection against these errors include hardware Error Correcting Codes (ECC), which incur overheads in power, memory bandwidth and storage, while also introducing more complexity to the hardware. In this paper, we demonstrate Application-Based Fault Tolerance (ABFT) as an alternative method of protecting sparse matrices and dense vectors from data corruptions, requiring no additional dedicated memory storage. We use TeaLeaf, a heat conduction miniapp from the Mantevo Project, to demonstrate how these ABFT techniques can be adapted and applied to a sparse matrix solver-based application and its underlying data structures in order to improve reliability and performance.
Original language: English
Title of host publication: 2017 IEEE International Conference on Cluster Computing (CLUSTER 2017)
Subtitle of host publication: Proceedings of a meeting held 5-8 September 2017, Honolulu, Hawaii, USA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 8
ISBN (Electronic): 9781538623268
ISBN (Print): 9781538623275
Publication status: Published - Oct 2017
Event: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 - Honolulu, United States
Duration: 5 Sept 2017 to 8 Sept 2017

Publication series

ISSN (Electronic): 2168-9253


Conference: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Country/Territory: United States


  • Exascale
  • Fault Tolerance
  • Linear Sparse Matrix Solvers
  • Resilience

