The continuous growth of high-performance computing (HPC) systems has lead to Fault Tolerance (FT) being identified as one of the major challenges for exascale computing, due to the expected decrease in Mean Time Between Failures (MTBF). One source of faults are soft errors, which can cause bit corruptions to the data held in memory. Current solutions for protection against these errors include hardware Error Correcting Codes (ECC), which incur overheads in power, memory bandwidth and storage, while also introducing more complexity to the hardware. In this paper we demonstrate Application-Based Fault Tolerance (ABFT) as an alternative method of protecting sparse matrices and dense vectors from data corruptions, requiring no additional dedicated memory storage. We use TeaLeaf, a heat conduction miniapp from the Mantevo Project, to demonstrate how these ABFT techniques can be adapted and applied to a sparse matrix solver-based application and its underlying data structures in order to improve reliability and performance.
|Conference||2017 IEEE International Conference on Cluster Computing, CLUSTER 2017|
|Period||5/09/17 → 8/09/17|
- Fault Tolerance
- Linear Sparse Matrix Solvers