The Mean Time Between Failures (MTBF) is expected to decrease as the size and complexity of high-performance computing systems continue to increase. With this growth in the number of components, Fault Tolerance (FT) has been identified as one of the major challenges for exascale computing. One source of faults is soft errors caused by cosmic rays, which can corrupt one or more bits of the data held in memory.
Current solutions for protection against soft errors include hardware Error Correcting Codes (ECC). These have disadvantages: they consume more power, require extra memory bandwidth and storage, and add complexity to the hardware. Application-Based Fault Tolerance (ABFT) instead allows us to adapt an error detection and correction technique to the application and its data structures.
In this presentation we demonstrate ABFT techniques for sparse matrix solvers using TeaLeaf, a heat conduction mini-app from the Mantevo Project.
This software-based approach fully protects from soft errors both the sparse matrix, stored in the compressed sparse row (CSR) format, and the dense floating-point vectors. To protect the data we compare different error detection and correction methods, such as Hamming codes and CRC, and identify the trade-offs between them. Our solution requires no extra memory storage, as the redundant data for error detection and correction is stored in the least significant bits of the mantissa or in the unused bits of the integer vectors.
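To illustrate the idea of storing redundancy inside the data itself, the following is a minimal sketch rather than the scheme used in TeaLeaf. It swaps in a single even-parity bit, which is far weaker than the Hamming codes and CRC named above: it detects any single-bit flip but cannot correct it. All function names are hypothetical, and the integer example assumes column indices below 2**31 so that the top bit of a 32-bit index is unused.

```python
import struct


def double_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def parity_upper63(bits: int) -> int:
    """Parity of bits 1..63, i.e. everything except the mantissa LSB."""
    return bin(bits >> 1).count("1") & 1


def protect(x: float) -> float:
    """Overwrite the least significant mantissa bit with a parity bit.

    The value changes by at most one unit in the last place; no extra
    storage is used.
    """
    bits = double_to_bits(x)
    bits = (bits & ~1) | parity_upper63(bits)
    return bits_to_double(bits)


def check(x: float) -> bool:
    """Return True if the stored parity bit matches the other 63 bits."""
    bits = double_to_bits(x)
    return (bits & 1) == parity_upper63(bits)


def flip_bit(x: float, k: int) -> float:
    """Simulate a soft error by flipping bit k (0 = mantissa LSB)."""
    return bits_to_double(double_to_bits(x) ^ (1 << k))


def protect_index(col: int) -> int:
    """Store a parity bit in bit 31 of a 32-bit column index.

    Assumption: 0 <= col < 2**31, so bit 31 carries no index data.
    """
    if not 0 <= col < (1 << 31):
        raise ValueError("index does not leave bit 31 free")
    return col | ((bin(col).count("1") & 1) << 31)


def check_index(word: int) -> bool:
    """A protected 32-bit word always has an even number of set bits."""
    return bin(word & 0xFFFFFFFF).count("1") % 2 == 0


def index_value(word: int) -> int:
    """Strip the parity bit to recover the original column index."""
    return word & 0x7FFFFFFF
```

A solver using such a scheme would call `protect` (or `protect_index`) when writing an element and `check` (or `check_index`) when reading it. A correcting code needs more than one redundant bit, for example by spreading the check bits across several neighbouring elements.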
We also investigate the trade-off between the error detection/correction interval and the accuracy of the linear solve: performing the test and correction only once every 'n' timesteps can significantly reduce the performance overhead of the approach, at the cost of potentially performing more redundant work after an error has occurred but before it is detected.
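The interval trade-off can be made concrete with a toy model. This is a sketch under stated assumptions, not the TeaLeaf implementation: it assumes that a detected error is handled by rolling back to the last timestep that passed a check and recomputing from there, and that a single fault is injected at a chosen timestep. The function `run` and its parameters are hypothetical.

```python
def run(total_steps: int, check_interval: int, fault_step=None):
    """Simulate a timestep loop that checks data every `check_interval` steps.

    Returns (checks_performed, steps_executed). Steps executed beyond
    `total_steps` are redundant work caused by late detection.
    Assumption: recovery is a rollback to the last verified timestep.
    """
    checks = 0
    executed = 0
    step = 0
    last_verified = 0
    corrupted = False
    pending_fault = fault_step

    while step < total_steps:
        step += 1
        executed += 1

        # Inject the single fault the first time its timestep is reached.
        if pending_fault is not None and step == pending_fault:
            corrupted = True
            pending_fault = None

        # Check on every interval boundary and once more at the very end.
        if step % check_interval == 0 or step == total_steps:
            checks += 1
            if corrupted:
                step = last_verified  # discard work done on corrupted data
                corrupted = False
            else:
                last_verified = step

    return checks, executed
```

With 100 timesteps and a fault at timestep 43, checking every timestep costs 101 checks and 101 executed steps, while checking every 10 timesteps costs only 11 checks but 110 executed steps: fewer checks, more recomputation.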
In this presentation we will explain the details of these Application-Based Fault Tolerance techniques and how they combat soft errors, and present performance results on different architectures, including x86, ARM, Intel Xeon Phi (Knights Landing) and GPUs.