Fault tolerance has been identified as one of the major challenges for exascale computing. In addition to fail-stop errors, silent data corruptions (SDCs) can perturb applications and produce incorrect results. Software-based fault tolerance mechanisms have the advantage that they can exploit properties of the applications to improve their reliability. In this poster, we present a fault tolerance framework that implements multiple resiliency schemes to cope with both fail-stop errors and data corruption. Our techniques are tested with two real scientific applications: BUDE, a molecular docking engine, and TeaLeaf, a heat conduction code. Using this framework, we have successfully detected and recovered from real data corruptions. We have also performed error injection experiments, which demonstrated the efficacy of our framework.
|Publication status||Published - 15 Nov 2016|
|Event||The International Conference for High Performance Computing, Networking, Storage and Analysis - Salt Lake City, United States|
|Duration||13 Nov 2016 → …|