Abstract
Fault tolerance has been identified as one of the major challenges for exascale computing. In addition to fail-stop errors, silent data corruptions (SDCs) can perturb applications and produce incorrect results. Software-based fault tolerance mechanisms have the advantage of being capable of leveraging some of the properties of the applications to improve their reliability. In this poster, we present a fault tolerance framework that implements multiple resiliency schemes to cope with both fail-stop errors and data corruption. Our techniques are tested with two real scientific applications: BUDE, a molecular docking engine, and TeaLeaf, a heat conduction code. Using this framework we have successfully detected and recovered from real data corruptions. We have also performed error injection experiments, which clearly demonstrated the efficacy of our framework.
Original language | English |
---|---|
Publication status | Published - 15 Nov 2016 |
Event | The International Conference for High Performance Computing, Networking, Storage and Analysis - Salt Lake City, United States Duration: 13 Nov 2016 → … http://sc16.supercomputing.org |
Conference
Conference | The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|---|
Abbreviated title | SC'16 |
Country/Territory | United States |
City | Salt Lake City |
Period | 13/11/16 → … |
Internet address | http://sc16.supercomputing.org |