Software-level Fault Tolerant Framework for Task-based Applications

Joy Yeh, George P Pawelczak, James Sewart, James Price, Amaurys Avila Ibarra, Simon McIntosh-Smith, Leonardo Bautista-Gomez, Ferad Zyulkyarov

Research output: Contribution to conference (Conference Poster, peer-reviewed)

Abstract

Fault tolerance has been identified as one of the major challenges for exascale computing. In addition to fail-stop errors, silent data corruptions (SDCs) can perturb applications and produce incorrect results. Software-based fault tolerance mechanisms have the advantage of being capable of leveraging some of the properties of the applications to improve their reliability. In this poster, we present a fault tolerance framework that implements multiple resiliency schemes to cope with both fail-stop errors and data corruption. Our techniques are tested with two real scientific applications: BUDE, a molecular docking engine, and TeaLeaf, a heat conduction code. Using this framework, we have successfully detected and recovered from real data corruptions. We have also performed error injection experiments, which clearly demonstrated the efficacy of our framework.
Original language: English
Publication status: Published - 15 Nov 2016
Event: The International Conference for High Performance Computing, Networking, Storage and Analysis - Salt Lake City, United States
Duration: 13 Nov 2016 → …
http://sc16.supercomputing.org

Conference

Conference: The International Conference for High Performance Computing, Networking, Storage and Analysis
Abbreviated title: SC'16
Country: United States
City: Salt Lake City
Period: 13/11/16 → …
