Abstract
With the expected decrease in Mean Time Between Failures (MTBF), Fault Tolerance (FT) has been identified as one of the major challenges for exascale computing. One source of faults is soft errors caused by cosmic rays, which can corrupt bits in the data held in memory. Current solutions for protection against these errors include Error Correcting Codes (ECC), which can detect and/or correct them. When an error occurs that can be detected but not corrected, a Detectable Uncorrectable Error (DUE) results, and unless checkpoint-restart is used, the system will usually fail. In our work we present a probabilistic method of correcting DUEs which occur in the part of memory where the program instructions are stored. We devise a correction technique for DUEs for the ARM A64 instruction set which combines an extended Hamming code with a Cyclic Redundancy Check (CRC) code to provide a near-100% Successful Correction Rate (SCR) of DUEs.
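As a rough illustration of how an extended Hamming code and a CRC could be combined, the Python sketch below enumerates every double-bit flip consistent with a DUE syndrome and keeps the candidate whose surrounding block matches a stored CRC. It is not the authors' implementation: the (39,32) bit layout, the use of CRC-32 from `zlib` over a block of instruction words, and all function names (`encode`, `due_candidates`, `correct_with_crc`, and so on) are assumptions made for this example.

```python
import zlib

CODE_BITS = 39                      # assumed layout: bit 0 = overall parity,
PARITY_POS = [1, 2, 4, 8, 16, 32]   # bits 1..38 = Hamming(38,32)
DATA_POS = [p for p in range(1, CODE_BITS) if p not in PARITY_POS]

def encode(word):
    """Encode a 32-bit word as a 39-bit extended Hamming codeword (list of bits)."""
    bits = [0] * CODE_BITS
    for k, p in enumerate(DATA_POS):
        bits[p] = (word >> k) & 1
    for pp in PARITY_POS:
        # Each Hamming parity bit covers the positions whose index has bit pp set.
        bits[pp] = sum(bits[p] for p in range(1, CODE_BITS) if p & pp and p != pp) % 2
    bits[0] = sum(bits[1:]) % 2     # overall parity over the Hamming codeword
    return bits

def decode_data(bits):
    """Extract the 32 data bits from a codeword."""
    return sum(bits[p] << k for k, p in enumerate(DATA_POS))

def syndrome(bits):
    """Return (Hamming syndrome, overall parity); (0, 0) means no error seen."""
    s = 0
    for p in range(1, CODE_BITS):
        if bits[p]:
            s ^= p
    return s, sum(bits) % 2

def due_candidates(bits):
    """List every data word reachable by undoing one double-bit flip.

    A double-bit error leaves the overall parity even and a nonzero syndrome
    equal to the XOR of the two flipped positions (position 0 contributes 0).
    """
    s, overall = syndrome(bits)
    if s == 0 or overall != 0:
        raise ValueError("not a double-bit (DUE) pattern")
    candidates = []
    for i in range(CODE_BITS):
        j = i ^ s
        if i < j < CODE_BITS:
            fixed = bits[:]
            fixed[i] ^= 1
            fixed[j] ^= 1
            candidates.append(decode_data(fixed))
    return candidates

def block_crc(words):
    """CRC-32 over a block of 32-bit words (assumed to be stored alongside it)."""
    return zlib.crc32(b"".join(w.to_bytes(4, "little") for w in words))

def correct_with_crc(block, bad_index, bad_bits, stored_crc):
    """Return the candidate word that makes the block match the stored CRC."""
    for cand in due_candidates(bad_bits):
        trial = block[:bad_index] + [cand] + block[bad_index + 1:]
        if block_crc(trial) == stored_crc:
            return cand
    return None
```

In this sketch the Hamming syndrome narrows a DUE down to a short list of candidate words, and the CRC then acts as the tie-breaker among them. How the candidates are chosen and checked in the actual work may differ.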
Original language | English |
---|---|
Publication status | Published - 12 Nov 2017 |
Event | SC17, Colorado Convention Center, Denver, United States. Duration: 12 Nov 2017 → 17 Nov 2017. http://sc17.supercomputing.org/ |
Conference
Conference | SC17 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 12/11/17 → 17/11/17 |
Other | SC Conference is dedicated to showcasing work in high performance computing, networking, storage and analysis by the international HPC community. |