Correcting Detectable Uncorrectable Errors in Memory

Grzegorz Pawelczak, Simon McIntosh-Smith

Research output: Contribution to conference; Conference Poster; peer-review

Abstract

With the expected decrease in Mean Time Between Failures (MTBF), Fault Tolerance (FT) has been identified as one of the major challenges for exascale computing. One source of faults is soft errors caused by cosmic rays, which can corrupt bits in the data held in memory. Current solutions for protection against these errors include Error Correcting Codes (ECC), which can detect and/or correct them. When an error occurs that can be detected but not corrected, a Detectable Uncorrectable Error (DUE) results, and unless checkpoint-restart is used, the system will usually fail. In our work we present a probabilistic method of correcting DUEs that occur in the part of memory where the program instructions are stored. We devise a correction technique for DUEs for the ARM A64 instruction set which combines an extended Hamming code with a Cyclic Redundancy Check (CRC) code to provide a near-100% Successful Correction Rate (SCR) of DUEs.
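The general idea of pairing an extended Hamming code with a CRC can be illustrated with a toy sketch. The abstract does not say how the poster combines the two codes, so everything below is an assumption made for illustration: a 16-bit extended Hamming (SECDED) codeword carrying 11 data bits, a CRC-32 stored per block of data words, and the hypothetical helper names `encode`, `candidates`, and `correct_due`. It does not model ARM A64 instructions. A double-bit flip produces a DUE; the sketch enumerates every double-bit correction that yields a valid codeword and keeps the candidates whose block CRC matches the stored one.

```python
import zlib

N = 16  # toy codeword: bit 0 = overall parity, bits 1/2/4/8 = Hamming parity
DATA_POS = [i for i in range(1, N) if i & (i - 1)]  # the 11 data positions


def syndrome(cw):
    """XOR of the indices of all set bits; 0 for a valid codeword."""
    s = 0
    for i in range(N):
        if (cw >> i) & 1:
            s ^= i
    return s


def odd_parity(cw):
    """True if the codeword has an odd number of set bits."""
    return bin(cw).count("1") & 1 == 1


def encode(data):
    """Encode an 11-bit data word into a 16-bit extended Hamming codeword."""
    cw = 0
    for k, pos in enumerate(DATA_POS):
        if (data >> k) & 1:
            cw |= 1 << pos
    s = syndrome(cw)
    for k in range(4):            # set Hamming parity bits so the syndrome is 0
        if (s >> k) & 1:
            cw |= 1 << (1 << k)
    if odd_parity(cw):            # set overall parity so the bit count is even
        cw |= 1
    return cw


def extract(cw):
    """Read the 11 data bits back out of a codeword."""
    data = 0
    for k, pos in enumerate(DATA_POS):
        if (cw >> pos) & 1:
            data |= 1 << k
    return data


def is_due(cw):
    """Nonzero syndrome with even overall parity: detected, not correctable."""
    return syndrome(cw) != 0 and not odd_parity(cw)


def candidates(cw):
    """Yield every valid codeword reachable by flipping two bits of cw."""
    s = syndrome(cw)
    for i in range(N):
        j = i ^ s                 # flipping bits i and j cancels the syndrome
        if i < j:
            yield cw ^ (1 << i) ^ (1 << j)


def block_crc(words):
    """CRC-32 over a block of data words (assumed stored alongside the block)."""
    return zlib.crc32(b"".join(w.to_bytes(2, "little") for w in words))


def correct_due(codewords, bad_index, stored_crc):
    """Return the candidate data words for bad_index that match the stored CRC."""
    data = [extract(cw) for cw in codewords]
    matches = []
    for cand in candidates(codewords[bad_index]):
        data[bad_index] = extract(cand)
        if block_crc(data) == stored_crc:
            matches.append(data[bad_index])
    return matches


# Usage: corrupt one codeword with a double-bit flip, then try to recover it.
block = [0x123, 0x456, 0x7AB, 0x0F0]          # arbitrary 11-bit data words
stored_crc = block_crc(block)
codewords = [encode(w) for w in block]
codewords[2] ^= (1 << 5) | (1 << 12)          # inject a double-bit error
recovered = correct_due(codewords, 2, stored_crc)
```

In this toy setting the Hamming syndrome narrows a DUE down to eight candidate codewords, and the CRC is what discriminates among them. The sketch assumes the other codewords in the block and the stored CRC are intact; a practical scheme would have to account for both.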
Original language: English
Publication status: Published - 12 Nov 2017
Event: SC17 - Colorado Convention Center, Denver, United States
Duration: 12 Nov 2017 to 17 Nov 2017
http://sc17.supercomputing.org/

Conference

Conference: SC17
Country: United States
City: Denver
Period: 12/11/17 to 17/11/17
Other: SC Conference is dedicated to showcasing work in high performance computing, networking, storage and analysis by the international HPC community.
