Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer

Leonardo Bautista-Gomez, Ferad Zyulkyarov, Osman Unsal, Simon McIntosh-Smith

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

66 Citations (Scopus)

Abstract

Supercomputers offer new opportunities for scientific computing as they grow in size. However, their growth also poses new challenges. Resilience has been recognized as one of the most pressing issues to solve for extreme scale computing. Transistor scaling in the single-digit nanometer era and power constraints might dramatically increase the failure rate of next generation machines. DRAM errors have been analyzed in the past for different supercomputers but those studies are usually based on job scheduler logs and counters produced by hardware-level error correcting codes. Consequently, little is known about errors escaping hardware checks, which lead to silent data corruption. This work attempts to fill that gap by analyzing memory errors for over a year on a cluster with about 1000 nodes featuring low-power memory without error correction. The study gathered millions of events recording detailed information of thousands of memory errors, many of them corrupting multiple bits. Several factors are analyzed, such as temporal and spatial correlation between errors, but also the influence of temperature and even the position of the sun in the sky. The study showed that most multi-bit errors corrupted non-adjacent bits in the memory word and that most errors flipped memory bits from 1 to 0. In addition, we observed thousands of cases of multiple single-bit errors occurring simultaneously in different regions of the memory. These new observations would not be possible by simply analyzing error correction counters on classical systems. We propose several directions in which the findings of this study can help the design of more reliable systems in the future.
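The abstract's bit-level findings (multi-bit errors, non-adjacent corrupted bits, 1-to-0 flips) can be made concrete with a small sketch. It is not the authors' tooling. It assumes, purely for illustration, that each logged error provides the value a memory word was expected to hold and the value actually read back. The names `FlipSummary` and `summarize_flips` and the 32-bit default word width are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlipSummary:
    flipped_positions: List[int]  # bit indices that differ (0 = least significant)
    one_to_zero: int              # number of bits that went from 1 to 0
    zero_to_one: int              # number of bits that went from 0 to 1
    multi_bit: bool               # True if more than one bit is corrupted
    adjacent: bool                # True if the corrupted bits form one contiguous run


def summarize_flips(expected: int, observed: int, width: int = 32) -> FlipSummary:
    """Compare an expected and an observed memory word (assumed inputs)."""
    mask = (1 << width) - 1
    expected &= mask
    observed &= mask

    # XOR leaves a 1 in every position where the two words disagree.
    diff = expected ^ observed
    positions = [i for i in range(width) if (diff >> i) & 1]

    # A flipped bit that was set in the expected word went 1 -> 0;
    # a flipped bit that is set in the observed word went 0 -> 1.
    one_to_zero = bin(diff & expected).count("1")
    zero_to_one = bin(diff & observed).count("1")

    # The positions are sorted, so they are contiguous exactly when the
    # distance between first and last equals the count minus one.
    multi_bit = len(positions) > 1
    adjacent = multi_bit and positions[-1] - positions[0] == len(positions) - 1

    return FlipSummary(positions, one_to_zero, zero_to_one, multi_bit, adjacent)


if __name__ == "__main__":
    # Two non-adjacent bits of an all-ones word read back as zero.
    print(summarize_flips(0xFFFFFFFF, 0xFFFFFFFA))
```

Aggregating such summaries over all logged events would give the kind of counts the abstract reports, for example the share of multi-bit errors whose corrupted bits are non-adjacent.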

Original language: English
Title of host publication: Proceedings of SC 2016
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
Pages: 645-655
Number of pages: 11
Volume: 0
Edition: 2 July 2016
ISBN (Electronic): 9781467388153
DOIs
Publication status: Published - 13 Nov 2016
Event: 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016 - Salt Lake City, United States
Duration: 13 Nov 2016 to 18 Nov 2016

Publication series

Name
ISSN (Electronic): 2167-4337

Conference

Conference: 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Country/Territory: United States
City: Salt Lake City
Period: 13/11/16 to 18/11/16
