Instantaneous Mean-Time-To-Failure (MTTF) estimation for checkpoint interval computation at run time

Mohamad Imran bin Bandan*, Subhasis Bhattacharjee, Suriati Khartini Jali, Dhiraj K. Pradhan

*Corresponding author for this work

Research output: Contribution to journal › Article (Academic Journal) › peer-review

Abstract

The Mean-Time-To-Failure (MTTF) is an important parameter that determines the lifetime reliability of a system. Several fault-tolerant mechanisms use it to make critical decisions about processor or system state. It has recently been found that the MTTF of a system varies with environmental conditions, contrary to the earlier belief that the MTTF of electronic chips is constant. There is therefore a need for a good, fast estimate of the MTTF that accommodates variation in environmental conditions and in the stresses on the system. This paper presents an instantaneous MTTF estimation technique that is executed at system runtime. A major contribution of this paper is a simple technique for obtaining the MTTF for checkpoint interval computation in real-time systems. Our complete system model, which consists of multi-level steps, is presented as the main model for the MTTF estimation. We adopt one of the state-of-the-art solutions to obtain the aging rate parameter for the host/processor. We also propose another parameter in the MTTF computation that represents the workload and the stress factor of the running host. The results show that the differences from other MTTF estimation techniques are marginal, lying between 0.014% and 0.131%. We also show that the proposed technique captures the effect of temperature variation on the MTTF value during several simulated runtime scenarios. The proposed MTTF estimation technique has been incorporated into the lifetime reliability-aware checkpointing mechanism, and it has been shown to work without violating the task deadlines in all cases.
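
To illustrate why a runtime MTTF estimate matters for checkpointing, the sketch below uses Young's classic first-order approximation for the checkpoint interval, sqrt(2 * C * MTTF), where C is the cost of taking one checkpoint. This is an assumption made for illustration only: the abstract does not state which interval formula the paper uses, and the function name, the checkpoint cost, and the MTTF values are all hypothetical.

```python
import math


def young_interval(mttf_s: float, checkpoint_cost_s: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    Assumption: interval = sqrt(2 * C * MTTF). This is a textbook formula,
    not necessarily the one used in the paper.
    """
    if mttf_s <= 0 or checkpoint_cost_s <= 0:
        raise ValueError("mttf_s and checkpoint_cost_s must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


# Hypothetical MTTF estimates (seconds) sampled at run time, e.g. as the
# host heats up and the estimated MTTF shrinks.
for mttf in (3600.0, 1800.0, 900.0):
    interval = young_interval(mttf, checkpoint_cost_s=2.0)
    print(f"MTTF = {mttf:7.1f} s -> checkpoint every {interval:6.1f} s")
```

Under this approximation the interval scales with the square root of the MTTF, so a lower instantaneous MTTF leads to more frequent checkpoints.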

Original language: English
Pages (from-to): 69-77
Number of pages: 9
Journal: Microelectronics Reliability
Volume: 98
Early online date: 9 May 2019
Publication status: Published - 1 Jul 2019

Keywords

  • Failure rate based checkpoint interval computation
  • Lifetime reliability
  • Mean-Time-To-Failure
  • MTTF
  • Reliability
