TY - JOUR
T1 - Instantaneous Mean-Time-To-Failure (MTTF) estimation for checkpoint interval computation at run time
AU - Bandan, Mohamad Imran bin
AU - Bhattacharjee, Subhasis
AU - Jali, Suriati Khartini
AU - Pradhan, Dhiraj K.
PY - 2019/7/1
Y1 - 2019/7/1
N2 - The Mean-Time-To-Failure (MTTF) is an important parameter that determines the lifetime reliability of a system. It is used in several fault-tolerant mechanisms to make critical decisions on processor/system state. Recently it has been found that the MTTF of a system varies with the environmental conditions, contrary to the earlier belief of a constant MTTF for electronic chips. Thus there is a need for a good and fast estimate of the MTTF that can accommodate the variation of environmental conditions and the stresses on the system. This paper presents an instantaneous MTTF estimation technique to be executed at runtime of the system. A major contribution of this paper is a simple technique to obtain the MTTF for checkpoint interval computation in real-time systems. Our complete system model, consisting of multi-level steps, is presented as the main model for the MTTF estimation. We adopt one of the state-of-the-art solutions to obtain the aging rate parameter for the host/processor. We also propose another parameter in the MTTF computation that represents the workload and the stress factor of the running host. The results show that the differences from other MTTF estimation techniques are marginal, lying between 0.014% and 0.131%. We also show that the proposed technique is able to capture the effect of temperature variation on the MTTF value during several simulated runtime scenarios. The proposed MTTF estimation technique has been incorporated in the lifetime reliability-aware checkpointing mechanism and has been shown to work excellently without violating the task deadlines in all cases.
AB - The Mean-Time-To-Failure (MTTF) is an important parameter that determines the lifetime reliability of a system. It is used in several fault-tolerant mechanisms to make critical decisions on processor/system state. Recently it has been found that the MTTF of a system varies with the environmental conditions, contrary to the earlier belief of a constant MTTF for electronic chips. Thus there is a need for a good and fast estimate of the MTTF that can accommodate the variation of environmental conditions and the stresses on the system. This paper presents an instantaneous MTTF estimation technique to be executed at runtime of the system. A major contribution of this paper is a simple technique to obtain the MTTF for checkpoint interval computation in real-time systems. Our complete system model, consisting of multi-level steps, is presented as the main model for the MTTF estimation. We adopt one of the state-of-the-art solutions to obtain the aging rate parameter for the host/processor. We also propose another parameter in the MTTF computation that represents the workload and the stress factor of the running host. The results show that the differences from other MTTF estimation techniques are marginal, lying between 0.014% and 0.131%. We also show that the proposed technique is able to capture the effect of temperature variation on the MTTF value during several simulated runtime scenarios. The proposed MTTF estimation technique has been incorporated in the lifetime reliability-aware checkpointing mechanism and has been shown to work excellently without violating the task deadlines in all cases.
KW - Failure rate based checkpoint interval computation
KW - Lifetime reliability
KW - Mean-Time-To-Failure
KW - MTTF
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=85065235027&partnerID=8YFLogxK
U2 - 10.1016/j.microrel.2019.04.009
DO - 10.1016/j.microrel.2019.04.009
M3 - Article (Academic Journal)
AN - SCOPUS:85065235027
SN - 0026-2714
VL - 98
SP - 69
EP - 77
JO - Microelectronics Reliability
JF - Microelectronics Reliability
ER -