The Mean-Time-To-Failure (MTTF) is an important parameter that determines the lifetime reliability of a system. It is used in several fault-tolerant mechanisms to make critical decisions about the processor/system state. Recently it has been found that the MTTF of a system varies with environmental conditions, contrary to the earlier belief that the MTTF of electronic chips is constant. Thus there is a need for a good and fast estimate of the MTTF that can accommodate variations in environmental conditions and in the stresses on the system. This paper presents an instantaneous MTTF estimation technique that is executed at system runtime. A major contribution of this paper is a simple technique to obtain the MTTF for checkpoint interval computation in real-time systems. Our complete system model, consisting of multi-level steps, is presented as the main model for the MTTF estimation. We adopt one of the state-of-the-art solutions to obtain the aging rate parameter for the host/processor. We also propose another parameter in the MTTF computation that represents the workload and the stress factor of the running host. The results show that the differences from other MTTF estimation techniques are marginal, lying between 0.014% and 0.131%. We also show that the proposed technique captures the effect of temperature variation on the MTTF value during several simulated runtime scenarios. The proposed MTTF estimation technique has been incorporated into a lifetime reliability-aware checkpointing mechanism and has been shown to work without violating the task deadlines in any of the cases.
- Failure-rate-based checkpoint interval computation
- Lifetime reliability
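
To illustrate how a runtime MTTF estimate can feed a checkpoint interval computation, the following minimal sketch uses Young's first-order approximation, interval = sqrt(2 · C · MTTF), where C is the cost of taking one checkpoint. This formula is an assumption chosen for illustration and is not necessarily the one used in this paper; the function names, parameter names, and numeric values are hypothetical.

```python
import math


def failure_rate(mttf_hours: float) -> float:
    """Failure rate (failures per hour), assuming a constant rate of 1/MTTF."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return 1.0 / mttf_hours


def checkpoint_interval(mttf_hours: float, checkpoint_cost_hours: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    Assumption: this is a generic textbook formula, used here only to
    show how an MTTF estimate could drive the interval; it is not taken
    from the paper.
    """
    if checkpoint_cost_hours <= 0:
        raise ValueError("checkpoint cost must be positive")
    lam = failure_rate(mttf_hours)
    # sqrt(2 * C / lambda) is the same as sqrt(2 * C * MTTF)
    return math.sqrt(2.0 * checkpoint_cost_hours / lam)


if __name__ == "__main__":
    # Hypothetical values: the MTTF estimate drops as the host heats up.
    for mttf in (800.0, 200.0):
        interval = checkpoint_interval(mttf, checkpoint_cost_hours=0.01)
        print(f"MTTF = {mttf:.0f} h, checkpoint interval = {interval:.2f} h")
```

Under this approximation a lower MTTF estimate, for example at a higher operating temperature, yields a shorter interval, so checkpoints are taken more often when the host is under more stress.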