Abstract
Checkpointing mechanisms are used to tolerate the
impact of transient faults through a roll-back operation to a
previously saved system state. In this paper, we propose a novel
checkpointing mechanism that considers fault tolerance in a
duplex system in the presence of both transient and permanent
faults. The main objective of our proposed mechanism is to
extend the lifetime reliability of the duplex system by avoiding
or even tolerating permanent faults in microprocessors. We
also propose to migrate tasks from a 'near-to-die'
processor to a spare processor when
the current Mean-Time-To-Failure (MTTF) value is less than or
equal to a pre-determined threshold MTTF value. We validate
our proposed mechanism and perform overhead analysis using
various case studies. We then compare it with one of the most
popular existing checkpointing mechanisms, namely the roll-forward
checkpointing scheme [9]. We show that, unlike roll-back
or roll-forward mechanisms, our proposed mechanism
gives significantly higher lifetime reliability with reasonable
system overheads.
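
The migration condition stated in the abstract (migrate when the current MTTF is less than or equal to a threshold MTTF) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Processor` type, the function names, and the use of hours as the MTTF unit are assumptions made purely for illustration, and how the current MTTF is estimated is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    """Hypothetical processor record; field names are illustrative only."""
    name: str
    mttf_hours: float  # current MTTF estimate (unit assumed to be hours)


def should_migrate(current_mttf: float, threshold_mttf: float) -> bool:
    """Return True when the current MTTF is less than or equal to the threshold."""
    return current_mttf <= threshold_mttf


def select_processor(active: Processor, spare: Processor,
                     threshold_mttf: float) -> Processor:
    """Pick the processor that should run the tasks next.

    If the active processor is 'near-to-die' (its MTTF has fallen to or
    below the threshold), tasks move to the spare; otherwise they stay.
    """
    if should_migrate(active.mttf_hours, threshold_mttf):
        return spare
    return active
```

For example, with a threshold of 1000 hours, an active processor whose estimated MTTF is 800 hours would hand its tasks to the spare, while one at 5000 hours would keep them.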
Original language | English |
---|---|
Title of host publication | Lifetime Reliability-Aware Checkpointing Mechanism: Modelling and Analysis |
Publication status | Published - 2013 |