Authors
Xavier Besseron, Laurent Pigeon, Thierry Gautier, Samir Jafar
Abstract
Fault tolerance protocols play an important role in today's long-running scientific parallel applications, because the probability of failure can be high given the number of unreliable components involved in a simulation. In this paper we present our approach and preliminary results for a new checkpoint/recovery protocol based on a coordinated scheme. One feature of this protocol is that recovery after a fault requires only a partial restart of the other processes. The protocol is tightly coupled to the availability of an abstract representation of the execution.
Keywords
grid, fault tolerance, parallel computing, dataflow graph