Un protocole de sauvegarde / reprise coordonné pour les applications à flot de données reconfigurables (A coordinated checkpoint/recovery protocol for reconfigurable dataflow applications)

Authors

Xavier Besseron, Laurent Pigeon, Thierry Gautier, Samir Jafar

Abstract

Fault tolerance protocols play an important role in today's long-running scientific parallel applications, because the probability of failure can be high given the number of unreliable components involved in a simulation. In this paper we present our approach and preliminary results for a new checkpoint/recovery protocol based on a coordinated scheme. One feature of this protocol is that recovery after a fault requires only a partial restart of the other processes. The protocol is tightly coupled to the availability of an abstract representation of the execution.

Keywords

grid, fault tolerance, parallel computing, dataflow graph
