Fault-Tolerance for Macro Dataflow Parallel Computations on Grid

Authors

Samir JAFAR∗ and Jean-Louis ROCH

Abstract

Large-scale cluster and grid computing systems gather thousands of nodes to run parallel applications. At this scale, component failures and disconnections are a normal part of operation, and applications must deal more directly with repeated failures during program runs. In this paper, we present a portable fault-tolerance mechanism for executing macro dataflow parallel programs on a large-scale, distributed, heterogeneous grid that includes SMP nodes. Our mechanism is based on portable checkpoint-rollback and supports both parallel programs with dependencies and the addition or resilience of heterogeneous resources. We have implemented this mechanism on top of the Athapascan programming interface, and we present experimental results.

 
