A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

 

Authors

Samir JAFAR†, Thierry GAUTIER†, Axel KRINGS‡, Jean-Louis ROCH

Abstract

This paper presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments, as found in grid or cluster computing. With the state of the computation based on a dynamic macro dataflow graph, it is shown that the mechanisms provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing for recovery on different platforms and with different numbers of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. It is shown that both methods have very small overhead and that trade-offs between checkpointing and recovery cost can be controlled.

 

 
