Theft-Induced Checkpointing for Reconfigurable Dataflow Applications

Authors

Samir Jafar, Axel W. Krings, Thierry Gautier, and Jean-Louis Roch

Abstract

This paper defines a new checkpoint/recovery protocol, Theft-Induced Checkpointing, for dataflow computations in large heterogeneous environments. The protocol is especially useful for massively parallel multi-threaded computations, as found in cluster and grid computing, and it uses the principle of work-stealing to distribute work. By basing the state of an execution on a macro dataflow graph, the protocol is highly flexible with respect to rollback. Specifically, it allows local rollback in dynamic heterogeneous systems, even when recovery uses a different number of processors and processes. To maximize run-time efficiency, the overhead associated with checkpointing is shifted to the rollback operations whenever possible. Experimental results show that the induced overhead is very small.
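
The abstract does not spell out the protocol's mechanics, so the following is only a minimal illustrative sketch of the general idea the name suggests: a checkpoint is recorded when a theft occurs, and a rollback restores from the last record. The `Worker` class, its methods, the in-memory `checkpoints` list (a stand-in for stable storage), and the choice to checkpoint the victim right after the stolen task is removed are all assumptions made for illustration, not the authors' implementation. A real system would save the macro dataflow graph rather than a list of task names.

```python
from collections import deque


class Worker:
    """Hypothetical worker holding a deque of pending task names."""

    def __init__(self, name, tasks=()):
        self.name = name
        self.tasks = deque(tasks)
        self.checkpoints = []  # assumed stand-in for stable storage

    def checkpoint(self, reason):
        # Record a copy of the pending work together with the trigger.
        self.checkpoints.append((reason, list(self.tasks)))

    def run_one(self):
        # The owner takes the newest task from the right end of its deque.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # The thief takes the oldest task from the left end of the victim's
        # deque; the theft itself triggers a checkpoint on the victim.
        if not victim.tasks:
            return None
        task = victim.tasks.popleft()
        victim.checkpoint(f"theft by {self.name}")
        self.tasks.append(task)
        return task

    def rollback(self):
        # Local rollback: restore only this worker's pending work from its
        # most recent checkpoint, leaving other workers untouched.
        if self.checkpoints:
            _, saved = self.checkpoints[-1]
            self.tasks = deque(saved)
```

In this sketch no checkpoint is taken while a worker runs its own tasks undisturbed; the cost is paid only when a theft happens, which is one simple way to keep checkpointing off the common path.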
