A Checkpoint/Recovery Model Based on Work Stealing for Grid Applications

Authors

Rahaf Ghazal, Samir Jafar

Abstract

This study investigates fault tolerance in large distributed environments, such as computing grids and clusters, in order to find the most effective ways to handle faults caused by the crash of a machine in the environment or by a network disconnection, so that an application can continue to run in the presence of faults. In this paper we study a model of the distributed environment and of the parallel applications that run in it. We then propose a checkpoint mechanism that ensures continuity of the work. The mechanism uses a virtual representation of the application (the macro data flow) and is suited to applications that use a work-stealing algorithm to distribute tasks and that run in heterogeneous, dynamic environments. It adds a small overhead to the cost of parallel execution, because part of the work is saved during fault-free execution. The study also provides a mathematical model for calculating the time complexity, i.e. the cost, of the proposed mechanism.

Keywords

grid computing, macro data flow, work stealing, fault tolerance, checkpointing, parallel programming
