Search results for: checkpointing

Number of results: 2665

Journal: IEEE Trans. Computers 2001
Yibei Ling Jie Mi Xiaola Lin

Checkpointing is an effective fault-tolerant technique for improving system availability and reliability. However, blind checkpoint placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency with a general failure rate, with the objective of globally...

2012
Vaithiyanathan Sundaram

Improving fault tolerance within clusters has become vital because of the drastic decrease in the Mean Time Between Failures (MTBF) in complex clusters. Checkpointing is one of the robust ways to improve fault tolerance by rolling back to a saved state in the event of a failure in the cluster. Since checkpointing primarily relies on storage devices for storing the states at regular inter...

Journal: Journal of Parallel and Distributed Computing 2014

Journal: Journal of Computer and Robotics

Mobile computing systems are made up of different components, among which mobile support stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs, and as a result the workload of mobile ...

2007
Taesoon Park Heon Y. Yeom

This paper presents a new checkpointing coordination scheme which utilizes the communication pattern of the cooperating processes. In the proposed scheme, checkpointing is coordinated for a limited number of processes, based on information about the communication pattern of the target program. Unlike previous solutions, which do not utilize the communication pattern, it is possi...

2015
Guillaume Aupy Yves Robert

In this chapter, we present scheduling algorithms to cope with faults on large-scale parallel platforms. We study checkpointing and show how to derive the optimal checkpointing period. Then we explain how to combine checkpointing with fault prediction, and discuss how the optimal period is modified when this combination is used. Finally, we follow the very same approach for the combination o...
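As an illustration of what "deriving the optimal checkpointing period" can look like, the sketch below uses the classical first-order approximation usually attributed to Young, not necessarily the derivation given in the chapter above. It assumes a constant checkpoint cost `C`, a mean time between failures `mu`, and a waste model of `C/T + T/(2*mu)` that ignores recovery time and downtime; the function names `young_period` and `expected_waste` are hypothetical.

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order approximation of the checkpoint period: sqrt(2 * C * mu).

    Assumption: failures are rare relative to the period, and recovery
    and downtime costs are ignored.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Fraction of time wasted under the assumed first-order model.

    C/T is the checkpointing overhead; T/(2*mu) is the expected work
    lost per failure (half a period on average), amortized over time.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical numbers: 60 s per checkpoint, one failure per day.
    C, mu = 60.0, 86400.0
    T = young_period(C, mu)
    print(f"period = {T:.0f} s, waste = {expected_waste(T, C, mu):.4f}")
```

With these hypothetical inputs the period comes out at roughly 3220 seconds. Setting the derivative of the assumed waste model to zero gives the same square-root expression, which is why checkpointing either much more or much less often increases the waste.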

2010
Zizhong Chen

Checkpointing is a typical approach to tolerate failures in today’s supercomputing clusters and computational grids. Checkpoint data can be saved either in central stable storage, or in processor memory (as in diskless checkpointing), or local disk space (replacing memory with local disk in diskless checkpointing). But where to save the checkpoint data has a great impact on the performance of a...

2007
Maria Chtepen Filip H. A. Claeys Bart Dhoedt Filip De Turck Peter A. Vanrolleghem Piet Demeester

As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce ...

Journal: CoRR 2016
Jeffrey Mark Siskind Barak A. Pearlmutter

Heretofore, automatic checkpointing at procedure-call boundaries [1], to reduce the space complexity of reverse mode, has been provided by systems like Tapenade [2]. However, binomial checkpointing, or treeverse [3], has only been provided in AD systems in special cases, e.g., through user-provided pragmas on DO loops in Tapenade, or as the nested taping mechanism in ADOL-C for time integration...

Journal: J. Parallel Distrib. Comput. 2007
Partha Sarathi Mandal Krishnendu Mukhopadhyaya

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing algorithms for data fault detection and correction, checkpointing, and recovery are proposed for a ring topology. The proposed data fault detection and correction algorithms can handle data faults, at most one per process, but in any numbe...
