checkpointing

Recomputation Enabled Efficient Checkpointing

Journal: CoRR 2017

Ismail Akturk, Ulya R. Karpuzcu

Systematic checkpointing of the machine state makes restart of execution from a safe state possible upon detection of an error. The time and energy overhead of checkpointing, however, grows with the frequency of checkpointing. Amortizing this overhead becomes especially challenging, considering the growth of expected error rates, as checkpointing frequency tends to increase with increasing erro...


An Efficient Recovery Mechanism with Checkpointing Approach for Cluster Federation

2014

Manoj Kumar

Checkpoint and recovery protocols are commonly used in distributed applications for providing fault tolerance. A distributed system may require taking checkpoints from time to time to keep it free of arbitrary failures. In case of failure, the system will rollback to checkpoints where global consistency is preserved. Checkpointing is one of the fault-tolerant techniques to restore faults and to...


An Efficient Time-Based Checkpointing Protocol for Mobile Computing Systems over Mobile IP

Journal: MONET 2003

Chi-Yi Lin, Szu-Chi Wang, Sy-Yen Kuo

Time-based coordinated checkpointing protocols are well suited for mobile computing systems because no explicit coordination message is needed while the advantages of coordinated checkpointing are kept. However, without coordination, every process has to take a checkpoint during a checkpointing process. In this paper, an efficient time-based coordinated checkpointing protocol for mobile computi...


Checkpointing Orchestration for Performance Improvement

2010

Hui Jin

Checkpointing is a widely used mechanism for supporting fault tolerance in high performance computing (HPC), but it is notorious for its expensive disk access. Parallel file systems such as Lustre, GPFS, and PVFS are widely deployed on supercomputers to provide fast I/O bandwidth for general data-intensive applications. However, the unique feature of checkpointing makes it impossible to benefit from the ...


Pii: S0950-5849(99)00057-9

2000

S. K. Woo, M. H. Kim, Y. J. Lee

In main memory databases, fuzzy checkpointing gives less transaction overhead due to its asynchronous backup feature. However, till now, fuzzy checkpointing has considered only physical logging schemes. The size of the physical log records is very large, and hence it incurs space and recovery processing overhead. In this paper, we propose a recovery method based on a hybrid logging scheme, whic...


Performance analysis of different checkpointing and recovery schemes using stochastic model

Journal: J. Parallel Distrib. Comput. 2006

Partha Sarathi Mandal, Krishnendu Mukhopadhyaya

Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have derived expressions for average cost of checkpointing, rollback recovery, message logging and piggybacking with application messages in synchronous as well as asynchronous checkpointing. For quasi-synchronous checkpointin...


On Real-Time Quasi-Durable Checkpointing

1996

Jiandong Huang, Peng-Jun Wan, Vicraj Thomas

This study investigates real-time checkpointing techniques in the context of distributed process control applications where checkpointing and recovery operations must meet timing constraints, such as process deadline and plant state validity. We introduce the notion of quasi-durability, which allows one to make tradeoffs between storage device reliability and the process control and recovery ti...


Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems

Journal: J. Parallel Distrib. Comput. 2001

James S. Plank, Michael G. Thomason

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performa...


Application-Level Checkpointing Techniques for Parallel Programs

2006

John Paul Walters, Vipin Chaudhary

In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, e.g. the computation state is saved to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(s), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques in every l...
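
As an illustration of the "simplest form" described in this abstract, the following is a minimal sketch of application-level checkpointing, not the authors' method. It assumes the computation state fits in a small dictionary and is saved to the filesystem with Python's `pickle`. The names `save_checkpoint`, `load_checkpoint`, `run_sum`, and the `fail_at` parameter are hypothetical and exist only for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path.

    The rename ensures a crash during the write never leaves a
    half-written checkpoint file at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(path, n, interval, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    `fail_at` (hypothetical, for demonstration) raises an error when the
    loop reaches that iteration, simulating a process failure.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` again with the same `path` after a failure resumes from the last saved iteration instead of from zero, so only the work done since the most recent checkpoint is repeated.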


A checkpointing-recovery scheme for domino free distributed systems

2000

Francesco Quaglia, Bruno Ciciani, Roberto Baldoni

Many communication induced checkpointing algorithms have been proposed for asynchronous cooperating processes. All of them suffer from overhead due both to the exchange of control information and to the insertion of local checkpoints additional to the basic ones. In this paper we propose a low overhead checkpointing-recovery scheme. It consists of a domino-free checkpointing algorithm plus an a...
