checkpointing

Performance and Effectiveness Analysis of Checkpointing in Mobile Environments

2003

Xinyu Chen, Michael R. Lyu

Many mathematical models have been proposed to evaluate the execution performance of an application with and without checkpointing in the presence of failures. They assume that the total program execution time without failure is known in advance, under which condition the optimal checkpointing interval can be determined. In mobile environments, application components are distributed and tasks a...
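For context, a classical first-order rule for choosing a checkpointing interval is Young's approximation, which sets the interval to the square root of twice the checkpoint cost times the mean time between failures. It is a textbook result, not the model of the paper above, and it only holds when the checkpoint cost is small compared with the mean time between failures. The function and parameter names below are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed constant).
    mtbf_s: mean time between failures, in seconds (assumed exponential failures).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical numbers: a 60 s checkpoint and one failure per day on average.
    print(f"{young_interval(60.0, 24 * 3600.0):.0f} s between checkpoints")
```

A shorter interval wastes time writing checkpoints; a longer one loses more work per failure. The square-root form is where those two costs balance to first order.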

Coherence-based Coordinated Checkpointing for Software Distributed Shared Memory Systems

2000

Angkul Kongmunvattana, Santipong Tanchatchawal, Nian-Feng Tzeng

Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel computing environments on clusters of workstations. In this paper, we propose a new, efficient coordinated checkpointing technique, called coherence-based coordinated checkpointing (CCC), for SDSM. Our CCC minimizes both th...

Performance and reliability trade-offs for the double checkpointing algorithm

Journal: IJNC 2014

Jack J. Dongarra, Thomas Hérault, Yves Robert

Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé [23] with the non-blocking algorithm of Ni, Meneses and Kalé [15] in terms of both performance and risk. We also extend the model proposed in [23, 15] to assess the...

User-Level Socket-Based Checkpointing for Distributed and Parallel Computation

Journal: CoRR 2007

Jason Ansel, Michael Rieker, Gene Cooperman

We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes. It also includes checkpointing of remotely spawned processes via ssh and other mechanisms. As with all user-level c...

Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems

Journal: IJDST 2011

Parveen Kumar, Rachit Garg

Minimum-process coordinated checkpointing is a suitable approach to introduce fault tolerance in mobile distributed systems transparently. In order to balance the checkpointing overhead and the loss of computation on recovery, the authors propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the execution of minimum-process coordinated checkpoint...

Compiler Supported Interval Optimisation for Communication Induced Checkpointing

2007

Jörg Preißinger, Mark Pflüger

There are three main approaches to checkpoint-based recovery in distributed systems: coordinated checkpointing, uncoordinated checkpointing and communication induced checkpointing. It can be shown that communication induced checkpointing theoretically has the lowest minimum overhead, but also that the effective overhead depends on the communication behaviour and the res...

Incremental Checkpointing based on Java Source Code Refactoring

2005

Thomas Huining Feng

In this project, incremental checkpointing is developed specifically for Java programs. The checkpointing scheme has the flavor of source code refactoring: it performs almost all of the (rule-based) transformation automatically, requiring little (and in many cases no) interaction with the programmer. Incremental checkpointing is based on a logging technique that records the change in states instead of ...
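The idea of recording changes rather than whole states can be sketched in a few lines. The example below is a generic illustration in Python, not the Java refactoring tool described above; the class `IncrementalStore` and its methods are hypothetical names, and key deletion is deliberately left out to keep the sketch short.

```python
class IncrementalStore:
    """Toy key-value state that checkpoints only what changed (illustrative)."""

    def __init__(self):
        self.state = {}
        self._dirty = {}   # keys written since the last checkpoint
        self.deltas = []   # one dict of changes per checkpoint

    def set(self, key, value):
        # Every write updates the live state and is logged as a pending change.
        self.state[key] = value
        self._dirty[key] = value

    def checkpoint(self):
        """Store only the changes made since the previous checkpoint."""
        self.deltas.append(dict(self._dirty))
        self._dirty.clear()
        return len(self.deltas) - 1

    def restore(self, checkpoint_id):
        """Rebuild the state by replaying deltas 0..checkpoint_id in order."""
        rebuilt = {}
        for delta in self.deltas[: checkpoint_id + 1]:
            rebuilt.update(delta)
        self.state = rebuilt
        self._dirty.clear()
        del self.deltas[checkpoint_id + 1:]  # discard checkpoints taken later


if __name__ == "__main__":
    store = IncrementalStore()
    store.set("a", 1)
    store.set("b", 2)
    first = store.checkpoint()
    store.set("b", 3)
    second = store.checkpoint()
    store.set("a", 9)          # uncommitted change, lost on restore
    store.restore(second)
    print(store.state)
```

Each checkpoint costs space proportional to the number of changed keys, at the price of a replay on recovery.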

1998

James S. Plank, Michael G. Thomason

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. This paper makes three small contributions to this research area. First, we show how to apply the concept of availability from reliability theory as a useful metric for checkpointing systems. Second, we study the average availability of uniprocessor checkpointing systems, using the libck...

Enhanced Two-level Fault Recovery Scheme Combined with Message Logging

2008

Ch. D. V. Subba Rao, M. M. Naidu

Checkpointing schemes facilitate fault recovery in distributed systems. The two-level fault recovery scheme for distributed systems inherits the merits of both disk-based and diskless checkpointing schemes. The present work extends James S. Plank’s diskless checkpointing scheme (N+1 Parity) by introducing a ‘Timeout’ to checkpoint programs with high locality of reference. This mechanism enables ap...
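The N+1 parity idea behind diskless checkpointing can be sketched as follows: N processes keep their checkpoints in memory, and one extra process holds the bytewise XOR of all of them, so any single lost checkpoint can be rebuilt from the survivors plus the parity. This is a minimal sketch under the assumption that all checkpoints are byte strings of equal length; the function names are hypothetical, and the ‘Timeout’ extension described above is not modeled.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity(checkpoints: list) -> bytes:
    """Parity block held by the extra (N+1-th) process."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving: list, parity_block: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity_block)


if __name__ == "__main__":
    # Three application processes with in-memory checkpoints (made-up data).
    checkpoints = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    p = parity(checkpoints)
    # Process 1 fails; its checkpoint is rebuilt from processes 0 and 2.
    rebuilt = recover([checkpoints[0], checkpoints[2]], p)
    print(rebuilt == checkpoints[1])
```

A single parity block tolerates the loss of only one checkpoint at a time.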

Efficient Diskless Checkpointing and Log Based Recovery Schemes

2010

Subba Rao Sai Krishna

Checkpointing and message logging are popular, general-purpose tools for providing fault tolerance in distributed systems. Diskless checkpointing schemes enable frequent checkpointing without a performance penalty. The present work extends James S. Plank’s diskless checkpointing scheme (N+1 Parity) by introducing a ‘Timeout’ mechanism to checkpoint programs with high locality of reference. T...
