coordinated checkpointing

Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

2007

Maria Chtepen, Filip H. A. Claeys, Bart Dhoedt, Filip De Turck, Peter A. Vanrolleghem, Piet Demeester

As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce ...

Binomial Checkpointing for Arbitrary Programs with No User Annotation

Journal: CoRR, 2016

Jeffrey Mark Siskind, Barak A. Pearlmutter

Heretofore, automatic checkpointing at procedure-call boundaries [1], to reduce the space complexity of reverse mode, has been provided by systems like Tapenade [2]. However, binomial checkpointing, or treeverse [3], has only been provided in AD systems in special cases, e.g., through user-provided pragmas on DO loops in Tapenade, or as the nested taping mechanism in ADOL-C for time integration...

Self-stabilizing algorithm for checkpointing in a distributed system

Journal: J. Parallel Distrib. Comput., 2007

Partha Sarathi Mandal, Krishnendu Mukhopadhyaya

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms are proposed in a ring topology. The proposed data fault detection and correction algorithms can handle data faults, at most one per process, but in any numbe...

A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications

2007

John Paul Walters, Vipin Chaudhary

As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend...

Checkpointing Orchestrated Web Services

2012

A. Vani Vathsala, Hrushikesha Mohanty

Web Services are built on service-oriented architecture which is based on the notion of building applications by discovering and orchestrating services available on the web. Complex business processes can be realized by discovering and orchestrating already available services on the web. In order to make these orchestrated web services resilient to faults, we proposed a simple and elegant check...

The Performance of Consistent Checkpointing in Distributed Shared Memory Systems

1995

Gilbert Cabillic, Gilles Muller, Isabelle Puaut

This paper presents the design and implementation of a consistent checkpointing scheme for Distributed Shared Memory (DSM) systems. Our approach relies on the integration of checkpoints within synchronization barriers already existing in applications; this avoids the need to introduce an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance...

On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems

2014

Matthew Forshaw, A. Stephen McGough, Nigel Thomas

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware and software failures and interruptions from resource owners. With increasing scrutiny of the energy consumption of IT infrastructures, it is important to understand the impact of checkpoint...

A Low Overhead Recovery Technique Using Quasi-Synchronous Checkpointing

1996

D. Manivannan, Mukesh Singhal

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps bound rollback propagation during a recovery. Thus, it has th...

Efficient and Flexible Sensornet Checkpointing

2014

Andreas Löscher, Nicolas Tsiftes, Thiemo Voigt, Vlado Handziski

Developing sensornet software is difficult partly because of the limited visibility of the system state of deployed nodes. Sensornet checkpointing is a method that allows developers to save and restore full system state of nodes. We present four extensions to sensornet checkpointing—compression, binary diffs, selective checkpointing, and checkpoint inspection—that reduce the time required for c...

Logging and Recovery in Adaptive Software Distributed Shared Memory Systems

1999

Angkul Kongmunvattana, Nian-Feng Tzeng

Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstract (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of software DSM failures increases as the system size grows. This paper presents a new, efficient logging protocol for adapti...
