Search results for: coordinated checkpointing

Number of results: 48092

2007
Maria Chtepen Filip H. A. Claeys Bart Dhoedt Filip De Turck Peter A. Vanrolleghem Piet Demeester

As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce ...

Journal: :CoRR 2016
Jeffrey Mark Siskind Barak A. Pearlmutter

Heretofore, automatic checkpointing at procedure-call boundaries [1], to reduce the space complexity of reverse mode, has been provided by systems like Tapenade [2]. However, binomial checkpointing, or treeverse [3], has only been provided in AD systems in special cases, e.g., through user-provided pragmas on DO loops in Tapenade, or as the nested taping mechanism in ADOL-C for time integration...

Journal: :J. Parallel Distrib. Comput. 2007
Partha Sarathi Mandal Krishnendu Mukhopadhyaya

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing algorithms for data fault detection and correction, checkpointing, and recovery are proposed for a ring topology. The proposed data fault detection and correction algorithms can handle data faults, at most one per process, but in any numbe...

2007
John Paul Walters Vipin Chaudhary

As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require a central storage for storing checkpoints. This severely limits the scalability of checkpointing. We propose a scalable replication-based MPI checkpointing facility that is based on LAM/MPI. We extend...

2012
A.Vani Vathsala Hrushikesha Mohanty

Web Services are built on service-oriented architecture, which is based on the notion of building applications by discovering and orchestrating services available on the web. Complex business processes can be realized by discovering and orchestrating already available services on the web. In order to make these orchestrated web services resilient to faults, we proposed a simple and elegant check...

1995
Gilbert Cabillic Gilles Muller Isabelle Puaut

This paper presents the design and implementation of a consistent checkpointing scheme for Distributed Shared Memory (DSM) systems. Our approach relies on the integration of checkpoints within synchronization barriers already existing in applications; this avoids the need to introduce an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance...

2014
Matthew Forshaw A. Stephen McGough Nigel Thomas

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware and software failures and interruptions from resource owners. With increasing scrutiny of the energy consumption of IT infrastructures, it is important to understand the impact of checkpoint...

1996
D. Manivannan Mukesh Singhal

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously, and it uses communication-induced checkpoint coordination to advance the recovery line, which helps bound rollback propagation during a recovery. Thus, it has th...

2014
Andreas Löscher Nicolas Tsiftes Thiemo Voigt Vlado Handziski

Developing sensornet software is difficult partly because of the limited visibility of the system state of deployed nodes. Sensornet checkpointing is a method that allows developers to save and restore full system state of nodes. We present four extensions to sensornet checkpointing—compression, binary diffs, selective checkpointing, and checkpoint inspection—that reduce the time required for c...

1999
Angkul Kongmunvattana Nian-Feng Tzeng

Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of software DSM failures increases as the system size grows. This paper presents a new, efficient logging protocol for adapti...
