Distributed System Fault Tolerance Using Message Logging and Checkpointing
Abstract
Fault tolerance can allow processes executing in a computer system to survive failures within the system. This thesis addresses the theory and practice of transparent fault-tolerance methods using message logging and checkpointing in distributed systems. A general model for reasoning about the behavior and correctness of these methods is developed, and the design, implementation, and performance of two new low-overhead methods based on this model are presented. No specialized hardware is required with these new methods.
The model is independent of the protocols used in the system. Each process state is represented by a dependency vector, and each system state is represented by a dependency matrix showing a collection of process states. The set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system states as sublattices. There is thus always a unique maximum recoverable system state.
The first method presented uses a new pessimistic message logging protocol called sender-based message logging. Each message is logged in the local volatile memory of the machine from which it was sent, and the order in which the message was received is returned to the sender as a receive sequence number. Message logging overlaps execution of the receiver until the receiver attempts to send a new message. Implemented in the V-System, the maximum measured failure-free overhead on distributed application programs was under [...] percent, and average overhead measured [...] percent or less, depending on problem size and communication intensity.
Optimistic message logging can outperform pessimistic logging, since message logging occurs asynchronously. A new optimistic message logging system is presented that guarantees to find the maximum possible recoverable system state, which is not ensured by previous optimistic methods. All logged messages and checkpoints are utilized, and thus some messages received by a process before it was checkpointed may not need to be logged. Although failure recovery using optimistic message logging is more difficult, failure-free application overhead using this method ranged from only a maximum of under [...] percent to much less than [...] percent.
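The sender-based logging idea described in the abstract can be illustrated with a small, single-threaded sketch. It is not the thesis's V-System implementation: the names `Process`, `LogEntry`, and `replay_for` are hypothetical, the acknowledgement that carries the receive sequence number (RSN) is modeled as a plain return value, checkpoints are ignored (replay starts from the initial state), and the rule that a receiver may not send until its RSNs are logged is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    """One message kept in the sender's volatile log (hypothetical layout)."""
    dest: str
    payload: str
    rsn: Optional[int] = None  # receive sequence number, set once returned


@dataclass
class Process:
    name: str
    received_count: int = 0
    delivered: List[str] = field(default_factory=list)
    send_log: List[LogEntry] = field(default_factory=list)

    def send(self, dest: "Process", payload: str) -> None:
        # Log the message at the sender, not at the receiver.
        entry = LogEntry(dest=dest.name, payload=payload)
        self.send_log.append(entry)
        # The receiver reports the order in which it received the message;
        # here the acknowledgement is simply the return value.
        entry.rsn = dest.receive(payload)

    def receive(self, payload: str) -> int:
        rsn = self.received_count
        self.received_count += 1
        self.delivered.append(payload)
        return rsn


def replay_for(failed_name: str, survivors: List[Process]) -> List[str]:
    """Rebuild the failed process's input sequence from the senders' logs."""
    entries = [
        e
        for p in survivors
        for e in p.send_log
        if e.dest == failed_name and e.rsn is not None
    ]
    entries.sort(key=lambda e: e.rsn)  # RSNs restore the original receive order
    return [e.payload for e in entries]


# Usage: two senders interleave messages to c; c then "fails".
a, b, c = Process("a"), Process("b"), Process("c")
a.send(c, "m1")
b.send(c, "m2")
a.send(c, "m3")
order_before_failure = list(c.delivered)
replayed = replay_for("c", [a, b])
```

In this toy run, the replayed sequence matches the order in which `c` originally received its messages, even though the log is spread across two senders. A message whose RSN never reached its sender would be skipped by `replay_for`, which is one reason a real protocol has to constrain when the receiver may send again.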
Similar resources
Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing
In a distributed system using message logging and checkpointing to provide fault tolerance, there is always a unique maximum recoverable system state, regardless of the message logging protocol used. The proof of this relies on the observation that the set of system states that have occurred during any single execution of a system forms a lattice, with the sets of consistent and recoverable system ...
Output Driven Distributed Optimistic Message Logging and Checkpointing
Although optimistic fault-tolerance methods using message logging and checkpointing have the potential to provide highly efficient, transparent fault tolerance in distributed systems, existing methods are limited by several factors. Coordinating the asynchronous message logging progress among all processes of the system may cause significant overhead, limiting their ability to scale to large systems ...
The performance of independent checkpointing in distributed systems
This paper describes performance measurements of an implementation of independent checkpointing in a network of workstations. Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. Because processes do not coordinate during checkpointing, this technique has a low run-time overhead. To avoid the classical domino effect, our implementation relies on a...
Enhanced Two-level Fault Recovery Scheme Combined with Message Logging
Checkpointing schemes facilitate fault recovery in distributed systems. The two-level fault recovery scheme for distributed systems inherits the merits of both disk-based and diskless checkpointing schemes. The present work extends James S. Plank's diskless checkpointing scheme (N+1 Parity) by introducing a 'Timeout' to checkpoint programs with high locality of reference. This mechanism enables ap...
Anti-message Logging based Check Pointing Algorithm for Mobile Distributed Systems
Checkpointing is one of the commonly used techniques for providing fault tolerance in distributed systems, so that the system can operate even if one or more components have failed. However, mobile computing systems are constrained by low bandwidth, mobility, lack of stable storage, frequent disconnections, and limited battery life. Hence checkpointing protocols which have fewer checkpoints are pref...
Publication date: 1989