Rollback Recovery Scheme for Distributed Shared Memory Clusters

Author

  • Minakshi Tripathy
Abstract

This paper proposes a unified, lightweight error recovery scheme for distributed shared memory clusters, based on coordinated checkpointing and rollback. The scheme maintains multiple globally consistent checkpoints of the cluster state and, after a fault, recovers to a checkpoint taken before the fault occurred. The paper also describes and evaluates the coordinated checkpointing mechanism. Coordinated checkpointing requires neither an exchange of coordination messages nor additional information in the process messages, and it accesses stable storage only when checkpoints are saved. Each process saves its state independently of the other processes, with checkpoint timers set at the different processes. Performance evaluation shows that the proposed scheme outperforms previously proposed checkpoint and recovery schemes for distributed shared memory clusters.
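The abstract gives no implementation details, so the following is only a rough illustration of the general idea of timer-driven checkpointing, not the author's scheme. It assumes a shared logical clock (ticks), a hypothetical common checkpoint period `interval`, and an in-memory dictionary standing in for stable storage. All names (`Process`, `recovery_line`, `run_demo`) are invented for the example, and the sketch ignores in-flight messages and shared-memory coherence, which a real scheme would have to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    interval: int  # hypothetical shared checkpoint period, in ticks
    state: int = 0
    # Stand-in for stable storage: checkpoint index -> saved state.
    checkpoints: dict = field(default_factory=dict)

    def tick(self, now: int) -> None:
        self.state += self.pid + 1  # stand-in for application work
        if now % self.interval == 0:
            # Local timer expired: save state without contacting any peer.
            self.checkpoints[now // self.interval] = self.state

    def restore(self, index: int) -> None:
        self.state = self.checkpoints[index]


def recovery_line(processes, fault_tick, interval):
    """Latest checkpoint index that every process saved before the fault tick."""
    common = set.intersection(*(set(p.checkpoints) for p in processes))
    eligible = [i for i in common if i * interval < fault_tick]
    return max(eligible) if eligible else None


def run_demo(fault_tick=12, interval=5, last_tick=12):
    procs = [Process(pid, interval) for pid in range(3)]
    for now in range(1, last_tick + 1):
        for p in procs:
            p.tick(now)
    line = recovery_line(procs, fault_tick, interval)
    if line is not None:
        for p in procs:
            p.restore(line)  # roll every process back to the same index
    return line, [p.state for p in procs]
```

Because several checkpoints are retained per process, `recovery_line` can skip over a checkpoint taken after the fault occurred and fall back to an earlier one. For example, `run_demo(fault_tick=7)` rolls back to checkpoint index 1 even though index 2 has also been saved by tick 12.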

Similar Resources

Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determin...


Ensuring Correct Rollback Recovery in Distributed Shared Memory Systems

Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing met...


Fault-Tolerance Using Cache-Coherent Distributed Shared Memory Systems

In this paper, we describe new protocols augmenting traditional cache coherency mechanisms to implement fault-tolerance based on Recovery Blocks and checkpointing. Concurrent processes compound rollback recovery since the rollback can potentially lead to a "domino-effect" whereby the process is rolled back to the beginning. Several approaches have been proposed to limit the domino effect. One s...


Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory

Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in the system and running time of distributed applications increase, the likelihood of processor failure increases. A method of recovering processes running in a distributed shared memory environment which minimizes lost work and the cost of recovery is desirable so that lon...


An Improved Logging and Checkpointing Scheme for Recoverable Distributed Shared Memory

A distributed shared memory (DSM) system transforms an existing network of workstations into a powerful shared-memory parallel computer that can deliver superior price/performance. However, as more workstations are engaged in the system and execution times grow longer, the probability of faults increases, which could render the system useless. Several checkpointing and logging schemes have been propos...


Publication date: 2011