fault recovery

Fault Recovery for Distributed Shared Memory

2007

William R. Dieter, James E. Lumpp

ion between the user's application and the message passing primitives available on the target systems. Distributed shared memory (DSM) provides users with the abstraction of shared memory on networks of physically distributed machines. This programming model is widely considered to be more intuitive for programmers as compared to message passing languages. Because DSM systems are implemented on...


Phoinix: A Fault-Tolerant Object Service in OMA

2007

Deron Liang, S. C. Chou, S. M. Yuan

The Object Management Architecture (OMA) has been recognized as a de facto standard in the development of object services in distributed computing environments. In a distributed system, the provision for failure recovery is always a vital design issue. However, the fault-tolerant service has not been extensively considered in the current OMA framework, despite the fact that an increasing number o...


Zotavenie zložitého systému pomocou programovej redundancie procesov

2004

Liberios Vokorokos

Complex system recovery by process programming redundancy. This paper presents the recovery of a fault-resistant control system. We start from a parallel computer system with distributed memory and communication based on message exchange. This system consists of processor elements, communication lines and switches. At least one application process is running on each of the proces...


Improving Existing Fault Recovery Policies

2009

Guy Shani, Christopher Meek

An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing ...


Exploiting Value Prediction for Fault Tolerance

2008

Xuanhua Li, Donald Yeung

Technology scaling has led to growing concerns about reliability in microprocessors. Currently, fault tolerance techniques rely on explicit redundant execution for fault detection or recovery, which incurs significant performance, power, or hardware overhead. This paper makes the observation that value predictability is a low-cost (albeit imperfect) form of program redundancy that can be exploit...


Evaluation of Fault Tolerance Latency from Real-Time Application's Perspectives

Journal: IEEE Trans. Computers, 2000

Hagbae Kim, Kang G. Shin

The Fault-Tolerance Latency (FTL), defined as the time required by all sequential steps taken to recover from an error, is important to the design and evaluation of fault-tolerant computers used in safety-critical real-time control systems. To meet timing constraints or avoid dynamic failure, the latency of any fault-handling policy, which consists of several stages like error detection, fault loc...


Segregated failures model for availability evaluation of fault-tolerant systems

2006

Sergiy A. Vilkomir, David Lorge Parnas, Veena B. Mendiratta, Eamonn Murphy

This paper presents a method of estimating the availability of fault-tolerant computer systems with several recovery procedures. A segregated failures model has been proposed recently for this purpose. This paper provides further analysis and extension of this model. The segregated failures model is compared with a Markov chain model and is extended for the situation when the coverage factor is...


Boolean Logic with Fault Tolerant Coding

Journal: CoRR, 2009

B. Baykant Alagoz

Error-detectable and error-correctable coding in Hamming space was researched to discover possible fault-tolerant coding constellations that can implement Boolean logic with a fault-tolerant property. Basic logic operators of Boolean algebra were developed to apply fault-tolerant coding in logic circuits. It was shown that application of three-bit fault tolerant codes have pro...


Proactive Service Migration for Long-Running Byzantine Fault Tolerant Systems

Journal: IET Software, 2009

Wenbing Zhao

In this paper, we describe a novel proactive recovery scheme based on service migration for long-running Byzantine fault-tolerant systems. Proactive recovery is an essential method for ensuring the long-term reliability of fault-tolerant systems that are under continuous threat from malicious adversaries. The primary benefit of our proactive recovery scheme is a reduced vulnerability window. This ...


On Fault Recovery of Firm Real-Time Computer Systems with Communication and Resource Requirements

2007

Ahmad Abualsamid, Mohamed Osama

Fault tolerance and fault recovery are integral parts of real-time systems. The literature addresses fault recovery via two main methods: hardware redundancy and task replication. Although much research has been done in this area, most of the work falls within one of the two streams or combines both. One point that did not have en...
