Application-transparent fault tolerance in distributed systems
Abstract
We present a new software architecture in which all concepts necessary to achieve fault tolerance can be added to an application automatically, without any source code changes. As a case study, we consider the problem of providing a reliable service despite node failures by executing a group of replicated servers. Replica creation and management, as well as failure detection and recovery, are performed automatically by a separate fault tolerance layer (ft-layer), which is inserted between the server application and the operating system kernel. The layer is invisible to the application because it provides the same functional interface as the operating system kernel, thus making the fault tolerance property of the service completely transparent to the application. A major advantage of our architecture is that the layer encapsulates both fault tolerance mechanisms and policies. This allows for maximum flexibility in the choice of appropriate methods for fault tolerance without any changes in the application code.

Due to the steadily increasing complexity of application programs, more and more sophisticated concepts are required to make these programs resistant to system component failures. Over the past years, several methods for achieving this fault tolerance have been developed and are now well understood. Basic concepts which are necessary for reliable applications have been identified and thoroughly studied (process surveillance [Bec91], membership information distribution [Cri91], checkpointing and recovery [KoT87], reliable and order-preserving multicast protocols [BeG93], various replication mechanisms [Pow91], etc.).

However, up to now little has been done to support a non-expert application programmer in building reliable applications. Although their development is facilitated by toolkits (e.g. the ISIS toolkit [BiJ87]) which provide the basic building blocks of fault tolerance, the programmer still requires precise knowledge of the concepts implemented by these toolkits in order to use them successfully. Even with the help of toolkits, fault tolerance policies and mechanisms still have to be programmed explicitly as part of the application code. This close interrelation of the fault tolerance mechanisms and the application code has two severe drawbacks:

• Distributed fault tolerant applications become much more difficult to test because the implemented fault tolerance concepts have to be taken into account when selecting proper test scenarios. Often the failure handling parts of such an application are particularly difficult to test, as subtle combinations of failure situations have to be simulated.
• Once the fault tolerance policy is chosen, the application is restricted to this choice. Switching …
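The central idea of the abstract, a layer that offers the same functional interface as the kernel and hides replication behind it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Kernel`, `FtLayer`, `mark_failed`, `application`, and the single `send` call are all hypothetical, and the "policy" shown (send every message to each replica not marked as failed) is only one assumed choice among the mechanisms the abstract lists.

```python
from typing import Dict, List, Protocol, Set, Tuple


class KernelInterface(Protocol):
    """Functional interface the application programs against (hypothetical)."""

    def send(self, dest: str, msg: bytes) -> None: ...


class Kernel:
    """Stand-in for the operating system kernel; it only records what was sent."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, bytes]] = []

    def send(self, dest: str, msg: bytes) -> None:
        self.log.append((dest, msg))


class FtLayer:
    """Offers the same send() as Kernel, but fans each message out to live replicas.

    The replica table and the failure set are assumptions made for this sketch.
    """

    def __init__(self, kernel: KernelInterface, replicas: Dict[str, List[str]]) -> None:
        self._kernel = kernel
        self._replicas = replicas
        self._failed: Set[str] = set()

    def mark_failed(self, node: str) -> None:
        # In a real system a failure detector would call this; here it is manual.
        self._failed.add(node)

    def send(self, dest: str, msg: bytes) -> None:
        # Destinations without a replica group are passed through unchanged.
        for node in self._replicas.get(dest, [dest]):
            if node not in self._failed:
                self._kernel.send(node, msg)


def application(os_iface: KernelInterface) -> None:
    """Unmodified application code: it only knows the kernel-style interface."""
    os_iface.send("service", b"request")
```

Passing either `Kernel()` or `FtLayer(Kernel(), {"service": ["node-a", "node-b"]})` to `application` works without touching the application code, and replacing `FtLayer` with a class implementing a different policy would be equally invisible to it, which is the flexibility the abstract argues for.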
Similar resources
A Distributed Web Information Systems Platform Supporting High Responsiveness and Fault Tolerance
Distributed replication of databases underlying web information systems is a viable way to solve problems of responsiveness and fault tolerance. We describe the middleware platform DIWISA for transparent object-oriented development of distributed web information systems. Support for distributed replication of information as well as fault tolerance and error recovery of web information systems i...
Transparent Fault Tolerance for Web Services Based Architectures
Service-based architectures enable the development of new classes of Grid and distributed applications. One of the main capabilities provided by such systems is the dynamic and flexible integration of services, according to which services are allowed to be a part of more than one distributed system and simultaneously serve different applications. This increased flexibility in system composition...
MPICH-GF: Transparent Checkpointing and Rollback-Recovery for Grid-Enabled MPI Processes
Fault tolerance is an essential element of distributed systems that require a reliable computation environment. In spite of extensive research over two decades, practical fault-tolerance systems have not been provided. This is due to the high overhead and unwieldiness of previous fault-tolerance systems. In this paper, we propose MPICH-GF, a user-transparent checkpointing system f...
Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments
This paper presents the performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. STAR is application independent, highly configurable and easily portable to UNIX-like operating s...
Efficient Transparent Application Recovery In Client-Server Information Systems
Database systems recover persistent data, providing high database availability. However, database applications, typically residing on client or “middle-tier” application-server machines, may lose work because of a server failure. This prevents the masking of server failures from the human user and substantially degrades application availability. This paper aims to enable high application availa...
A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems
A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...
Publication date: 1994