Multi-Layer Fault Tolerance for Distributed Real-Time Systems
Abstract
This thesis addresses issues in building fault-tolerant distributed real-time systems, which are increasingly deployed in automotive and avionics applications. We focus on the design and validation of fault tolerance mechanisms. From the design viewpoint, we develop the notion of multi-layer fault tolerance. A fault-tolerant distributed system contains a set of mechanisms that provide error detection and recovery. These mechanisms can be structured into three layers, based on where they are implemented and which parts of the system they involve. Circuit layer mechanisms provide the basic fault tolerance implemented in hardware; node layer mechanisms are executed locally in computer nodes; and system layer mechanisms involve multiple computer nodes to prevent faults from disturbing the system. We use probabilistic modeling to compare federated and integrated architectures. Federated architectures have few or no fault tolerance mechanisms at the node layer, so a node is the elementary unit of failure; integrated architectures provide robust partitioning mechanisms at the node layer to ensure that individual tasks are the unit of failure. We compare the reliability of the two architectures and propose a set of guidelines for building integrated architectures. The thesis also addresses the problem of distributed redundancy management. We propose a group membership protocol to achieve consensus on the operational state of all nodes. The protocol is based on the principle that, in a system with n nodes, each message sent by a node in the membership is acknowledged by k other nodes. Agreement on node departure is guaranteed if no more than f = k − 1 failures occur during n consecutive transmission slots. Additionally, we provide a solution for reintegrating restarted nodes into the membership. This protocol is part of the system layer of fault tolerance mechanisms.
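The fault hypothesis of the membership protocol (at most f = k − 1 failures in any n consecutive transmission slots) can be checked against a failure trace with a sliding window. The following is only a minimal sketch of that arithmetic, not the protocol itself: the function names and the representation of a trace as a list of slot indices are assumptions made for illustration.

```python
from typing import Iterable


def tolerated_failures(k: int) -> int:
    """Return f = k - 1, the failure bound stated in the abstract."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k - 1


def within_fault_hypothesis(failure_slots: Iterable[int], n: int, k: int) -> bool:
    """Check that no window of n consecutive slots holds more than f failures.

    failure_slots (hypothetical representation): slot index of each failure;
    an index may repeat if several failures fall in the same slot.
    """
    f = tolerated_failures(k)
    slots = sorted(failure_slots)
    start = 0
    for end, slot in enumerate(slots):
        # Two failures share a window of n slots iff their indices differ by < n.
        while slot - slots[start] >= n:
            start += 1
        if end - start + 1 > f:
            return False
    return True
```

For example, with n = 4 and k = 2 (so f = 1), failures in slots 0 and 4 fall in different windows and satisfy the hypothesis, whereas failures in slots 0 and 3 share a window and violate it.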
We address the validation of fault tolerance mechanisms by fault injection. This thesis describes an automated analysis technique to reduce the cost of fault injection campaigns. The analysis uses knowledge of program flow and resource usage to eliminate faults that have no possibility of activation. Our experimental results show that the fault spaces are reduced by several orders of magnitude compared with the usual random approach.
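One way such a pruning analysis could work is a backward liveness pass over an execution trace: a fault placed in a register just before instruction t can only be activated if that register is read before it is next overwritten. The sketch below illustrates the idea under that assumption; the trace format, the register names, and the function name are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical trace entry: (registers read, registers written) by one instruction.
Instruction = Tuple[Iterable[str], Iterable[str]]


def activatable_faults(trace: List[Instruction]) -> Set[Tuple[int, str]]:
    """Return the (time, register) injection points that may be activated.

    A fault injected into a register just before instruction t is kept only
    if the register is live there, i.e. read before its next overwrite.
    """
    live: Set[str] = set()
    kept: Set[Tuple[int, str]] = set()
    # Walk the trace backwards: live_before(t) = reads(t) | (live_after(t) - writes(t)).
    for t in range(len(trace) - 1, -1, -1):
        reads, writes = trace[t]
        live = (live - set(writes)) | set(reads)
        kept.update((t, reg) for reg in live)
    return kept


# Illustrative four-instruction trace over two registers.
example_trace: List[Instruction] = [
    ((), ("r1",)),      # t=0: r1 = load
    (("r1",), ("r2",)), # t=1: r2 = r1 + 1
    ((), ("r1",)),      # t=2: r1 = 5
    (("r2",), ()),      # t=3: store r2
]
```

In this toy trace the full fault space has 4 × 2 = 8 injection points, and `activatable_faults(example_trace)` keeps three of them; the other five target registers that are overwritten or never read again.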
Related resources
A Fault-tolerance Layer for Distributed Fault-tolerant Hard Real-time Systems
This paper describes the conceptual model for, and the implementation of, a software fault-tolerance layer (FT-layer) for distributed fault-tolerant hard realtime systems. This FT-layer provides error detection capabilities, fault-tolerance mechanisms based on active replication, and the interface between the application software running on a node of the distributed system and the communication...
Fault Tolerance in a Multi-Layered DRE System: A Case Study
Dynamic resource management is a crucial part of the infrastructure for emerging distributed real-time embedded systems, responsible for keeping mission-critical applications operating and allocating the resources necessary for them to meet their requirements. Because of this, the resource manager must be fault-tolerant, with nearly continuous operation. This paper describes our efforts to deve...
Proactive Fault-Recovery in Distributed Systems
Supporting both real-time and fault-tolerance properties in systems is challenging because real-time systems require predictable end-to-end schedules and bounded temporal behavior in order to meet task deadlines. However, system failures, which are typically unanticipated events, can disrupt the predefined real-time schedule and result in missed task deadlines. Such disruptions to the real-time...
A Middleware for Dependable Distributed Real-Time Systems
New middleware is proposed to support the development of dependable distributed real-time systems for avionics, sensor and shipboard computing. Many of these systems require distributed computing in order to perform increasingly complex missions. They also require real-time performance, dependable software, and may face constraints that limit hardware redundancy. Real-time performance and fault...
Adding Fault-Tolerance to a Hierarchical DRE System
Dynamic resource management is a crucial part of the infrastructure for emerging mission-critical distributed real-time embedded system. Because of this, the resource manager must be fault-tolerant, with nearly continuous operation. This paper describes an ongoing effort to develop a fault-tolerant multi-layer dynamic resource management capability and the challenges we have encountered, includ...
Publication date: 2007