COMPUTING SCIENCE Architecting Holistic Fault Tolerance
Authors
Rem Gensh, Ashur Rafiev, Alexander Romanovsky (CSR, Newcastle University, Newcastle upon Tyne, UK)
Alessandro Garcia (Informatics Department, PUC-Rio, Rio de Janeiro, Brazil)
Fei Xia, Alex Yakovlev (School of EEE, Newcastle University, Newcastle upon Tyne, UK)

Bibliographical Details
Newcastle University, Computing Science, Technical Report Series, CS-TR-1505. © 2016 Newcastle University. Printed and published by Newcastle University, Computing Science, Claremont Tower, Claremont Road, Newcastle upon Tyne, NE1 7RU, England.

Abstract
The optimality and maintainability of fault tolerance mechanisms in a computer system have typically not been a major topic of concern, mostly because fault tolerance is a non-functional system requirement. This paper proposes a Holistic Fault Tolerance architecture, based on centralised fault tolerance management, with the related functionality distributed across the entire system. A special crosscutting controller chooses the most suitable error detection and error recovery strategies for a given application, depending on error rates, system performance and resource utilisation requirements. We discuss the motivation for introducing this holistic fault tolerance architecture and reason about its benefits from the point of view of optimal system operation and improved maintainability. The advantages and possible implementation challenges of the proposed approach are demonstrated by a real-world application.
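The controller idea in the abstract can be illustrated with a minimal sketch. It is not taken from the report: the names `Strategy`, `SystemState` and `choose_strategy`, the `coverage` and `cost` fields, and the selection rule (cheapest affordable strategy that meets a residual error target) are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical description of one detection/recovery strategy."""
    name: str
    coverage: float  # assumed fraction of errors the strategy tolerates (0..1)
    cost: float      # assumed relative resource overhead of the strategy


@dataclass(frozen=True)
class SystemState:
    """Hypothetical snapshot of what the controller observes."""
    error_rate: float       # observed errors per unit of work
    resource_budget: float  # overhead the system can currently afford


def choose_strategy(strategies, state, max_residual_error_rate):
    """Pick a strategy centrally, based on error rate and resource budget.

    Assumed rule: among strategies that fit the budget, return the cheapest
    one whose residual error rate meets the target; if none meets it,
    fall back to the affordable strategy with the highest coverage.
    """
    affordable = [s for s in strategies if s.cost <= state.resource_budget]
    if not affordable:
        raise ValueError("no strategy fits the resource budget")

    meets_target = [
        s for s in affordable
        if state.error_rate * (1.0 - s.coverage) <= max_residual_error_rate
    ]
    if meets_target:
        return min(meets_target, key=lambda s: s.cost)
    return max(affordable, key=lambda s: s.coverage)
```

Under these assumptions, a low observed error rate leads the controller to pick the cheapest strategy, while a higher error rate or a tighter budget shifts the choice towards more (or less) expensive mechanisms. Keeping the rule in one function is what makes the policy easy to change without touching application code.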
Similar resources
Computing Science: Architecting Fault Tolerant Systems
As building trustworthy (dependable) systems is one of the major challenges faced by software developers, dealing with various threats (such as errors, faults and failures) is becoming one of the main foci of software and system research and development. In the core of ensuring system dependability is acceptance of the fact that errors always happen in spite of all the efforts to eliminate faul...
Experience Report: Evaluation of Holistic Fault Tolerance
Software maintenance is a crucial phase of the software development life cycle. It is important to facilitate this stage, complying with both functional and non-functional requirements. However, very often the main focus is on the functional features of the application, whereas fault tolerance mechanisms are neglected and as a result do not provide sufficient maintainability and reusabilit...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
Improving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
AA – A Software Architecture Aware Environment for Dependable Systems
Explicitly considering software architectural information at all times is now a recognized means for addressing software system dependability. In this paper we propose the basic ideas for AA, an architecture aware environment to improve software system dependability. It builds on ideas from architecting dependable systems, control engineering, and software product lines. AA supports fault toler...
Publication date: 2016