Using Virtualization to Validate Fault-Tolerant Distributed Systems
Authors
Abstract
Asynchronous events and complex system state distributed across independent nodes make exposure and diagnosis of flaws in distributed systems a challenge. The difficulties are exacerbated when the goal is to validate fault tolerance mechanisms that are activated only by the occurrence of errors, which are, by nature, rare. Validation of fault tolerance mechanisms is often done by injecting faults that emulate the actual faults and "stress" the functionality of the resilience mechanisms. Validation campaigns lasting days and involving thousands of fault injections are often necessary. We present an infrastructure that combines virtualization and software-implemented fault injection to automate validation campaigns and support the analysis of the behavior of a distributed system under test. Virtualization enables: 1) a flexible fault injector capable of emulating a wide variety of faults, and 2) a mechanism for autonomously recovering faulty nodes so that the campaign can continue running on a target system that is fully functional. As a case study we use this infrastructure to validate a Byzantine-fault-tolerant cluster manager. Over 1280 hours of fault injections yielded the exposure of 11 unique flaws in the cluster manager.
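The campaign structure described in the abstract (inject a fault, observe the system, then restore every node to a known-good state before the next injection) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the infrastructure from the paper: the `Node` class, the two fault types, the toy quorum workload, and the names `run_workload` and `run_campaign` are all hypothetical, and in-memory state copies stand in for virtual machine snapshots.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one virtual machine of the system under test."""
    name: str
    state: dict = field(default_factory=lambda: {"counter": 0, "healthy": True})

    def snapshot(self):
        # Assumption: a deep copy of in-memory state plays the role of a VM snapshot.
        return copy.deepcopy(self.state)

    def restore(self, snap):
        self.state = copy.deepcopy(snap)


# Hypothetical fault models; a real injector would emulate many more fault types.
FAULTS = {
    "crash": lambda node: node.state.update(healthy=False),
    "bit_flip": lambda node: node.state.update(counter=node.state["counter"] ^ 1),
}


def run_workload(nodes, quorum=3):
    """Toy workload: healthy nodes increment a counter and reply with its value.

    The service counts as correct if at least `quorum` replies agree.
    """
    replies = []
    for node in nodes:
        if node.state["healthy"]:
            node.state["counter"] += 1
            replies.append(node.state["counter"])
    if not replies:
        return False
    most_common = max(set(replies), key=replies.count)
    return replies.count(most_common) >= quorum


def run_campaign(num_injections, seed=0):
    """Run a sequence of single-fault experiments, restoring all nodes after each."""
    rng = random.Random(seed)
    nodes = [Node(f"node{i}") for i in range(4)]
    clean = {node.name: node.snapshot() for node in nodes}
    results = []
    for i in range(num_injections):
        target = rng.choice(nodes)
        fault = rng.choice(sorted(FAULTS))
        FAULTS[fault](target)                 # inject one fault into one node
        ok = run_workload(nodes)              # observe system-level behavior
        results.append((i, target.name, fault, ok))
        for node in nodes:                    # recover so the next run starts clean
            node.restore(clean[node.name])
    return results, nodes


if __name__ == "__main__":
    results, _ = run_campaign(5, seed=1)
    for record in results:
        print(record)
```

Restoring every node after each experiment, rather than only the faulty one, keeps each injection independent of the previous ones, which is the property that lets a long campaign run unattended.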
Related resources
Virtualization Technologies for DTN Testbeds
At present, the Internet relies on the availability of a continuous path from the source to the sink node and on limited delays. These assumptions do not hold in “challenged networks”, which comprise a wide variety of different environments, from sensor networks to space communications (including satellite systems). These networks are the preferred target of Delay/Disruption Tolerant Networking (...
Critical Success Factors for Data Virtualization: A Literature Review
Data Virtualization (DV) has become an important method to store and handle data cost-efficiently. However, it is unclear what kind of data and when data should be virtualized or not. We applied a design science approach in the first stage to get a state of the art of DV regarding data integration and to present a concept matrix. We extend the knowledge base with a systematic literature review ...
Compositional Programming and Testing of Dynamic Distributed Systems
Distributed systems are notoriously difficult to get right as they must deal with concurrency and failures. This paper proposes techniques for building reliable distributed systems with two central contributions: (1) We propose a module system based on the theory of compositional trace refinement for dynamic systems consisting of asynchronously communicating state machines, where state machines ...
A decentralized fault tolerant control strategy for multi-robot systems
The paper presents a fault tolerance control strategy for distributed multi-robot systems. The proposed approach is based on a distributed controller-observer architecture that allows each robot to estimate the global system state using local communication. We derive residual dynamics that allows each robot to detect and isolate faults of other robots, even if they are not directly connected. T...
LOT: A Robust Overlay for Distributed Range Query Processing
Large-scale data-centric services are often handled by clusters of computers that include hundreds of thousands of computing nodes. However, traditional distributed query processing techniques fail to handle the large-scale distribution, peer-to-peer communication and frequent disconnection. In this paper, we introduce LOT, a robust, fault-tolerant and highly distributed overlay network for lar...
Journal title:
Volume, issue
Pages -
Publication date: 2010