Application-Transparent Process-Level Error Recovery for Multicomputers
Abstract
A multicomputer system consisting of hundreds of processors interconnected by point-to-point links can achieve high performance for many important applications. We propose a new application-transparent, process-level, distributed error recovery scheme for multicomputers. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes which are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery sessions may be active simultaneously. The scheme does not require significant overhead during normal operation since it is not necessary to make message transmission atomic, acknowledge each message, or transmit check bits with each packet. We discuss variations of our technique using packet-switching or virtual circuits, and compare our scheme to previously published techniques.
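The abstract states that checkpointing and recovery involve only a set of interacting processes, but it does not say how that set is determined. The following is a minimal sketch under one assumption: the set consists of every process reachable from the initiator through messages exchanged since the last checkpoint. The function name `interacting_set` and the `(sender, receiver)` message log are hypothetical and are not taken from the paper.

```python
from collections import defaultdict, deque


def interacting_set(initiator, messages):
    """Return the processes reachable from `initiator` through message exchanges.

    `messages` is a hypothetical log of (sender, receiver) pairs recorded
    since the last checkpoint. This is an illustrative assumption, not the
    paper's actual mechanism.
    """
    # Treat communication as an undirected relation between processes.
    neighbors = defaultdict(set)
    for sender, receiver in messages:
        neighbors[sender].add(receiver)
        neighbors[receiver].add(sender)

    # Breadth-first search from the process whose timer expired.
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        process = queue.popleft()
        for peer in neighbors[process]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# A, B, and C have exchanged messages; D and E form a separate group.
messages = [("A", "B"), ("B", "C"), ("D", "E")]
participants = interacting_set("A", messages)
```

In this sketch, a checkpoint session started by A would involve A, B, and C, while D and E would keep doing useful work, which matches the behavior the abstract describes for processes outside the interacting set.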
Similar resources
Execution-Driven Simulation of Error Recovery Techniques for Multicomputers
DERT (Distributed Error Recovery Testbed) is a testbed for simulation and performance evaluation of several classes of application-transparent distributed error recovery schemes. DERT is built on top of an event-driven, message-passing, object-oriented, multithreaded simulation kernel. Actual compiled distributed applications are instrumented for data collection and executed on the simulated mu...
Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determin...
Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback
For implementing fault tolerance in multicomputer systems, backward error recovery, based on checkpointing and rollback, is often used. During failure-free operation, the process states are regularly saved, and after a fault is detected, the system is rolled back to a previously saved state. We can distinguish four classes of techniques: semi-automatic techniques, message logging, coordinated ch...
A Software-Based Hardware Fault Tolerance Scheme for Multicomputers
A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by software with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logica...
Recovery in Multicomputers with Finite Error Detection Latency
P. Krishna, N. H. Vaidya, D. K. Pradhan. Computer Science Department, Texas A&M University, College Station, TX 77843-3112. Abstract: In most research on checkpointing and recovery, it has been assumed that the processor halts immediately in response to any internal failure (fail-stop model). This paper presents a recovery scheme (independent checkpointing and message logging) for a multicomputer syst...
Publication date: 1989