Application-Transparent Process-Level Error Recovery for Multicomputers

Authors

Yuval Tamir

Tiffany M. Frazier

Abstract

A multicomputer system consisting of hundreds of processors interconnected by point-to-point links can achieve high performance for many important applications. We propose a new application-transparent, process-level, distributed error recovery scheme for multicomputers. Checkpointing is initiated by timers at intervals determined by the needs of the application. Checkpointing and recovery involve only as much of the system as is necessary: a set of interacting processes. Processes which are not part of the interacting set do not participate in checkpointing or recovery and continue to do useful work. Several checkpoint and/or recovery sessions may be active simultaneously. The scheme does not require significant overhead during normal operation since it is not necessary to make message transmission atomic, acknowledge each message, or transmit check bits with each packet. We discuss variations of our technique using packet-switching or virtual circuits, and compare our scheme to previously published techniques.
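The idea that checkpointing and recovery involve only "a set of interacting processes" can be pictured with a small sketch. The abstract does not say how that set is determined, so the rule used here is purely an assumption for illustration: a process joins the set if it is transitively linked to the initiating process by messages exchanged since the last checkpoint. The function name `interacting_set` and the process labels are hypothetical and do not come from the paper.

```python
from collections import deque


def interacting_set(initiator, comm_since_checkpoint):
    """Return the processes transitively linked to `initiator`.

    `comm_since_checkpoint` is an iterable of (sender, receiver) pairs.
    ASSUMPTION: membership is defined by message exchange since the last
    checkpoint; the paper's actual rule is not given in the abstract.
    """
    # Build an undirected adjacency map from the message pairs.
    neighbors = {}
    for sender, receiver in comm_since_checkpoint:
        neighbors.setdefault(sender, set()).add(receiver)
        neighbors.setdefault(receiver, set()).add(sender)

    # Breadth-first search outward from the initiating process.
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        current = queue.popleft()
        for peer in neighbors.get(current, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# Hypothetical message history: P0-P1-P2 interact, P3-P4 interact separately.
messages = [("P0", "P1"), ("P1", "P2"), ("P3", "P4")]
participants = interacting_set("P0", messages)
```

In this sketch a checkpoint started at `P0` would involve `P0`, `P1`, and `P2`, while `P3` and `P4` keep running undisturbed, which matches the abstract's claim that uninvolved processes continue to do useful work.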

Similar resources

Execution-Driven Simulation of Error Recovery Techniques for Multicomputers

DERT (Distributed Error Recovery Testbed) is a testbed for simulation and performance evaluation of several classes of application-transparent distributed error recovery schemes. DERT is built on top of an event-driven, message-passing, object-oriented, multithreaded simulation kernel. Actual compiled distributed applications are instrumented for data collection and executed on the simulated mu...

Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determin...

Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback

Backward error recovery, based on checkpointing and rollback, is often used to implement fault tolerance in multicomputer systems. During failure-free operation, the process states are regularly saved, and after a fault is detected, the system is rolled back to a previously saved state. We can distinguish four classes of techniques: semi-automatic techniques, message logging, coordinated ch...

A Software-Based Hardware Fault Tolerance Scheme for Multicomputers

A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by software with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logica...

Recovery in Multicomputers with Finite Error Detection Latency

P. Krishna, N. H. Vaidya, D. K. Pradhan. Computer Science Department, Texas A&M University, College Station, TX 77843-3112. Abstract: In most research on checkpointing and recovery, it has been assumed that the processor halts immediately in response to any internal failure (fail-stop model). This paper presents a recovery scheme (independent checkpointing and message logging) for a multicomputer syst...

Publication date: 1989
