Efficient, Language-based Checkpointing for Massively Parallel Programs

Authors

  • Sanjeev Krishnan
  • Laxmikant V. Kale
Abstract

Checkpointing and restart is an approach to ensuring forward progress of a program in spite of system failures or planned interruptions. We investigate issues in checkpointing and restart of programs running on massively parallel computers. We identify a new set of issues that have to be considered for the MPP platform and, based on these, have designed an approach that relies on the language and run-time system. Hence our checkpointing facility can be used on virtually any parallel machine in a portable manner, irrespective of whether the operating system supports checkpointing. We present methods to make checkpointing and restart space- and time-efficient, including object-specific functions that save the state of an object. We present techniques to automatically generate checkpointing code for parallel objects, without programmer intervention. We also present mechanisms that allow the programmer to easily and selectively incorporate application-specific knowledge to make the checkpointing more efficient. The techniques developed here have been implemented in the Charm++ parallel object-oriented programming language and run-time system. Performance results are presented for the checkpointing overhead of programs running on parallel machines.
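The idea of an object-specific function that saves only the state an object really needs can be illustrated with a minimal sketch. This is plain C++ written for illustration, not the Charm++ interface or the authors' generated code; the class `Grid`, its members, and the method names `checkpoint` and `restore` are all hypothetical. It assumes the checkpoint is restored on a machine with the same data layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <vector>

// Hypothetical parallel object; names and layout are illustrative only.
class Grid {
public:
    explicit Grid(std::size_t n = 0) : cells_(n, 0.0), scratch_(n, 0.0) {}

    void set(std::size_t i, double v) { cells_[i] = v; }
    double get(std::size_t i) const { return cells_[i]; }
    std::size_t size() const { return cells_.size(); }

    // Object-specific save: writes only the essential state (cells_).
    // scratch_ is a temporary buffer that can be rebuilt, so it is skipped,
    // which keeps the checkpoint smaller than a full memory image.
    void checkpoint(std::ostream& out) const {
        std::uint64_t n = cells_.size();
        out.write(reinterpret_cast<const char*>(&n), sizeof n);
        out.write(reinterpret_cast<const char*>(cells_.data()),
                  static_cast<std::streamsize>(n * sizeof(double)));
    }

    // Restart: rebuilds the object from the saved bytes and
    // re-creates the scratch buffer instead of reading it back.
    void restore(std::istream& in) {
        std::uint64_t n = 0;
        in.read(reinterpret_cast<char*>(&n), sizeof n);
        cells_.assign(static_cast<std::size_t>(n), 0.0);
        in.read(reinterpret_cast<char*>(cells_.data()),
                static_cast<std::streamsize>(n * sizeof(double)));
        scratch_.assign(cells_.size(), 0.0);
    }

private:
    std::vector<double> cells_;    // essential state, checkpointed
    std::vector<double> scratch_;  // recomputable, not checkpointed
};
```

In this sketch a caller would write `grid.checkpoint(stream)` before a planned interruption and `grid.restore(stream)` on restart. Leaving out the scratch buffer is a stand-in for the kind of application-specific knowledge the abstract mentions.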

Similar resources

ELMO: extending (sequential) languages with migratable objects-compiler support

Efficient task migration is an important feature in parallel and distributed programs, in particular to support checkpointing and recovery for fault tolerance. It is also very useful in distributed environments like networks of workstations where external loads are often unpredictable and dynamic in nature. We propose simple language extensions (ELMO) to existing sequential programming languages ...

An Object-oriented Implementation Model for the Promoter Language Technical Report

The PROMOTER programming language is designated for data parallel applications that are to run on massively parallel computers with distributed memory. This paper presents an object-oriented implementation model for the PROMOTER language. An object-oriented approach to compile data-parallel programs to message passing programs can reduce design complexity, facilitate reuse of components, and ea...

Project Triton: towards Improved Programmability of Parallel Computers Compilation Techniques. Triton/1 Parallel Architecture

This paper appeared in: The main objective of Project Triton is adequate programmability of massively parallel computers. This goal can be achieved by tightly coupling the design of programming languages and parallel hardware. The approach taken in Project Triton is to let high-level, machine-independent parallel programming languages drive the design of parallel hardware. This approach perm...

Transformation Based Development of Efficient Programs for Massively Parallel Architectures

This paper presents a methodology that is used to detect predefined algorithmic structures (skeletons) in a fine granular program specification. For each skeleton the best mapping on a particular massively parallel system is known. The skeleton identification process helps in making good mapping decisions.

Automatic Parallel Program Checkpointing in Message-Passing Environments

The problem of efficient cluster resource usage is very important because of the high demand for parallel computations. Checkpointing allows cluster computing time to be managed more efficiently. In this article, checkpointing problems for parallel programs are discussed and an implementation of automatic parallel checkpointing systems for MPI programs is presented. It is based on simple user-space portable chec...

Publication date: 2007