Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols

Authors

  • Darius Buntinas
  • Camille Coti
  • Thomas Hérault
  • Pierre Lemarinier
  • Laurence Pilard
  • Ala Rezmerita
  • Eric Rodriguez
  • Franck Cappello
Abstract

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalabilities remain unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalabilities. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks. © 2007 Elsevier B.V. All rights reserved.

Similar resources

Blocking and non-blocking coordinated checkpointing for large scale MPI computation

Nowadays, clusters and grids are made of more and more computing nodes. The programming of multi-process applications is most often achieved through message passing. The increase in the number of processes implies that these applications need to use a fault tolerant message passing library. In this paper, we present two implementations of fault tolerant protocols based on MPICH, a blocki...

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and, as a result, the workload of Mobile ...

Coordinated Checkpoint versus Message Log for Fault Tolerant MPI

MPI is one of the most adopted programming models for Large Clusters and Grid deployments. However, these systems often suffer from network or node failures. This raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent ones are based on either coordinated checkpointing or message logging associated with uncoordinated checkpointing. There are many protocols, imple...

Automatic Fault-Tolerant MPI

High performance computing platforms such as Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applicati...

Real Time Snapshot Collection Algorithm for Mobile Distributed Systems with Minimum Number of Checkpoints

Checkpointing is an efficient way of implementing fault tolerance in distributed systems. Mobile computing raises many new issues, such as high mobility, lack of stable storage on mobile hosts (MHs), low bandwidth of wireless channels, limited battery life and disconnections that make the traditional checkpointing protocols unsuitable for such systems. Minimum process non-blocking coordinated c...

Journal title:
  • Future Generation Comp. Syst.

Volume 24

Publication date: 2008