User-Level Socket-Based Checkpointing for Distributed and Parallel Computation
Authors
Abstract
We present a preliminary description of DMTCP, a user-level checkpointing package for Linux. Its socket-based approach is a novel method for checkpointing distributed processes. It covers any dynamically created POSIX threads and forked child processes, as well as processes spawned remotely via ssh and other mechanisms. As with all user-level checkpointing, neither the kernel nor the application code needs to be modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and certain other types of file descriptors. Each checkpointed process has an associated checkpoint file. Hence process migration, and even migration of an entire computation to a new cluster, is achieved simply by copying the checkpoint files to a new host. Process migration does, however, add the restriction that the source and destination hosts must be homogeneous.
Similar resources
DMTCP: Scalable User-Level Transparent Checkpointing for Cluster Computations
As the size of clusters increases, failures are becoming increasingly frequent. Applications must become fault tolerant if they are to run for extended periods of time. We present DMTCP (Distributed MultiThreaded CheckPointing), the first user-level distributed checkpointing package not dependent on a specific message passing library. This contrasts with existing approaches either specific to l...
Application-Level Checkpointing Techniques for Parallel Programs
In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, e.g. the computation state is saved to a filesystem. The checkpoint files can then be used to resume computation upon failure of the original process(es), hopefully with minimal loss of computing work. A checkpoint can be taken using a variety of techniques in every l...
Automatic Parallel Program Checkpointing in Message-Passing Environments
The problem of efficient cluster resource usage is very important because of the high demand for parallel computations. Checkpointing allows cluster computing time to be managed more efficiently. In this article, the problems of checkpointing parallel programs are discussed and an implementation of an automatic parallel checkpointing system for MPI programs is presented. It is based on simple user-space portable chec...
Application Level Fault Tolerance in Heterogenous Networks of Workstations
We have explored methods for checkpointing and restarting processes within the Distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implem...
Asynchronous Checkpointing for PVM Requires Message-Logging
Distributed computing using networked workstations offers cost-efficient parallel computing, but the higher rate of failure requires effective fault-tolerance. Asynchronous consistent checkpointing offers a low-overhead solution. Parallel Virtual Machine (PVM) allows a heterogeneous network of UNIX workstations to serve immediately as a distributed computer by providing message-passing services imp...
Journal title:
- CoRR
Volume abs/cs/0701037 Issue
Pages -
Publication date 2007