Optimizing Charm++ over MPI
Authors
Abstract
Charm++ may employ any of a myriad of network-specific APIs for handling communication, which are usually promoted as being faster than its catch-all MPI module. Such a performance difference not only causes development effort to be spent on tuning vendor-specific APIs but also discourages hybrid Charm++/MPI applications. We investigate this disparity across several machines and applications, ranging from small InfiniBand clusters to Blue Gene/Q supercomputers and from synthetic benchmarks to large-scale biochemistry codes. We demonstrate the use of one feature from the recent MPI-3 standard to bridge this gap where applicable, and we discuss what can be done today.
Similar resources
Charm++ & MPI: Combining the Best of Both Worlds
MPI and Charm++ embody two distinct perspectives for writing parallel programs. While MPI provides a processor-centric, user-driven model for developing parallel codes, Charm++ supports work-centric, overdecomposition-based, system-driven parallel programming. One or the other can be the best or most natural fit for distinct modules that constitute a parallel application. In this paper, we prese...
Optimizing Point-to-Point Communication between Adaptive MPI Endpoints in Shared Memory
Correspondence *Sam White, Email: [email protected] Abstract Adaptive MPI is an implementation of the MPI standard that supports the virtualization of ranks as user-level threads, rather than OS processes. In this work, we optimize the communication performance of AMPI based on the locality of the endpoints communicating within a cluster of SMP nodes. We differentiate between point-to-point mes...
Object-Based Adaptive Load Balancing for MPI Programs
Parallel Computational Science and Engineering (CSE) applications often exhibit irregular structure and dynamic load patterns. Many such applications have been developed using procedural languages (e.g. Fortran) in the message passing parallel programming paradigm (e.g. MPI) for distributed memory machines. Incorporating dynamic load balancing techniques at the application level involves significan...
Handling Transient and Persistent Imbalance Together in Distributed and Shared Memory
The recent trend of rapid increase in the number of cores per chip has resulted in a vast amount of on-node parallelism. Not only is the number of cores per node increasing substantially, but the cores are also becoming heterogeneous. The high variability in the performance of hardware components introduces imbalance due to heterogeneity. Applications are also becoming more complex, resultin...
Optimizing MPI Collectives for X1
Traditionally, MPI collective operations have been based on point-to-point messages, with possible optimizations for system topologies and communication protocols. The Cray X1 scatter/gather hardware and shared memory mapping features allow for significantly different approaches to MPI collectives, leading to substantial performance gains over standard methods, especially for short message length...
Publication date: 2013