Optimizing OpenMP Parallelized DGEMM Calls on SGI Altix 3700
Authors
Abstract
Using functions from parallelized mathematical libraries is a common way to accelerate numerical applications. Computer architectures with shared memory characteristics support different approaches to implementing such libraries, usually OpenMP or MPI. This paper compares the performance of DGEMM calls (double-precision floating-point matrix multiplication) across different OpenMP-parallelized numerical libraries, namely Intel MKL and SGI SCSL, and shows how they can be optimized. Additionally, we look at the memory placement policy and give hints for initializing data. Our attention is focused on an SGI Altix 3700 Bx2 system, using BenchIT [1] as a very convenient performance measurement suite for the examinations.

1 Measurement Environment

For a detailed analysis of a system architecture by parameter studies, the choice of a suitable measuring framework is an important decision. To benchmark the DGEMM calls we use BenchIT. This performance measurement suite helps to compare different algorithms, implementations of algorithms, features of the software stack, and hardware details of whole systems. It has been designed to run many microbenchmarks on every POSIX 1003.1-compatible system in a very user-friendly way. BenchIT has been developed at the Center for Information Services and High Performance Computing (ZIH) at the Technische Universität Dresden and was previously described in [2–4]. Sources and results are freely available at [1].

2 The SGI Altix 3700 Bx2 System

2.1 System Architecture

The SGI [5] Altix 3700 Bx2 is a ccNUMA shared memory system based on Intel Itanium 2 processors and SGI’s scalable node architecture SN2. In developing this, special attention has been paid to building a highly scalable computer
Similar resources
Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers
This paper describes performance of OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. The OSCAR compiler hierarchically exploits the coarse grain task parallelism among loops, subroutines and basic blocks and the near fine grain parallelism among statements inside a basic block in addition to t...
An evaluation of Itanium 2-based high-end servers
We report the results of an evaluation project on four Itanium 2-based high-end servers: the Bull NovaScale 5160, the NEC TX7, and the SGI Altix 3700. For the evaluation the EuroBen benchmarks were used. Single-CPU, OpenMP, and MPI codes were run on each of the systems on a maximum of 16 processors. All codes were run in single-user mode in order to obtain a clear view on the architectural char...
High Performance FFT on SGI Altix 3700
We have developed a high-performance FFT on SGI Altix 3700, improving the efficiency of the floating-point operations required to compute FFT by using a kind of loop fusion technique. As a result, we achieved a performance of 4.94 Gflops at 1-D FFT of length 4096 with an Itanium 2 1.3 GHz (95% of peak), and a performance of 28 Gflops at 2-D FFT of 4096 with 32 processors. Our FFT kernel outperf...
High performance computing using MPI and OpenMP on multi-core parallel systems
The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems—distributed memory across nodes and shared memory with non-uniform memory access within each node—poses a challenge to application developers. In this paper, we study a hybrid approach to programm...
Interconnect Performance Evaluation of SGI Altix 3700, Cray X1, Cray Opteron, and Dell PowerEdge
We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare between these interconnects. We used several benchmarks to examine network behavior under different communication patterns and numbe...
Publication date: 2006