High performance sparse multifrontal solvers on modern GPUs

Authors

Abstract

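We have ported the numerical factorization and triangular solve phases of the sparse direct solver STRUMPACK to GPU. STRUMPACK implements sparse LU factorization using the multifrontal algorithm, which performs most of its operations in dense linear algebra on so-called frontal matrices of various sizes. Our GPU implementation off-loads these dense linear algebra operations, as well as the sparse scatter–gather operations between frontal matrices. For the larger frontal matrices, our implementation relies on vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs and rocBLAS and rocSOLVER for AMD GPUs. For the smaller frontal matrices we developed custom CUDA and HIP kernels to reduce kernel launch overhead. Overall, high performance is achieved by identifying submatrix factorizations, corresponding to sub-trees of the multifrontal assembly tree, that fit entirely in GPU memory. The multi-GPU setting uses SLATE (Software for Linear Algebra Targeting Exascale) as a modern GPU-aware replacement for ScaLAPACK. On 4 nodes of SUMMIT the code runs about 10× faster when using all 24 V100 GPUs compared to when it uses only the 168 POWER9 cores. On 8 SUMMIT nodes, using 48 V100 GPUs, the sparse solver reaches over 50 TFlop/s. Compared to SuperLU on a single V100, for a set of 17 matrices our implementation is faster for all but one matrix, and is on average 5× faster (median …).

• Ported the sparse direct solver STRUMPACK to GPU.
• Off-loading of the multifrontal LU factorization and triangular solve to GPU.
• Relying on Nvidia and AMD vendor libraries as well as custom CUDA and HIP kernels.
• Multi-GPU runs are about 10× faster than CPU-only runs on 4 SUMMIT nodes.
• On average 5× faster than SuperLU on a single V100.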
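As a rough illustration of the multifrontal idea described in the abstract (a dense partial factorization of each frontal matrix, followed by a scatter of its Schur complement into the parent front), here is a minimal pure-Python sketch. It is not STRUMPACK code: the function names `partial_factor`, `extend_add` and `demo`, the 3×3 test matrix, and the absence of pivoting are all assumptions made for illustration, and everything runs on the CPU.

```python
# Illustrative sketch only; all names here are hypothetical, not STRUMPACK API.

def partial_factor(front, k):
    """Eliminate the first k pivots of a dense square front in place.

    No pivoting is done (an assumption that keeps the sketch short).
    Returns the trailing (n-k) x (n-k) Schur complement, i.e. the
    contribution block that is passed up to the parent front.
    """
    n = len(front)
    for p in range(k):
        pivot = front[p][p]
        for i in range(p + 1, n):
            front[i][p] /= pivot                   # entry of L
            l_ip = front[i][p]
            for j in range(p + 1, n):
                front[i][j] -= l_ip * front[p][j]  # rank-1 update
    return [row[k:] for row in front[k:]]


def extend_add(parent, parent_idx, child_schur, child_idx):
    """Scatter-add a child's contribution block into the parent front.

    parent_idx and child_idx hold the global row/column indices that the
    rows of the parent front and of the child's Schur complement refer to.
    """
    pos = {g: i for i, g in enumerate(parent_idx)}
    for a, ga in enumerate(child_idx):
        for b, gb in enumerate(child_idx):
            parent[pos[ga]][pos[gb]] += child_schur[a][b]


def demo():
    """Factor A = [[4, 0, 2], [0, 5, 1], [2, 1, 6]] front by front.

    Unknowns 0 and 1 are leaves of the assembly tree; unknown 2 is the root.
    Returns the last pivot of the LU factorization.
    """
    front0 = [[4.0, 2.0], [2.0, 0.0]]   # global indices [0, 2]
    front1 = [[5.0, 1.0], [1.0, 0.0]]   # global indices [1, 2]
    root = [[6.0]]                      # global indices [2]

    schur0 = partial_factor(front0, 1)
    schur1 = partial_factor(front1, 1)
    extend_add(root, [2], schur0, [2])
    extend_add(root, [2], schur1, [2])
    partial_factor(root, 1)
    return root[0][0]
```

Calling `demo()` gives the last pivot of the factorization, which for this matrix equals 6 − 2·2/4 − 1·1/5 = 4.8. In a GPU implementation of the kind the abstract describes, the inner loops of `partial_factor` would be replaced by dense library calls or custom kernels, and `extend_add` corresponds to the scatter–gather step between frontal matrices.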


Similar resources

Breakthroughs in Sparse Solvers for GPUs

The CUDA Center of Excellence (CCOE) at UTK targets the development of innovative algorithms and technologies to tackle challenges in Heterogeneous High Performance Computing. Over the last year, the CCOE at UTK developed CUDA-based breakthrough technologies in sparse solvers for GPUs. Here, we describe the main ones – a sparse iterative solvers package, a communication-avoiding (CA) sparse ite...


Sparse Approximate Inverse Preconditioners for Iterative Solvers on GPUs

For the solution of large systems of linear equations, iterative solvers with preconditioners are typically employed. However, the design of preconditioners for the black-box case, in which no additional information about the underlying problem is known, is very difficult. The most commonly employed method of incomplete LU factorizations is a serial algorithm and thus not well suited for the ma...


Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers

Matrices coming from elliptic Partial Differential Equations (PDEs) have been shown to have a low-rank property: well-defined off-diagonal blocks of their Schur complements can be approximated by low-rank products. In the multifrontal context, this can be exploited within the fronts in order to obtain a substantial reduction of the memory requirement and an efficient way to perform many of the b...


Performance of PDE Sparse Solvers on Hypercubes


Preconditioned Krylov solvers on GPUs

In this paper, we study the effect of enhancing GPU-accelerated Krylov solvers with preconditioners. We consider the BiCGSTAB, CGS, QMR, and IDR(s) Krylov solvers. For a large set of test matrices, we assess the impact of Jacobi and incomplete factorization preconditioning on the solvers’ numerical stability and time-to-solution performance. We also analyze how the use of a preconditioner imp...



Journal

Journal title: Parallel Computing

Year: 2022

ISSN: 1872-7336, 0167-8191

DOI: https://doi.org/10.1016/j.parco.2022.102897