High performance sparse multifrontal solvers on modern GPUs
Abstract
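We have ported the numerical factorization and triangular solve phases of the sparse direct solver STRUMPACK to GPU. STRUMPACK implements sparse LU factorization using the multifrontal algorithm, which performs most of its operations in dense linear algebra on so-called frontal matrices of various sizes. Our GPU implementation off-loads these dense linear algebra operations, as well as the sparse scatter–gather operations between frontal matrices. For the larger frontal matrices, our implementation relies on vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs and rocBLAS and rocSOLVER for AMD GPUs. For the smaller frontal matrices we developed custom CUDA and HIP kernels to reduce kernel launch overhead. Overall, high performance is achieved by identifying submatrix factorizations, corresponding to sub-trees of the assembly tree, that fit entirely in GPU memory. The multi-GPU setting uses SLATE (Software for Linear Algebra Targeting Exascale) as a modern GPU-aware replacement for ScaLAPACK. On 4 nodes of SUMMIT the code runs about 10× faster when using all 24 V100 GPUs compared to when it uses only the 168 POWER9 cores. On 8 SUMMIT nodes, using 48 V100 GPUs, the solver reaches over 50 TFlop/s. Compared to SuperLU, on a single V100, for a set of 17 matrices our implementation is faster for all but one matrix, and is on average 5× (median …) faster.
• Ported the sparse direct solver STRUMPACK to GPU.
• GPU off-loading of dense linear algebra and scatter–gather operations.
• Relying on NVIDIA and AMD vendor libraries.
• Multi-GPU setting using SLATE.
• On average 5x faster than SuperLU on a single V100.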
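The two building blocks the abstract names, dense factorization of a frontal matrix and scatter–gather between frontal matrices, can be illustrated with a small CPU-only sketch. This is not STRUMPACK's code: the function names `partial_factor` and `extend_add`, the list-of-lists storage, and the absence of pivoting are all assumptions made to keep the example short.

```python
def partial_factor(front, k):
    """Eliminate the first k pivots of a dense front in place.

    Plain LU without pivoting (an assumption for brevity). Returns the
    trailing block, which then holds the Schur complement, i.e. the
    contribution that must be passed to the parent front.
    """
    n = len(front)
    for p in range(k):
        pivot = front[p][p]
        for i in range(p + 1, n):
            front[i][p] /= pivot  # multiplier, stored as a column of L
            for j in range(p + 1, n):
                front[i][j] -= front[i][p] * front[p][j]
    return [row[k:] for row in front[k:]]


def extend_add(parent, parent_idx, update, update_idx):
    """Scatter-add a child's contribution block into the parent front.

    parent_idx and update_idx list the global row/column indices that
    each dense block covers.
    """
    pos = {g: loc for loc, g in enumerate(parent_idx)}
    for i, gi in enumerate(update_idx):
        for j, gj in enumerate(update_idx):
            parent[pos[gi]][pos[gj]] += update[i][j]


# Child front over global indices [0, 2, 3]; only index 0 is eliminated here.
child = [[4.0, 2.0, 2.0],
         [2.0, 5.0, 1.0],
         [2.0, 1.0, 6.0]]
update = partial_factor(child, 1)

# Parent front over global indices [2, 3], before receiving the update.
parent_idx = [2, 3]
parent = [[1.0, 0.0],
          [0.0, 1.0]]
extend_add(parent, parent_idx, update, [2, 3])
```

In a GPU setting such as the one described above, the inner loops of `partial_factor` would typically correspond to dense library calls or custom kernels, and `extend_add` to an indexed scatter kernel; the sketch only shows the data flow between a child and its parent in the assembly tree.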
Similar resources
Breakthroughs in Sparse Solvers for GPUs
The CUDA Center of Excellence (CCOE) at UTK targets the development of innovative algorithms and technologies to tackle challenges in Heterogeneous High Performance Computing. Over the last year, the CCOE at UTK developed CUDA-based breakthrough technologies in sparse solvers for GPUs. Here, we describe the main ones – a sparse iterative solvers package, a communication-avoiding (CA) sparse ite...
Sparse Approximate Inverse Preconditioners for Iterative Solvers on GPUs
For the solution of large systems of linear equations, iterative solvers with preconditioners are typically employed. However, the design of preconditioners for the black-box case, in which no additional information about the underlying problem is known, is very difficult. The most commonly employed method of incomplete LU factorizations is a serial algorithm and thus not well suited for the ma...
Block Low-Rank (BLR) approximations to improve multifrontal sparse solvers
Matrices coming from elliptic Partial Differential Equations (PDEs) have been shown to have a low-rank property: well-defined off-diagonal blocks of their Schur complements can be approximated by low-rank products. In the multifrontal context, this can be exploited within the fronts in order to obtain a substantial reduction of the memory requirement and an efficient way to perform many of the b...
Preconditioned Krylov solvers on GPUs
In this paper, we study the effect of enhancing GPU-accelerated Krylov solvers with preconditioners. We consider the BiCGSTAB, CGS, QMR, and IDR(s) Krylov solvers. For a large set of test matrices, we assess the impact of Jacobi and incomplete factorization preconditioning on the solvers' numerical stability and time-to-solution performance. We also analyze how the use of a preconditioner imp...
Journal
Journal title: Parallel Computing
Year: 2022
ISSN: 1872-7336, 0167-8191
DOI: https://doi.org/10.1016/j.parco.2022.102897