Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Authors

  • Kyungjoo Kim
  • Sivasankaran Rajamanickam
  • George Stelle
  • H. Carter Edwards
  • Stephen Olivier

Abstract

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that uses a 2D sparse partitioned-block layout of the matrix. Using this block layout, our factorization follows the algorithms-by-blocks approach, which induces a task graph for the factorization: the tasks are related to one another through the data dependences of the factorization algorithm. To execute these tasks portably on various manycore architectures, we also present a portable tasking API that incorporates different tasking backends and device-specific features through Kokkos, an open-source framework for manycore platforms. We present a performance evaluation on both Intel Sandy Bridge and Xeon Phi platforms, using matrices from the University of Florida sparse matrix collection, to illustrate the merits of the proposed task-based factorization. Experimental results show the following for sparse matrices arising from various application problems. Using 56 threads on the Intel Xeon Phi processor, our task-parallel implementation delivers a speedup of about 26.6x (geometric mean) over single-threaded incomplete Cholesky-by-blocks. It also delivers a speedup of 19.2x over a serial Cholesky implementation that carries no tasking overhead.

Keywords: Sparse Factorization, Algorithm-by-blocks, 2D Layout, Task Parallelism

This technical report is a preprint of a paper intended for publication in a journal or proceedings. Since changes may be made before publication, this preprint is made available with the understanding that anyone wanting to cite or reproduce it ascertains that no published version in a journal or proceedings exists.

The code described in this paper is publicly available at https://github.com/trilinos/Trilinos/tree/master/packages/shylu/tacho.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy under contract DE-AC04-94-AL85000.

arXiv:1601.05871v1 [cs.MS] 22 Jan 2016
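
The following minimal pure-Python sketch makes the algorithm-by-blocks idea concrete. It is not the authors' implementation, which is the C++/Kokkos code at the URL above. It assumes dense square blocks of one fixed size and a block-level sparsity pattern, given as a set of stored lower-triangular block coordinates. Incompleteness is approximated by dropping any update whose target block lies outside that pattern, whereas the paper's layout stores sparse blocks. Each CHOL, TRSM or UPDATE (GEMM/SYRK) task records the blocks it reads and writes. The task graph is derived from those data dependences, and the tasks are then run wavefront by wavefront. All function names here are hypothetical, and no Kokkos tasking API is modeled.

```python
"""Hypothetical sketch: Cholesky-by-blocks on a 2D block layout driven by a task DAG.
Assumes dense blocks and block-level dropping; not the Tacho/Kokkos implementation."""
import math
from collections import defaultdict

def chol(B):
    """In place: overwrite diagonal block B with lower-triangular L, B = L L^T."""
    n = len(B)
    for j in range(n):
        B[j][j] = math.sqrt(B[j][j] - sum(B[j][p] ** 2 for p in range(j)))
        for i in range(j + 1, n):
            B[i][j] = (B[i][j] - sum(B[i][p] * B[j][p] for p in range(j))) / B[j][j]
        for i in range(j):
            B[i][j] = 0.0  # clear the strict upper triangle

def trsm(L, B):
    """In place: B <- B L^{-T}, solving X L^T = B row by row (forward substitution)."""
    for row in B:
        for j in range(len(L)):
            row[j] = (row[j] - sum(row[p] * L[j][p] for p in range(j))) / L[j][j]

def update(C, A, B):
    """In place: C <- C - A B^T (a GEMM; a SYRK when A and B are the same block)."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][p] * B[j][p] for p in range(len(A[0])))

def build_tasks(nb, pattern):
    """Record tasks in sequential order as (kind, output block, input blocks).
    `pattern` (assumed) holds stored blocks (i, j), i >= j, incl. every diagonal
    block; updates targeting a block outside it are dropped (block-level IC)."""
    tasks = []
    for k in range(nb):
        tasks.append(("CHOL", (k, k), []))
        for i in range(k + 1, nb):
            if (i, k) in pattern:
                tasks.append(("TRSM", (i, k), [(k, k)]))
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                if {(i, k), (j, k), (i, j)} <= pattern:
                    tasks.append(("UPDATE", (i, j), [(i, k), (j, k)]))
    return tasks

def dependences(tasks):
    """Derive DAG edges from block accesses: RAW, WAW and WAR hazards."""
    last_writer, readers, deps = {}, defaultdict(list), []
    for t, (_, out, ins) in enumerate(tasks):
        d = {last_writer[b] for b in ins + [out] if b in last_writer}  # RAW, WAW
        d.update(readers[out])                                         # WAR
        deps.append(d)
        for b in ins:
            readers[b].append(t)
        last_writer[out], readers[out] = t, []
    return deps

def levels(deps):
    """Wavefront index of each task: 1 + the maximum level of its predecessors."""
    lvl = []
    for d in deps:
        lvl.append(1 + max((lvl[p] for p in d), default=-1))
    return lvl

def factorize(A, bs, pattern):
    """Factor SPD A (list of lists, size divisible by bs) by running the task DAG."""
    nb = len(A) // bs
    blocks = {(i, j): [row[j * bs:(j + 1) * bs] for row in A[i * bs:(i + 1) * bs]]
              for (i, j) in pattern}
    tasks = build_tasks(nb, pattern)
    lvl = levels(dependences(tasks))
    # Any DAG-respecting order is valid; tasks sharing a level are independent
    # and could be dispatched concurrently by a tasking runtime (run serially here).
    for t in sorted(range(len(tasks)), key=lvl.__getitem__):
        kind, out, ins = tasks[t]
        if kind == "CHOL":
            chol(blocks[out])
        elif kind == "TRSM":
            trsm(blocks[ins[0]], blocks[out])
        else:
            update(blocks[out], blocks[ins[0]], blocks[ins[1]])
    return blocks, tasks, lvl

def assemble(blocks, n, bs):
    """Scatter the factor blocks back into a dense lower-triangular n x n L."""
    L = [[0.0] * n for _ in range(n)]
    for (i, j), B in blocks.items():
        for r in range(bs):
            for c in range(bs):
                L[i * bs + r][j * bs + c] = B[r][c]
    return L
```

Consider a 6x6 matrix with 7 on the diagonal and -1 elsewhere, split into 2x2 blocks. With the full lower block pattern, the sketch records 10 tasks in 7 wavefronts and reproduces the exact Cholesky factor. Removing block (2, 1) from the pattern drops that fill, leaving 7 tasks in 4 wavefronts. In that case L L^T still matches A on every retained block. This illustrates how block-level sparsity both limits fill and exposes more task parallelism.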

Similar resources

Task Scheduling using Block Dependency DAG of Block-Oriented Sparse Cholesky Factorization

The block-oriented sparse Cholesky factorization decomposes a sparse matrix into rectangular sub-blocks, and handles each block as a computational unit in order to increase data reuse in a hierarchical memory system. As well, the factorization method increases the degree of concurrency with the reduction of communication volumes so that it performs more efficiently on a distributed-memory multipr...

Iterative Solver Based on Incomplete Cholesky Preconditioner for the Parallelisation of a Forging Simulation Software by Mesh Partitioning

A parallel implementation of the preconditioned conjugate residual (PCR) method is described. The method is used to solve the discretized generalised Stokes problem derived from the simulation of complex 2D axisymmetrical hot metal forming process. We numerically show that the use of a simple diagonal block (BDS) preconditioner results in a high level of parallelism. Nevertheless in 2D, the iter...

Tiling, Block Data Layout, and Memory Hierarchy Performance

Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and TLB performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard...

Scalability Analysis of Parallel MIC(0) Preconditioning Algorithm for 3D Elliptic Problems

Novel parallel algorithms for the solution of large FEM linear systems arising from second order elliptic partial differential equations in 3D are presented. The problem is discretized by rotated trilinear nonconforming Rannacher–Turek finite elements. The resulting symmetric positive definite system of equations Ax = f is solved by the preconditioned conjugate gradient algorithm. The precondit...

Parallel Subdomain-based Preconditioner for Non-overlapping Domain Decomposition Methods

We present a new parallelizable preconditioner that is used as the local component for a two-level preconditioner similar to BPS. On 2D model problems that exhibit either high anisotropy or discontinuity, we demonstrate its attractive numerical behaviour and compare it to the regular BPS. Finally, to alleviate the construction cost of this new preconditioner, that requires the explicit computat...

Journal title:
  • CoRR

Volume: abs/1601.05871

Publication date: 2016