Performance Degradation Analysis of GPU Kernels

Authors

  • Jinpeng Lv
  • Guodong Li
  • Alan Humphrey
  • Ganesh Gopalakrishnan
Abstract

Hardware accelerators (currently Graphics Processing Units, or GPUs) are an important component in many existing high-performance computing solutions [5]. Their variety and usage are expected to grow rapidly [1] for many reasons. First, GPUs offer impressive energy efficiency [3]. Second, when properly programmed, they yield impressive speedups by allowing programmers to structure their computation around many fine-grained threads whose focus can be rapidly switched during memory stalls. Unfortunately, achieving high memory-access efficiency requires well-developed computational thinking to decompose a problem domain appropriately. Our work currently addresses the needs of the CUDA [5] approach to programming GPUs. Two important classes of CUDA performance rules are bank-conflict avoidance rules, which pertain to CUDA shared memory, and coalesced-access rules, which pertain to global memory. The former require programmers to generate memory addresses from consecutive threads that fall within separate shared-memory banks. The latter require programmers to generate memory addresses that permit coalesced fetches from global memory. In previous work [6], we had, to some extent, addressed the former through SMT-based methods. Several other efforts also address bank conflicts [7, 8, 4]. In this work, we address the latter requirement: detecting when coalesced-access rules are being violated.
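The two rule classes above can be made concrete with a small address-level check. The following is an illustrative sketch, not part of the paper's tooling: given the word addresses issued by the threads of a warp, it flags shared-memory bank conflicts and counts the global-memory transaction segments a warp would touch. The bank count (32), segment size (128 bytes), and word size (4 bytes) are assumptions matching common NVIDIA hardware parameters; shared and global memory are of course separate address spaces, and a single address list is reused here purely for illustration.

```python
NUM_BANKS = 32       # assumed shared-memory bank count (common on NVIDIA GPUs)
SEGMENT_BYTES = 128  # assumed global-memory transaction segment size
WORD_BYTES = 4       # 32-bit words

def has_bank_conflict(addresses):
    """True if two threads access different words that map to the same bank.

    Accesses to the *same* word in a bank are a broadcast, not a conflict.
    """
    bank_to_word = {}
    for addr in addresses:
        word = addr // WORD_BYTES
        bank = word % NUM_BANKS
        if bank in bank_to_word and bank_to_word[bank] != word:
            return True
        bank_to_word[bank] = word
    return False

def num_memory_transactions(addresses):
    """Count distinct 128-byte segments touched: 1 means fully coalesced."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

# A warp of 32 threads reading consecutive 4-byte words: no bank conflict,
# and all accesses fall in one 128-byte segment (fully coalesced).
unit_stride = [4 * t for t in range(32)]

# Stride-32 word accesses: every thread maps to bank 0 (a 32-way conflict),
# and the warp touches 32 separate segments (fully uncoalesced).
strided = [4 * 32 * t for t in range(32)]
```

In this model, the bank-conflict rule asks that consecutive threads hit distinct banks, while the coalescing rule asks that a warp's addresses collapse into as few segments as possible; the two stride patterns above sit at the opposite extremes of both rules.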


Similar articles

A Fast GEMM Implementation On a Cypress GPU

We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ∼ 2 Tflop/s and ∼ 470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical peak performance of the GPU, respectively. Cu...


Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms

Integrated CPU-GPU architectures provide excellent acceleration capabilities for data-parallel applications on embedded platforms while meeting size, weight and power (SWaP) requirements. However, sharing of main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by the GPU. For example, in the NVIDIA Jetso...


Empirical performance modeling of GPU kernels using active learning

We focus on a design-of-experiments methodology for developing empirical performance models of GPU kernels. Recently, we developed an iterative active learning algorithm that adaptively selects parameter configurations in batches for concurrent evaluation on CPU architectures in order to build performance models over the parameter space. In this paper, we illustrate the adoption of the algorith...


GPU-Based spatially variant SR kernel modeling and projections in 3D DIRECT TOF PET Reconstruction

In this study, we develop a GPU-based application to model spatially variant system response (SR) kernels and to perform forward- and back-projection operations with the SR kernels, for the DIRECT iterative reconstruction approach. Modeling the spatially variant SR kernels is efficiently done and handled by introducing three kinds of look-up tables (LUTs) and storing them in different kinds of GPU memo...


A Unified, Hardware-Fitted, Cross-GPU Performance Model

We present a mechanism to symbolically gather performance-relevant operation counts from numerically-oriented subprograms ('kernels') expressed in the Loopy programming system, and apply these counts in a simple, linear model of kernel run time. We use a series of 'performance-instructive' kernels to fit the parameters of a unified model to the performance characteristics of GPU hardware from mul...


Effect of Instruction Fetch and Memory Scheduling on GPU Performance

GPUs are massively multithreaded architectures designed to exploit data level parallelism in applications. Instruction fetch and memory system are two key components in the design of a GPU. In this paper we study the effect of fetch policy and memory system on the performance of a GPU kernel. We vary the fetch and memory scheduling policies and analyze the performance of GPU kernels. As part of...



Journal title:

Volume   Issue

Pages  -

Publication date: 2011