Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs
Authors
Abstract
We present a study of three important kernels that occur frequently in iterative statistical applications: K-Means, Multi-Dimensional Scaling (MDS), and PageRank. We implemented each kernel in OpenCL and evaluated its performance on an NVIDIA Tesla GPGPU card. By examining the underlying algorithms and empirically measuring the performance of the various components of each kernel, we explored optimizing these kernels with four main techniques: (1) caching invariant data in GPU memory across iterations, (2) selectively placing data in different memory levels, (3) rearranging data in memory, and (4) dividing the work between the GPU and the CPU. These optimizations improved performance by up to 5x compared to naïve OpenCL implementations. We believe that these categories of optimizations are also applicable to other similar kernels. Finally, we draw several lessons that would be useful not only in implementing other similar kernels with OpenCL, but also in devising code generation strategies for compilers that target GPGPUs through OpenCL.
Similar resources
Methods for Optimizing OpenCL Applications on Heterogeneous Multicore Architectures
Heterogeneous multicore architectures with CPU and add-on GPUs or streaming processors are now widely used in computer systems. These GPUs provide substantially more computation capability and memory bandwidth compared to traditional multi-cores. Also, because they are highly programmable, they provide the computational performance needed for realistic graphics rendering. Applications with gene...
Iterative statistical kernels on contemporary GPUs
We present a study of three important kernels that occur frequently in iterative statistical applications: Multi-Dimensional Scaling (MDS), PageRank, and K-Means. We implemented each kernel using OpenCL and evaluated their performance on NVIDIA Tesla and NVIDIA Fermi GPGPU cards using dedicated hardware, and in the case of Fermi, also on the Amazon EC2 cloud-computing environment. By examining ...
OCLoptimizer: An Iterative Optimization Tool for OpenCL
Nowadays, computers include several computational devices with parallel capabilities, such as multicore processors and Graphics Processing Units (GPUs). OpenCL enables the programming of all these kinds of devices. An OpenCL program consists of host code, which discovers the computational devices available in the host system and queues up commands to the devices, and the kernel code which defi...
From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide ...
Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version)
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploiting scheduling quirks of today’s GPUs in a manner that does not allow the G...
Journal title:
Volume, issue:
Pages: -
Publication date: 2011