Advantages of High-Level Synthesis in an OpenCL Based FPGA Programming Methodology
Abstract
OpenCL (Open Computing Language) is an open standard for developing parallel applications on a variety of heterogeneous multi-core architectures. Its execution model consists of a host machine connected to, and controlling, a compute device, which performs calculations with a number of parallel, computationally intensive kernels. Since its introduction, it has been reported to support different CPUs, DSPs and GPUs in a variety of heterogeneous configurations. Recently, technological advances in Field Programmable Gate Array (FPGA) devices, with hundreds of GFLOPs, high power efficiency and low cost, have turned the parallel processing community towards them, with a number of publications also proposing OpenCL as a programming language for FPGAs. FPGAs, parallel processing and GPUs have co-existed for many years; however, they were considered isolated and disjoint fields, offering optimizations at different levels of system design. While parallel processing does aim at a higher optimization level than FPGAs, the main reason for this isolation has been the different programming models and languages used in each case, with FPGA programming considered a difficult and delicate job. This is starting to change, however.
As FPGAs made hardware design more widely accepted (compared to ASICs), a new generation of FPGA programming tools based on High-Level Synthesis (HLS) [Coussy and Morawiec 2008], such as Calypto’s CatapultC, Xilinx’s Vivado HLS, Cadence’s C-to-Silicon and Synopsys’ Synphony (to name just a few), promises to bring hardware closer to software. This paper presents a methodology for the adoption of OpenCL as an FPGA programming environment, based on CatapultC. CatapultC accepts untimed behavioral system descriptions in C/C++/SystemC that follow specific coding guidelines and, through a number of directives (or GUI commands), applies HLS transformations to produce optimized, bit-accurate Register Transfer Level (RTL) architectural descriptions.
The methodology of this paper is a systematic application of each HLS transformation by a metaengine that places and tunes CatapultC directives in OpenCL code. The main concern in this process is that, even though CatapultC can produce hardware from C, efficient hardware requires effort and architectural synthesis expertise. This expertise is encoded in the metaengine, which iterates through the feasible HLS directive applications to generate optimized hardware implementations of OpenCL kernels. With this approach, the opportunities as well as the obstacles imposed on the application developer by the FPGA computing platform and by the adoption of C/C++ as input language are investigated, and a systematic way to explore instruction-level, data-level and thread-level parallelism is given. Furthermore, HLS is used for deep design space exploration rather than as a special-purpose, single-phase compiler from software to hardware.
The advantages of the proposed methodology cover both parallel processing and hardware design. First, by using OpenCL, programmers can fine-tune their algorithm for parallel processing, obtain simulation results and make critical decisions about data-level and thread-level parallelism. Second, a hardware designer can take the kernels and produce optimized implementations through HLS, making critical decisions about instruction-level parallelism and FPGA device limitations. Finally, the proposed methodology uses a common input language for the whole development cycle, which can improve collaboration, ease the integration of CPUs, DSPs, GPUs and FPGAs into a common platform, and reduce application development time and cost (one of the main goals of the recent DARPA HPCS program [Dongarra et al. 2008]).
Compared to recent publications that also consider parallel programming models (OpenCL and CUDA) as programming languages for FPGAs, our work takes full advantage of HLS. In [Mingjie et al. 2010] and [Owaida et al. 2011], two methodologies are given for mapping OpenCL kernels to reconfigurable hardware. The methodologies involve compiler optimizations that map kernel code onto fixed hardware templates, which are then written in hardware description languages. While both methodologies are complete and cover many different issues (computations, memory hierarchies and interfacing), the resulting hardware cores are template-based and do not cover lower-level design issues in detail. In [Jaaskelainen et al. 2010], the authors present a similar methodology targeting Application-Specific Processors (ASPs). They use a custom design environment and map OpenCL kernels onto either common or custom ASP instructions. Another approach, closer to this paper, is reported in [Papakonstantinou et al. 2009], where CUDA code is passed through another HLS tool. Directives and pragmas are used to control the tool, but no systematic and iterative application is reported, as in the proposed methodology; HLS is instead treated as a single-pass procedure.
From the industrial point of view, FPGA vendors have been actively involved in the use of OpenCL for FPGA programming (Altera SDK for OpenCL [Czajkowski et al. 2012]), offering a specific framework that takes advantage of the parallelism expressed in OpenCL code and utilizes a custom HLS step. As in [Papakonstantinou et al. 2009], however, no systematic HLS design space exploration is performed. Instead, HLS is treated as a time-consuming task in the overall design process, so HLS iterations are avoided. By contrast, our work relies on HLS iterations for better design space exploration and improved instruction-level parallelism opportunities. The main idea of this paper is a semi-automated methodology that translates OpenCL code into a form suitable for CatapultC, from which hardware is synthesized using HLS.
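The iterative exploration that this work contrasts with single-pass HLS can be pictured with a minimal C sketch. Everything in it is an assumption made for illustration: the names `hls_config`, `hls_cost`, `estimate` and `explore`, the two knobs (unroll factor and initiation interval), and the toy cost model that stands in for an actual HLS run. It is not the paper's metaengine and does not use CatapultC directive syntax.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate configuration for a kernel loop (hypothetical knobs). */
typedef struct {
    int unroll; /* loop unroll factor */
    int ii;     /* pipeline initiation interval, in cycles */
} hls_config;

/* Estimated cost of a configuration; stands in for a real HLS run. */
typedef struct {
    long latency; /* cycles needed for all loop iterations */
    long area;    /* abstract area units */
} hls_cost;

/* Toy cost model (an assumption, not a CatapultC estimate):
 * unrolling shortens the loop but replicates the datapath,
 * and a smaller initiation interval costs extra area. */
hls_cost estimate(hls_config c, long trip_count)
{
    hls_cost r;
    long iters = (trip_count + c.unroll - 1) / c.unroll; /* ceiling division */
    r.latency = iters * c.ii;
    r.area = 100L * c.unroll + 60L / c.ii;
    return r;
}

/* Try every (unroll, ii) pair and keep the lowest-latency one that
 * fits the area budget; ties are broken by smaller area.
 * Returns 1 if a feasible configuration was found, 0 otherwise. */
int explore(long trip_count, long area_budget,
            hls_config *best, hls_cost *best_cost)
{
    static const int unrolls[] = {1, 2, 4, 8};
    static const int iis[] = {1, 2, 4};
    int found = 0;

    assert(trip_count > 0 && best != NULL && best_cost != NULL);

    for (size_t u = 0; u < sizeof unrolls / sizeof unrolls[0]; ++u) {
        for (size_t i = 0; i < sizeof iis / sizeof iis[0]; ++i) {
            hls_config c = { unrolls[u], iis[i] };
            hls_cost k = estimate(c, trip_count);

            if (k.area > area_budget)
                continue; /* does not fit the device budget */

            if (!found || k.latency < best_cost->latency ||
                (k.latency == best_cost->latency && k.area < best_cost->area)) {
                *best = c;
                *best_cost = k;
                found = 1;
            }
        }
    }
    return found;
}
```

In a real flow, `estimate` would be replaced by an actual synthesis run with the chosen directives, which is far more expensive than this toy model and is why single-pass flows avoid the loop altogether.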
Since OpenCL is based on C99, which CatapultC also accepts, this translation does not bring major changes to the input code. The whole process is performed by a custom source-to-source translator (currently a preliminary version implemented with script files), which either infers (where possible) or accepts from the user (hence the term semi-automated) details of the OpenCL code such as pointer sizes, loop bounds, input parameters and expected return values. The basic steps are the following.
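As a hedged illustration of the kind of rewrite described above (not the paper's actual steps or translator output), the sketch below shows a small vector-add kernel turned into plain C. The kernel, the function name `vadd_c` and the size `N` are all assumptions chosen for the example; `N` plays the role of a detail that the user would supply or the translator would infer.

```c
/* Original OpenCL kernel, shown for reference only:
 *
 *   __kernel void vadd(__global const int *a,
 *                      __global const int *b,
 *                      __global int *c)
 *   {
 *       size_t i = get_global_id(0);
 *       c[i] = a[i] + b[i];
 *   }
 */
#include <stddef.h>

#define N 8 /* global work size; an assumed, user-supplied value */

/* Plain C form of the kernel:
 * - unsized __global pointers become arrays with a known size N,
 * - the implicit work-item index becomes an explicit loop with a
 *   static bound, which an HLS tool can unroll or pipeline. */
void vadd_c(const int a[N], const int b[N], int c[N])
{
    for (size_t gid = 0; gid < N; ++gid) {
        c[gid] = a[gid] + b[gid];
    }
}
```

Making the array size and loop bound explicit is what gives a synthesis tool a fixed iteration space to schedule; the arithmetic itself is unchanged from the kernel.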
Related resources
Energy-efficient FPGA Implementation of the k-Nearest Neighbors Algorithm Using OpenCL
Modern SoCs are getting increasingly heterogeneous with a combination of multi-core architectures and hardware accelerators to speed up the execution of compute-intensive tasks at considerably lower power consumption. Modern FPGAs, due to their reasonable execution speed and comparatively lower power consumption, are strong competitors to the traditional GPU based accelerators. High-level Synthe...
Acceleration Framework for FPGA Implementation of OpenVX Graph Pipelines
Computer vision processing is computationally expensive and several acceleration solutions have been proposed. Among them, FPGAs offer a promising direction. Vision applications are typically written in languages such as C/C++ and they are often difficult to compile into an efficient FPGA implementation. OpenVX is a set of basic, widely used vision kernels. Vision pipelines can be defined as gra...
A Comparison of High-Level Design Tools for SoC-FPGA on Disparity Map Calculation Example
Modern SoC-FPGAs, which consist of an FPGA with embedded ARM cores, are becoming popular as an embedded vision system platform. However, the design approach for SoC-FPGA applications still follows the traditional separate hardware-software workflow, which becomes a barrier to rapid product design and iteration on SoC-FPGA. High-Level Synthesis (HLS) and OpenCL-based system-level design approaches provide...
CHO: A Benchmark Suite for OpenCL-based FPGA Accelerators
Programming FPGAs with OpenCL-based high-level synthesis frameworks is gaining attention, with a number of commercial and research frameworks announced. However, there are no benchmarks for evaluating these frameworks. To this end, we present the CHO benchmark suite, an extension of CHStone, a commonly used C-based high-level synthesis benchmark suite, for OpenCL. We characterise CHO at various level...
OpenCL 2.0 for FPGAs using OCLAcc
Designing hardware is a time-consuming and complex process. The realization of both embedded and high-performance applications can benefit from a design process at a higher level of abstraction. This helps to reduce development time and allows the hardware design to be iteratively tested and optimized during development, as is common in software development. We present our tool, OCLAcc, which allows the gen...
Publication date: 2012