Affine Parallelization of Loops with Run-Time Dependent Bounds from Binaries

نویسندگان

Aparna Kotha

Kapil Anand

Timothy Creech

Khaled Elwazeer

Matthew Smithson

Rajeev Barua

چکیده

An automatic parallelizer is a tool that converts serial code to parallel code. This is an important tool because most hardware today is parallel and manually rewriting the vast repository of serial code is tedious and error prone. We build an automatic parallelizer for binary code, i.e. a tool which converts a serial binary to a parallel binary. It is important because: (i) most serial legacy code has no source code available; (ii) it is compatible with all compilers and languages. In the past binary automatic parallelization techniques have been developed and researchers have presented results on small kernels from polybench. These techniques are a good start; however they are far from parallelizing larger codes from the SPEC2006 and OMP2001 benchmark suites which are representative of real world codes. The main limitation of past techniques is the assumption that loop bounds are statically known to calculate loop dependencies. However, in larger codes loop bounds are only known at run-time; hence loop dependencies calculated statically are overly conservative making binary parallelization ineffective. In this paper we present a novel algorithm that enhancing past techniques significantly by guessing the most likely loop bounds using only the memory expressions present in that loop. It then inserts run-time checks to see if these guesses were indeed correct and if correct executes the parallel version of the loop, else the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001 and unlike previous methods the speedups from binary are as good as from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. Among the 8 affine benchmarks among these suites, the best existing binary parallelization method achieves an average speedup of 1.74X, whereas our method achieves a speedup of 3.38X. This is close to the speedup from source code of 3.15X.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Parallelization of Affine Loops using Dependence and Cache analysis in a Binary Rewriter

Title of dissertation: AUTOMATIC PARALLELIZATION OF AFFINE LOOPS USING DEPENDENCE AND CACHE ANALYSIS IN A BINARY REWRITER Aparna Kotha, Doctor of Philosophy, 2013 Dissertation directed by: Professor Rajeev Barua Department of Electrical and Computer Engineering Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. Bridging this disconnect b...

متن کامل

A general compilation algorithm to parallelize and optimize counted loops with dynamic data-dependent bounds

We study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamically computed, data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions: for example, the substitution of closed forms for induction variables remains applicable...

متن کامل

DyVSoR: dynamic malware detection based on extracting patterns from value sets of registers

To control the exponential growth of malware files, security analysts pursue dynamic approaches that automatically identify and analyze malicious software samples. Obfuscation and polymorphism employed by malwares make it difficult for signature-based systems to detect sophisticated malware files. The dynamic analysis or run-time behavior provides a better technique to identify the threat. In t...

متن کامل

Effects of Parallelism Degree on Run-Time Parallelization of Loops

Due to the overhead for exploiting and managing parallelism, run-time loop parallelization techniques with the aim of maximizing parallelism may not necessarily lead to the best performance. In this paper, we present two parallelization techniques that exploit different degrees of parallelism for loops with dynamic crossiteration dependences. The DOALL approach exploits iterationlevel paralleli...

متن کامل

Affine Transformations for Communication Minimized Parallelization and Locality Optimization of Arbitrarily Nested Loop Sequences

A long running program often spends most of its time in nested loops. The polyhedral model provides powerful abstractions to optimize loop nests with regular accesses for parallel execution. Affine transformations in this model capture a complex sequence of execution-reordering loop transformations that improve performance by parallelization as well as better locality. Although a significant am...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2014

Affine Parallelization of Loops with Run-Time Dependent Bounds from Binaries

نویسندگان

چکیده

منابع مشابه

Automatic Parallelization of Affine Loops using Dependence and Cache analysis in a Binary Rewriter

A general compilation algorithm to parallelize and optimize counted loops with dynamic data-dependent bounds

DyVSoR: dynamic malware detection based on extracting patterns from value sets of registers

Effects of Parallelism Degree on Run-Time Parallelization of Loops

Affine Transformations for Communication Minimized Parallelization and Locality Optimization of Arbitrarily Nested Loop Sequences

عنوان ژورنال:

اشتراک گذاری