Generating SIMD Vectorized Permutations

نویسندگان

  • Franz Franchetti
  • Markus Püschel
چکیده

This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for highperformance numerical kernels including the fast Fourier transform. Our generator takes as input only the specification of the target platform’s SIMD vector ISA and the desired permutation. The basic idea underlying our generator is to model vector instructions as matrices and sequences of vector instructions as matrix formulas using the Kronecker product formalism. We design a rewriting system and a search mechanism that applies matrix identities to generate those matrix formulas that have vector structure and minimize a cost measure that we define. The formula is then translated into the actual vector program for the specified permutation. For three important classes of permutations, we show that our method yields a solution with the minimal number of vector shuffles. Inserting into a fast Fourier transform yields a significant speedup.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Portable and Efficient Auto-vectorized Bytecode: a Look at the Interaction between Static and JIT Compilers

Heterogeneity is a confirmed trend of computing systems. Bytecode formats and just-in-time compilers have been proposed to deal with the diversity of the platforms. By hiding architectural details and giving software developers a unified view of the machine, they help improve portability and manage the complexity of large software. Independently, careful exploitation of SIMD instructions has be...

متن کامل

Automatic Generation of Vectorized Montgomery Algorithm

Modular arithmetic is widely used in crytography and symbolic computation. This paper presents a vectorized Montgomery algorithm for modular multiplication, the key to fast modular arithmetic, that fully utilizes the SIMD instructions. We further show how the vectorized algorithm can be automatically generated by the SPIRAL system, as part of the effort for automatic generation of a modular pol...

متن کامل

Decoding billions of integers per second through vectorization

In many important applications—such as search engines and relational database systems—data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature ...

متن کامل

Parallel Algorithm for Generating Permutations on Linear Array

Given n items, a parallel algorithm for generating the n! permutations is presented. This algorithm is designed to run on a linear array consisting of n identical processing elements (PEs for short). These PEs are numbered and referred to as PE(i) for 1 < i < n. Each PE is responsible for producing one component of each permutation. Every PE(i) contains six registers, namely: C(i), T(i), K(i), ...

متن کامل

Wave Equation Based Stencil Optimizations on Multi-core CPU

As the engine for seismic imaging algorithms, stencil kernels modeling wave propagation are both computeand memoryintensive. This work targets improving the performance of wave equation based stencil code parallelized by OpenMP on a multi-core CPU. To achieve this goal, we explored two techniques: improving vectorization by using hardware SIMD technology, and reducing memory traffic to mitigate...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008