Tera-Scale 1D FFT with Low-Communication Algorithm and Intel®
Authors
Abstract
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the teraflop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× what is achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low-communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
Similar resources
Development of an FPGA-Based Two-Transform Pulse Compressor Mr. Skip
Recent advances in Field Programmable Gate Array (FPGA) technologies have resulted in high gate count and high performance FPGA parts which offer a cost-effective and short development cycle solution for computation intensive signal processor applications. These parts provide an attractive middle ground between Commercial Off-the-Shelf (COTS) boards employing Digital Signal Processor (DSP) chip...
Parallel Implementations of the Split-Step Fourier Method for Solving Nonlinear Schrödinger Systems
We present a parallel version of the well-known Split-Step Fourier method (SSF) for solving the Nonlinear Schrödinger equation, a mathematical model describing wave packet propagation in fiber optic lines. The algorithm is implemented under both distributed and shared memory programming paradigms on the Silicon Graphics/Cray Research Origin 200. The 1D Fast-Fourier Transform (FFT) is paralleliz...
Parallel Implementation of Multidimensional Transforms without Interprocessor Communication
This paper presents a modular algorithm which is suitable for computing a large class of multidimensional transforms in a general purpose parallel environment without interprocessor communication. Since it is based on matrix-vector multiplication, it does not impose restrictions on the size of the input data as many existing algorithms do. The method is fully general since it does not depend o...
Using WPT as a New Method Instead of FFT for Improving the Performance of OFDM Modulation
Orthogonal frequency division multiplexing (OFDM) is used in order to provide immunity against very hostile multipath channels in many modern communication systems. The OFDM technique divides the total available frequency bandwidth into several narrow bands. In conventional OFDM, the FFT algorithm is used to provide orthogonal subcarriers. Intersymbol interference (ISI) and intercarrier interferen...
Fast Computation of Radiation Integrals Using the FFT Method for Application in the Analysis of Shaped Reflector Antennas
Journal title:
Volume, issue:
Pages: -
Publication date: 2013