Development of Stockham Fast Fourier Transform using Data-Centric Parallel Programming
Abstract: Writing efficient scientific applications for modern High Performance Computing (HPC) systems has become much harder during the last years due to the rise of heterogeneous computer architectures. The inclusion of hardware accelerators has caused an increase in the number of required technologies to reach high performance, a change that has programming these systems outside the skill-set of a domain scientist. The Data-Centric (DaCe) project aims to ease programming efforts and improve performance portability by separating the implementation of the program from the optimization. To do this, DaCe introduces an Intermediate Representation (IR) called Stateful DataFlow multi- Graph (SDFG), a transformable graph that can represent a wide variety of algorithms. Currently, the DaCe framework only has limited examples, many of which are either stencil calculations or graph calculations. An evaluation of the implementation of a general-purpose algorithm in DaCe would be helpful to show the status of the framework. The Fast Fourier Transform (FFT) is very important to modern science, as it is used for example to perform digital signal processing, solving partial differential equations, and spectral analysis. The transform computes the frequencyspace representation of a sequence with length n in O(n log(n)) operations. The FFT is a sparse representation of the Discrete Fourier Transform (DFT), which takes O(n²) operations to perform. The reduction of required operations enabled the use of Fourier analysis in modern science. In this thesis we implement the Stockham FFT algorithm to evaluate the process of implementing algorithms in DaCes. The algorithm was implemented using the so-called restrictive Python in the DaCe framework, optimized and ported to use Graphics Processing Units (GPUs). In comparison to other state-of-the-art solutions, Fastest Fourier Transform in the West (FFTW) and CUDA Fast Fourier Transform library (cuFFT) our implementation reaches a maximum/minimum of 54 %/0.33% on Central Processing Unit (CPU) and 62 %/10% on GPU respectively on HPC hardware.
AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)