Optimizing Strassen's multiplication algorithm for modern processors : A study in optimizing matrix multiplications for large matrices on modern CPUs

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Author: Robert Welin-berger; Anton Bäckström; [2016]

Keywords: ;

Abstract: This paper examines how to write code to gain high performance on modern computers as well as the importance of well planned data structures. The experiments were run on a computer with an Intel i5-5200U CPU, 8GB of RAM running Linux Mint 17.For the measurements Winograd's variant of Strassen's matrix multiplication algorithm was implemented and eventually compared to Intel's math kernel library (MKL). A quadtree data structure was implemented to ensure good cache locality. Loop unrolling and tiling was combined to improve cache performance on both L1 and L2 cache taking into regard the out of order behavior of modern CPUs. Compiler hints were partially used but a large part of the time critical code was written in pure assembler. Measurements of the speed performance of both floats and doubles were performed and substantial differences in running times were found.While there was a substantial difference between the best implementation and MKL for both doubles and floats at smaller sizes, a difference of only 1\% in execution time was achieved for floats at the size of 2^14. This was achieved without any specific tuning and could be expected to be improved if more time was spent on the implementation.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)