Register Caching for Energy Efficient GPGPU Tensor Core Computing

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: The General-Purpose GPU (GPGPU) has emerged as the predominant computing device for extensive parallel workloads in the fields of Artificial Intelligence (AI) and Scientific Computing, primarily owing to its adoption of the Single Instruction Multiple Thread architecture, which not only provides a wealth of thread context but also effectively hide the latencies exposed in the single threads executions. As computational demands have evolved, modern GPGPUs have incorporated specialized matrix engines, e.g., NVIDIA’s Tensor Core (TC), in order to deliver substantially higher throughput for dense matrix computations compared with traditional scalar or vector architectures. Beyond mere throughput, energy efficiency is a pivotal concern in GPGPU computing. The register file is the largest memory structure on the GPGPU die and typically accounts for over 20% of the dynamic power consumption. To enhance energy efficiency, GPGPUs incorporate a technique named register caching borrowed from the realm of CPUs. Register caching captures temporal locality among register operands to reduce energy consumption within a 2- level register file structure. The presence of TC raises new challenges for Register Cache (RC) design, as each matrix instruction applies intensive operand delivering traffic on the register file banks. In this study, we delve into the RC design trade-offs in GPGPUs. We undertake a comprehensive exploration of the design space, encompassing a range of workloads. Our experiments not only reveal the basic design considerations of RC but also clarify that conventional caching strategies underperform, particularly when dealing with TC computations, primarily due to poor temporal locality and the substantial register operand traffic involved. Based on these findings, we propose an enhanced caching strategy featuring a look-ahead allocation policy to minimize unnecessary cache allocations for the destination register operands. Furthermore, to leverage the energy efficiency of Tensor Core computing, we highlight an alternative instruction scheduling framework for Tensor Core instructions that collaborates with a specialized caching policy, resulting in a remarkable reduction of up to 50% in dynamic energy consumption within the register file during Tensor Core GEMM computations.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)