GPU integration for Deep Learning on YARN

University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT)

Author: Robin Andersson; [2017]


Abstract: In recent years, there have been many advancements in the field of machine learning and widespread adoption of its techniques across industries. Deep learning is a sub-field of machine learning that is credited with many recent innovative applications, such as autonomous driving systems. Training a deep learning model is a computationally intensive task that is in many cases inefficient on a CPU. Dramatically faster training can be achieved by using one or more GPUs; coupled with the need to train increasingly complex models on ever-larger datasets, this means training on a CPU alone is not sufficient. Hops Hadoop is a Hadoop distribution that aims to make Hadoop more scalable by migrating the metadata of YARN and HDFS to MySQL NDB. Hops is currently making efforts to support distributed TensorFlow. However, GPUs are not natively managed by YARN, so in Hadoop, GPUs cannot be scheduled to applications; that is, there is no support for isolating GPUs to applications or managing their access. This thesis presents an architecture for scheduling and isolating GPUs-as-a-resource for Hops Hadoop. In particular, the work is constrained to supporting YARN's most popular scheduler, the Capacity Scheduler. The architecture is implemented and verified against a set of test cases, and it is evaluated quantitatively by measuring the performance overhead of supporting GPUs in a set of experiments. The solution piggybacks GPUs on the existing capacity calculation, in the sense that GPUs are not included as part of the calculation itself. The Node Manager makes use of Cgroups to provide exclusive isolation of GPUs to a container. A GPUAllocator component is implemented that keeps an in-memory state of the GPU devices currently available and those currently allocated locally on the Node Manager. The performance tests conclude that the YARN resource model can be extended with GPUs, and that the overhead is negligible.
With our contribution of extending Hops Hadoop to support GPUs as a resource, we are enabling deep learning on Big Data, and making a first step towards support for distributed deep learning.
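The GPUAllocator described above tracks, per Node Manager, which GPU devices are free and which are bound to containers. A minimal sketch of such a component is shown below; the class name matches the abstract, but the method names, device-id representation, and error handling are illustrative assumptions, not the thesis implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of a per-Node-Manager GPU allocator that keeps an
// in-memory view of available vs. allocated GPU device ids. The real
// component would additionally drive Cgroups device isolation.
public class GPUAllocator {
    private final Set<Integer> available = new HashSet<>();
    private final Set<Integer> allocated = new HashSet<>();

    public GPUAllocator(int numGpus) {
        // Devices are identified here by simple integer ids 0..n-1.
        for (int i = 0; i < numGpus; i++) {
            available.add(i);
        }
    }

    // Grant 'count' GPUs exclusively to one container; throws when the
    // node does not have enough free devices left.
    public synchronized Set<Integer> allocate(int count) {
        if (count > available.size()) {
            throw new IllegalStateException("not enough free GPUs");
        }
        Set<Integer> granted = new HashSet<>();
        for (Integer id : new HashSet<>(available)) {
            if (granted.size() == count) {
                break;
            }
            available.remove(id);
            allocated.add(id);
            granted.add(id);
        }
        return granted;
    }

    // Return a finished container's GPUs to the free pool.
    public synchronized void release(Set<Integer> ids) {
        allocated.removeAll(ids);
        available.addAll(ids);
    }

    public synchronized int freeCount() {
        return available.size();
    }
}
```

For example, on a node with four GPUs, allocating two devices to a container leaves two free; releasing them restores all four, which is the state transition the scheduler relies on between container launches.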
