Managed Distributed TensorFlow with YARN: Enabling Large-Scale Machine Learning on Hadoop Clusters

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Tobias Johansson; [2018]

Abstract: Apache Hadoop is the dominant open source platform for the storage and processing of Big Data. Because the data is already stored in Hadoop clusters, it is advantageous to run TensorFlow applications on the same cluster that holds the input data sets for training machine learning models. TensorFlow supports distributed execution, in which deep neural networks can be trained using a large number of compute nodes. Configuring and launching distributed TensorFlow applications manually is complex and impractical, and becomes more so as the number of nodes grows. This project presents a framework that uses Hadoop's resource manager, YARN, to manage distributed TensorFlow applications. The proposal is a native YARN application with one ApplicationMaster (AM) per job, in which the AM serves as a registry for discovery prior to job execution. Adapting TensorFlow code to the framework typically requires only a few lines of code. Compared with TensorFlowOnSpark, the user experience is very similar, and the collected performance data indicates an advantage to running TensorFlow directly on YARN with no extra layer in between.
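The registry role described for the AM can be illustrated with a small sketch. This is not the thesis's implementation: the class name `Registry`, its methods, and the host names are hypothetical, and the sketch runs in a single process rather than over the network. It only shows the idea that each task reports its address, and the complete cluster description is released once every expected task has registered. The resulting dictionary has the job-name-to-address-list shape that TensorFlow's `tf.train.ClusterSpec` accepts.

```python
import threading
from collections import defaultdict


class Registry:
    """Toy stand-in for the AM's discovery role (hypothetical, in-process only)."""

    def __init__(self, expected):
        # expected: job name -> number of tasks, e.g. {"ps": 1, "worker": 2}
        self._expected = dict(expected)
        # job name -> {task index: "host:port"}
        self._tasks = defaultdict(dict)
        self._lock = threading.Lock()

    def register(self, job, index, address):
        """Record the address that one task is listening on."""
        with self._lock:
            if job not in self._expected or not 0 <= index < self._expected[job]:
                raise ValueError(f"unexpected task {job}:{index}")
            self._tasks[job][index] = address

    def cluster_spec(self):
        """Return job -> ordered address list, or None until all tasks have registered."""
        with self._lock:
            for job, count in self._expected.items():
                if len(self._tasks[job]) < count:
                    return None
            return {
                job: [self._tasks[job][i] for i in range(count)]
                for job, count in self._expected.items()
            }
```

In this sketch a task would call `register` with its own address and then poll `cluster_spec` until it returns a dictionary, at which point every task knows the addresses of all the others and the job can start.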