Spark on Kubernetes using HopsFS as a backing store : Measuring performance of Spark with HopsFS for storing and retrieving shuffle files while running on Kubernetes

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Data is a raw collection of facts and details, such as numbers, words, measurements or observations, that is not useful by itself. Data processing is the set of techniques used to extract useful information from it. Today, the world produces huge amounts of data that cannot be processed using traditional methods. Apache Spark (Spark) is an open-source, distributed, general-purpose cluster computing framework for large-scale data processing. To fulfill its task, Spark uses a cluster of machines to process the data in parallel. The external shuffle service is a distributed component of an Apache Spark cluster that provides resilience in case of a machine failure. A cluster manager helps Spark manage the cluster of machines and provides Spark with the resources required to run an application. Kubernetes is a new cluster manager that enables Spark to run in a containerized environment. However, the external shuffle service cannot be run when Spark uses Kubernetes as the resource manager. This severely degrades the performance of Spark applications because of the failed tasks caused by machine failures. As a solution to this problem, the open-source Spark community has developed a plugin that provides resiliency similar to that of the external shuffle service. When used with Spark applications, the plugin asynchronously backs up the data to external storage. In order not to compromise Spark application performance, it is important that the external storage serves Spark with minimal latency. HopsFS is a next-generation distribution of the Hadoop Distributed File System (HDFS) that provides special support for small files (<64 KB) by storing them in a NewSQL database, which enables it to offer lower client latencies. The thesis work shows that HopsFS provides 16% higher performance to Spark applications for small files as compared to larger ones. The work also shows that using the plugin to back up Spark data on HopsFS can reduce the total execution time of Spark applications by 20%-30% as compared to recalculating tasks in case of a node failure.
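
The mechanism summarized in the abstract can be illustrated with a toy Python sketch: shuffle output is written locally, backed up in the background to an external store, and read from the backup after a simulated node failure. This is only an illustration under stated assumptions. It uses neither the Spark nor the HopsFS API, and every class and function name in it (ToyExternalStore, ToyExecutor, read_shuffle, demo) is hypothetical. The 64 KB threshold is the only value taken from the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

SMALL_FILE_LIMIT = 64 * 1024  # 64 KB small-file threshold mentioned in the abstract


class ToyExternalStore:
    """Hypothetical stand-in for external storage; NOT the HopsFS API."""

    def __init__(self):
        self.small_files = {}  # models the database tier used for small files
        self.large_files = {}  # models ordinary block storage

    def put(self, name, data):
        # Files below the threshold go to the "database" tier.
        tier = self.small_files if len(data) < SMALL_FILE_LIMIT else self.large_files
        tier[name] = data

    def get(self, name):
        if name in self.small_files:
            return self.small_files[name]
        return self.large_files[name]


class ToyExecutor:
    """Hypothetical executor that keeps shuffle output locally and backs it up."""

    def __init__(self, store, pool):
        self.local = {}
        self.store = store
        self.pool = pool
        self.pending = []

    def write_shuffle(self, name, data):
        self.local[name] = data
        # The backup is submitted to a thread pool, so the caller does not wait on it.
        self.pending.append(self.pool.submit(self.store.put, name, data))

    def wait_for_backups(self):
        for future in self.pending:
            future.result()

    def fail(self):
        # Simulate losing the machine: all local shuffle output disappears.
        self.local.clear()


def read_shuffle(executor, store, name):
    """Read local shuffle output if present, otherwise fall back to the backup."""
    if name in executor.local:
        return executor.local[name]
    return store.get(name)


def demo():
    store = ToyExternalStore()
    with ThreadPoolExecutor(max_workers=2) as pool:
        executor = ToyExecutor(store, pool)
        executor.write_shuffle("shuffle_0_0", b"a" * 1024)          # 1 KB, small
        executor.write_shuffle("shuffle_0_1", b"b" * (128 * 1024))  # 128 KB, large
        executor.wait_for_backups()
    executor.fail()
    # The local copy is gone, so this read is served from the backup.
    return store, read_shuffle(executor, store, "shuffle_0_0")
```

Calling demo() returns the store and the recovered contents of the small shuffle file. In this toy model, the fallback read stands in for avoiding task recomputation after a failure; it says nothing about the latencies or speedups measured in the thesis.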
