A performance study for autoscaling big data analytics containerized applications : Scalability of Apache Spark on Kubernetes

University essay from Blekinge Tekniska Högskola/Institutionen för datavetenskap

Abstract: Container technologies are rapidly changing how distributed applications are executed and managed on cloud computing resources. Because containers can be deployed at large scale, there is a strong need for container orchestration tools like Kubernetes that automate deployment, scaling, and management. Recently, the adoption of container technologies like Docker has grown in internal usage, commercial offerings, and application fields ranging from High-Performance Computing to geo-distributed (Edge or IoT) applications. Big Data analytics is another field where there is a trend to run applications (e.g., Apache Spark) as containers for elastic workloads and multi-tenant service models by leveraging container orchestration tools like Kubernetes. Despite the abundant research on the performance impact of containerizing big data applications, to the best of our knowledge, specific aspects like scalability and resource management remain largely unexplored, which leaves a research gap. This research studies the performance impact of autoscaling a big data analytics application on Kubernetes using autoscaling mechanisms like the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). These state-of-the-art autoscaling mechanisms for containerized applications on Kubernetes, and the available big data benchmarking tools for generating workloads on frameworks like Spark, are identified through a literature review. Apache Spark is selected as a representative big data application due to its ecosystem and industry-wide adoption. In particular, a series of experiments is conducted by varying resource parameters (such as CPU requests and limits) and autoscaling mechanisms to measure run-time metrics like execution time and CPU utilization.
Our experimental results show that while Spark achieves better execution time when configured to scale with VPA, it also incurs overhead in CPU utilization. In contrast, autoscaling big data applications with HPA adds overhead in both execution time and CPU utilization. The results of this thesis can be used by researchers and cloud practitioners working with big data applications to evaluate autoscaling mechanisms and achieve better performance and resource utilization.
