Kubernetes Automatic Geographical Failover Techniques

University essay from Luleå tekniska universitet/Institutionen för system- och rymdteknik

Abstract: With the rise of microservice architectures, there is a need for an orchestration tool to manage containers. Kubernetes has emerged as one of the most popular alternatives and has gained widespread adoption. But managing multiple Kubernetes clusters on one's own has proven to be a challenging task. This difficulty has given rise to multiple cloud-based alternatives that streamline the management of a cluster environment and help maintain an extremely high-availability environment that is hard to replicate on premises. Using these cloud-based platforms to host and manage one's system is convenient, but relinquishing control of a system to a cloud provider can mask any illicit behaviour performed on or through the system. This thesis examines alternative designs that automate the execution of a geographical failover between different locations, to better sustain an on-premises, fault-tolerant Kubernetes environment. Multiple tools already exist in the area of Kubernetes service meshes, but their primary focus is not on increasing system resilience but on increasing security, observability and performance. Linkerd is a sidecar-oriented service mesh that supports geographical failover by manually announcing individual services between the clusters' mirror gateways. Cilium offers a Container Network Interface (CNI) plugin that performs routing through eBPF and allows for seamless failover between clusters by managing cross-cluster service endpoints. Both of these service mesh providers handle failover from inside the Kubernetes cluster. The contributions include two new peer-to-peer designs that focus on geographical failover handled outside the cluster; both designs are compatible with pre-existing Kubernetes clusters without internal modifications.
A fully replicated design was then realised as a proof of concept (POC) and tested against a Cilium multi-cluster environment on the metric of north-south traffic latency. Because of the underlying hardware, the tests could only show that the POC can be used for external geographical failover and that it has potential performance capabilities at a limited lab scale. As the purpose of this thesis was not to determine the traffic throughput of a geographical failover solution but to examine different ways in which automatic geographical failover can be implemented, the tests were a success. This thesis therefore concludes that several working solutions exist, and the POC has shown that there are still undiscovered and unimplemented solutions to explore.
