Highly Available Task Scheduling in Distinctly Branched Directed Acyclic Graphs

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Big data processing frameworks utilizing distributed frameworks to parallelize the computing of datasets have become a staple part of the data engineering and data science pipelines. One of the more known frameworks is Dask, a widely utilized distributed framework used for parallelizing data processing jobs. In Dask, the main component that traverses and plans out the execution of the job is the scheduler. Dask utilizes a centralized scheduling approach, having a single server node as the scheduler. With no failover mechanism implemented for the scheduler, the work in progress is potentially lost if the scheduler fails. As a consequence, jobs that might have been executed for hours or longer need to be restarted. In this thesis, a highly available scheduler is designed, based on Dask. We introduce a highly-available scheduler that replicates the state of the job on a distributed key-value store. The replicated schedulers allow us to design an architecture where the schedulers are able to take over the job in case of a scheduler failure. To reduce the performance overhead of replication, we further explore optimizations based on partitioning typical task graphs and sending each partition to its own scheduler. The results show that the replicated scheduler is able to tolerate server failures and is able to complete the job without restarting but at a cost of reduced throughput due to the replication. This is mitigated by our partitioning, which achieves almost linear performance gains relative to our baseline fault-tolerant scheduler, through the utilization of a parallelized scheduling architecture.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)