Sort Merge Buckets: Optimizing Repeated Skewed Joins in Dataflow
Abstract: The amount of data generated and consumed by today’s systems and applications is staggering and growing at a vertiginous rate. Many businesses and organizations rely on the analysis of, and the insights gained from, this data to deliver their services. Due to its massive scale, this data cannot be processed on a single machine; it instead requires parallel processing across multiple workers through horizontal scaling. However, even simple operations become complicated in a parallel environment. One such operation is the join, widely used to connect datasets by matching on the value of a shared key. Data-intensive platforms exist to make this and other operations easier to perform at scale. In 2004, MapReduce was presented, revolutionizing the field by introducing a simpler programming model together with a fault-tolerant and scalable execution framework. MapReduce’s legacy went on to inspire many processing frameworks, including contemporary ones such as Dataflow, used in this work. The Dataflow programming model (2015) is a unified programming model for the parallel processing of data-at-rest and data-in-motion. Although much work has gone into optimizing joins in parallel processing, little of it tackles the problem from a data perspective rather than an engine perspective, which ties solutions to the execution engine. The reference implementation of Dataflow, Apache Beam, abstracts the execution engine away, requiring solutions that are platform-independent. This work addresses the optimization of repeated joins, in which the same join is performed multiple times by different consumers, e.g., user-specific decryption. These joins may also be skewed, creating an uneven work distribution among the workers with a negative impact on performance.
The solution introduced, sort merge buckets, is tested on Cloud Dataflow, the platform that implements the eponymous model, and achieves promising results compared to the baseline in terms of both compute resources and network traffic. For non-skewed inputs, sort merge buckets uses fewer CPU resources after two join operations and shuffles less data after four. Skew-adjusted sort merge buckets is robust to all types and degrees of skewness tested, and outperforms even a single regular join operation in cases of extreme skew.
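The core idea behind a sort-merge-bucket join can be sketched as follows. This is a conceptual, single-process Python illustration only, not the thesis’s Dataflow implementation; the function names (`to_sorted_buckets`, `merge_join`) and the bucket count are invented here for the example. Each dataset is partitioned once into buckets by a hash of the key and each bucket is sorted by key; every subsequent join of two such datasets then reduces to a shuffle-free merge of corresponding buckets.

```python
# Conceptual sketch of a sort-merge-bucket join (illustrative names,
# not the thesis's actual Dataflow implementation).

NUM_BUCKETS = 4  # illustrative; chosen per dataset size in practice


def to_sorted_buckets(records, num_buckets=NUM_BUCKETS):
    """Partition (key, value) records by hash of key, then sort each
    bucket by key. This is done once at write time; every repeated
    join reuses the bucketed, sorted layout."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def merge_join(left_bucket, right_bucket):
    """Inner merge-join of two buckets already sorted by key.
    Because both sides were bucketed with the same hash function,
    matching keys are guaranteed to be in corresponding buckets."""
    out, i, j = [], 0, 0
    while i < len(left_bucket) and j < len(right_bucket):
        lk, _ = left_bucket[i]
        rk, rv = right_bucket[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all left records sharing this key against the
            # current right record, then advance the right side.
            k = i
            while k < len(left_bucket) and left_bucket[k][0] == lk:
                out.append((lk, left_bucket[k][1], rv))
                k += 1
            j += 1
    return out


left = to_sorted_buckets([(1, "a"), (2, "b"), (3, "c")])
right = to_sorted_buckets([(2, "x"), (3, "y"), (4, "z")])
joined = []
for lb, rb in zip(left, right):  # bucket-wise merge, no re-shuffle
    joined.extend(merge_join(lb, rb))
```

In a distributed setting each bucket pair is merged by a separate worker, which is why repeated joins avoid the shuffle cost after the initial bucketing; the skew-adjusted variant additionally splits oversized buckets so no single worker is overloaded.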