Enabling Arbitrary Memory Constraint Standing Queries on Distributed Stream Processors using Approximate Algorithms

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Tobias Lindener; [2018]

Keywords: ;

Abstract: Relational algebra and SQL have been a standard in declarative analytics for decades. Yet, at web-scale, even simple analytics queries can prove challenging within Distributed Stream Processing environments. Two examples of such queries are "count" and "count distinct". Since aforementioned queries require persistence of all keys (the value identifying an element), such queries would result in continuously increasing memory demand. Through approximation techniques with fixed-size memory layouts, said tasks are feasible and potentially more resource efficient within streaming systems. Within this thesis, (1) the advantages of approximate queries on distributed stream processing are demonstrated. Furthermore, (2) the resource efficiency as well as (3) challenges of approximation techniques, and (4) dataset dependent optimizations are presented. The prototype is implemented using the Yahoo Data Sketch library on Apache Flink. Based on the evaluation results and the experiences with the prototype, potential improvements like deeper integration into the streaming framework are presented. Throughout the analysis, the combination of approximate algorithms and distributed stream processing shows promising results depending on the dataset and the required accuracy.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)