Efficient Data Stream Sampling on Apache Flink

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Abstract: Sampling is considered to be a core component of data analysis making it possibleto provide a synopsis of possibly large amounts of data by maintainingonly subsets or multisubsets of it. In the context of data streaming, an emergingprocessing paradigm where data is assumed to be unbounded, samplingoffers great potential since it can establish a representative bounded view ofinfinite data streams to any streaming operations. This further unlocks severalbenefits such as sustainable continuous execution on managed memory, trendsensitivity control and adaptive processing tailored to the operations that consumedata streams.The main aim of this thesis is to conduct an experimental study in order tocategorize existing sampling techniques over a selection of properties derivedfrom common streaming use cases. For that purpose we designed and implementeda testing framework that allows for configurable sampling policiesunder different processing scenarios along with a library of different samplersimplemented as operators. We build on Apache Flink, a distributed streamprocessing system to provide this testbed and all component implementationsof this study. Furthermore, we show in our experimental analysis that there isno optimal sampling technique for all operations. Instead, there are differentdemands across usage scenarios such as online aggregations and incrementalmachine learning. In principle, we show that each sampling policy trades offbias, sensitivity and concept drift adaptation, properties that can be potentiallypredefined by different operators.We believe that this study serves as the starting point towards automatedadaptive sampling selection for sustainable continuous analytics pipelines thatcan react to stream changes and thus offer the right data needed at each time,for any possible operation

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)