Real-time Outlier Detection using Unbounded Data Streaming and Machine Learning

University essay from Luleå tekniska universitet/Datavetenskap

Abstract: Accelerated advancements in technology, the Internet of Things, and cloud computing have spurred an emergence of unstructured data that contributes to rapid growth in data volumes. No human can keep up with monitoring and analyzing these unbounded data streams, so predictive and analytic tools are needed. By leveraging machine learning, this data can be converted into insights that enable data-driven decisions, which can drastically accelerate innovation, improve user experience, and drive operational efficiency. The purpose of this thesis is to design and implement a system for real-time outlier detection using unbounded data streams and machine learning. Traditionally, this is accomplished by using alarm thresholds on important system metrics. Yet a static threshold cannot account for changes in trends and seasonality, changes in the system, or an increased system load. Thus, the intention is to leverage machine learning to instead look for deviations in the behavior of the data that are caused not by natural changes but by malfunctions. The use case driving the thesis is real-time outlier detection in a Content Delivery Network (CDN). The input data includes HTTP error messages received by clients, together with contextual information such as region, cache domains, and error codes, to provide tailor-made predictions that account for the trends in the data. The outlier detection system consists of a data collection pipeline based on stream processing, a MiniBatchKMeans clustering model that clusters incoming data online according to similar characteristics, and an LSTM AutoEncoder that accounts for the temporal nature of the data and detects outlier data points in the clusters. An important finding is that an outlier is defined as an abnormal number of outlier data points all originating from the same cluster, not as a single outlier data point.
Thus, the alerting system implements an outlier percentage threshold. The experimental results show that an outlier is detected within one minute of a cache breakdown. This triggers an alert to the system owners containing graphs of the clustered data, which narrows down the search area for the cause and enables preventive action against the incident. Further results show that within two minutes of fixing the cause, the system provides feedback that the actions taken were successful. Considering the real-time requirements of the CDN environment, it is concluded that this short detection delay is indeed real-time, demonstrating that machine learning is able to detect outliers in unbounded data streams in a real-time manner. Further analysis shows that the system is more accurate during peak hours, when more data is in circulation, than during off-peak hours, despite the temporal LSTM layers. Presumably, this is because the model needs to train on more data to better account for seasonality and trends. Future work necessary to put the outlier detection system into production thus includes more training to improve accuracy and correctness. Furthermore, one could consider implementing the functionality required for a production environment and possibly adding features that automatically avert detected incidents and handle their causes.
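
The alerting rule described in the abstract, where an alert depends on the share of outlier points within one cluster and not on any single point, can be illustrated with a minimal sketch. It assumes that an upstream clustering step (such as the MiniBatchKMeans model) supplies a cluster id and that the LSTM AutoEncoder supplies a reconstruction error per data point. The class name, the per-point error cutoff, the sliding window, and all numeric values are hypothetical choices made for illustration; the abstract does not state how the thesis computes or windows the percentage.

```python
from collections import defaultdict, deque


class ClusterOutlierAlerter:
    """Alert when too many recent points in a single cluster are outliers.

    All parameters are illustrative assumptions, not values from the thesis.
    """

    def __init__(self, error_threshold, outlier_fraction, window_size):
        self.error_threshold = error_threshold    # assumed per-point cutoff
        self.outlier_fraction = outlier_fraction  # assumed alert level, e.g. 0.5
        self.window_size = window_size            # assumed points kept per cluster
        # One bounded window of outlier flags per cluster id.
        self._windows = defaultdict(lambda: deque(maxlen=window_size))

    def observe(self, cluster_id, reconstruction_error):
        """Record one scored point; return True if its cluster should alert."""
        window = self._windows[cluster_id]
        window.append(reconstruction_error > self.error_threshold)
        if len(window) < self.window_size:
            return False  # not enough data yet to judge this cluster
        return sum(window) / len(window) >= self.outlier_fraction


# Hypothetical stream of (cluster_id, reconstruction_error) pairs.
alerter = ClusterOutlierAlerter(error_threshold=1.0,
                                outlier_fraction=0.5,
                                window_size=4)
stream = [(0, 0.2), (0, 3.0), (1, 0.1), (0, 0.3), (0, 2.5), (0, 4.0)]
alerts = [alerter.observe(cluster, error) for cluster, error in stream]
```

In this sketch a single high-error point in cluster 0 does not raise an alert; the alert fires only once half of the last four points in that cluster exceed the cutoff, which mirrors the percentage-based definition of an outlier given above.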
