Utilizing user perceived latency as an indicator for system failure

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Author: Anton Lindström; [2014]


Abstract: Monitoring systems used in industry mainly trigger alerts based on system metrics such as CPU usage, memory usage and disk space. There is also a trend toward alerting on metrics related to end-user experience. This study evaluates the use of playback latency in the Spotify music streaming service as an indicator for detecting system failures. Six months of playback log data were analyzed together with tickets from a system incident tracker. The playback latency distribution was studied using a sliding window aggregation method. A cyclic pattern was found, and two simple anomaly detection algorithms were then applied to the resulting time series. The detected anomalies were matched against tickets from the system incident tracker. For the most efficient algorithm, in terms of finding anomalies that match a ticket, the hit ratio was 57 %. However, since the system incident tracker was the only source of information about system failures, unreported failures may have occurred. Metrics related to end-user experience are in many cases business critical, which motivates monitoring playback latency, although it does not seem to be a silver bullet for finding system failures.
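
The abstract does not say which aggregate or which detection algorithms were used, so the following Python sketch is only an illustration of the general approach: latency samples are aggregated into a time series with a sliding window, and points far above the recent trend are flagged. The function names, the choice of the median as the aggregate, and the "mean plus k standard deviations" rule are assumptions made for this example, not the essay's method.

```python
from statistics import mean, median, pstdev


def sliding_window_median(samples, window, step):
    """Aggregate (timestamp, latency_ms) samples into per-window medians.

    Windows cover [start, start + window) and advance by `step`.
    Returns a list of (window_start, median_latency) for non-empty windows.
    The median is an assumed aggregate; the essay may use another one.
    """
    if not samples:
        return []
    samples = sorted(samples)
    first, last = samples[0][0], samples[-1][0]
    series = []
    start = first
    while start <= last:
        values = [lat for ts, lat in samples if start <= ts < start + window]
        if values:
            series.append((start, median(values)))
        start += step
    return series


def detect_anomalies(series, history, k):
    """Flag window starts whose value exceeds the trailing mean by k sigma.

    `history` is the number of preceding windows used as the baseline.
    This threshold rule is a hypothetical stand-in for the essay's
    "simple anomaly detection algorithms".
    """
    anomalies = []
    for i in range(history, len(series)):
        past = [value for _, value in series[i - history:i]]
        mu, sigma = mean(past), pstdev(past)
        ts, value = series[i]
        if sigma > 0 and value > mu + k * sigma:
            anomalies.append(ts)
    return anomalies


# Made-up latency samples (seconds, milliseconds) with one spike at t=8.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 500, 100]
samples = list(enumerate(latencies))
series = sliding_window_median(samples, window=1, step=1)
spikes = detect_anomalies(series, history=5, k=3)
```

A trailing baseline like this ignores the cyclic pattern mentioned in the abstract; a detector meant for real data would likely compare each window with the same phase of earlier cycles instead.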
