Finding Causal Relationships Among Metrics In A Cloud-Native Environment

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Automatic Root Cause Analysis (RCA) systems aim to streamline the identification of the underlying cause of software failures in complex cloud-native environments. These systems use graph-like structures to represent causal relationships between the components of a software application. The relationships are typically learned from performance and resource utilization metrics of the microservices in the system. To this end, many RCA systems rely on statistical algorithms from the field of causal discovery, which have proven useful not only in RCA but also in a wide range of other domains and applications. Nonetheless, there is a research gap regarding the feasibility and efficacy of multivariate time series causal discovery algorithms for deriving causal graphs in a microservice setting. By applying these algorithms to metric time series data from Prometheus, we aim to shed light on their performance in a cloud-native environment. Furthermore, we introduce an adaptation in the form of an ensemble causal discovery algorithm. Our experiments with this ensemble approach, conducted on datasets with known causal relationships, demonstrate its potential to improve the precision of the detected causal connections. Our ultimate objective was to identify reliable causal relationships within Ericsson's cloud-native system 'X', for which no ground truth is available. In that setting, the ensemble approach mitigates the limitations of relying on any individual causal discovery algorithm and increases confidence in the discovered causal relationships. As a practical illustration of the utility of ensemble causal discovery, we apply the resulting causal graphs to anomaly detection within the Ericsson system.
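
The abstract does not state how the ensemble combines the outputs of the individual algorithms. As a minimal sketch of one common possibility, and not necessarily the method used in the essay, the example below assumes simple majority voting over directed edges: an edge is kept only if enough algorithms report it. The function name `ensemble_edges`, the `min_votes` parameter, and the metric names are all hypothetical and chosen for illustration only.

```python
from collections import Counter


def ensemble_edges(edge_sets, min_votes):
    """Keep directed edges reported by at least `min_votes` algorithms.

    Assumption: each algorithm's output is a set of (cause, effect) pairs.
    """
    # Each algorithm contributes at most one vote per edge.
    votes = Counter(edge for edges in edge_sets for edge in set(edges))
    return {edge for edge, count in votes.items() if count >= min_votes}


# Hypothetical outputs of three causal discovery algorithms.
algo_a = {("cpu", "latency"), ("latency", "errors")}
algo_b = {("cpu", "latency"), ("memory", "latency")}
algo_c = {("cpu", "latency"), ("latency", "errors"), ("errors", "cpu")}

consensus = ensemble_edges([algo_a, algo_b, algo_c], min_votes=2)
```

Raising `min_votes` trades recall for precision: fewer edges survive, but each surviving edge is supported by more algorithms, which matches the abstract's emphasis on precision and confidence when no ground truth is available.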
