Minimizing Blast Radius of Chaos Engineering Experiments via Steady-State Metrics Forecasting

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Chaos Engineering (CE) deliberately injects faults into distributed systems to better understand and improve their resilience. Studying these intentional disruptions yields insights that help improve system performance and the overall user experience. However, two main challenges remain: reducing the negative impact, or "blast radius", of a CE experiment without diluting its value, and identifying a standardized set of metrics to monitor while the experiment runs. This research addresses both challenges by monitoring application- and system-level metrics known as the Golden Signals, together with a steady-state metric called the Apdex score, during a CE experiment. Pearson and Spearman correlation analyses, alongside Granger causality tests, identify a strong connection between the Golden Signals and the Apdex score. The study also introduces a new health-check system design that uses the Apdex score to automatically stop a CE experiment when a preset threshold is violated. The design further introduces a method for terminating the experiment early based on forecasted Apdex scores. This method limits potential system damage while still revealing key system weaknesses, striking a balance between risk and discovery.
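To make the threshold-based stop and the forecast-based early termination concrete, here is a minimal Python sketch. It is illustrative only and is not the thesis's implementation. The Apdex formula is the standard one (satisfied requests plus half of the tolerating requests, divided by all requests). Everything else is an assumption made for illustration: the function names, the 0.5 s target latency, the 0.85 abort threshold, and especially the forecaster, which is a plain least-squares trend line standing in for whatever forecasting model the thesis actually uses.

```python
from typing import Sequence


def apdex(latencies_s: Sequence[float], t: float = 0.5) -> float:
    """Standard Apdex score for one window of request latencies (seconds).

    Satisfied: latency <= t. Tolerating: t < latency <= 4t. The rest are
    frustrated. The default t = 0.5 s is an assumed target, not a thesis value.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)


def forecast_next(scores: Sequence[float]) -> float:
    """Naive one-step-ahead forecast (hypothetical stand-in model).

    Fits a least-squares line through the observed scores and extrapolates
    one step, clamping the result to the valid Apdex range [0, 1].
    """
    n = len(scores)
    if n < 2:
        return scores[-1]
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    den = sum((i - mean_x) ** 2 for i in range(n))
    predicted = mean_y + (num / den) * (n - mean_x)
    return min(1.0, max(0.0, predicted))


def should_abort(scores: Sequence[float], threshold: float = 0.85) -> bool:
    """Health check: stop if the latest OR the forecasted score breaches.

    The 0.85 threshold is an assumed example value.
    """
    return scores[-1] < threshold or forecast_next(scores) < threshold
```

With these assumptions, a window whose scores are still above the threshold but trending downward would trigger an abort one step before the threshold is actually crossed, which is the trade-off the abstract describes between limiting damage and still observing the weakness.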
