A Data-Driven Approach for Incident Handling in DevOps

University essay from Blekinge Tekniska Högskola/Institutionen för programvaruteknik

Abstract: Background: Maintaining system reliability and customer satisfaction in a DevOps environment requires effective incident management. As systems grow more complex, multiple incidents occur every day. Incident prioritization and resolution are essential to manage these incidents and lessen their impact on business continuity. Prioritizing incidents and estimating the recovery time objective (RTO) and resolution times are traditionally subjective processes that rely largely on the DevOps team’s competence. However, as the volume of incidents rises, it becomes increasingly challenging to handle them effectively. Objectives: This thesis aims to develop an approach that uses machine learning to prioritize incidents and estimate the corresponding resolution times and RTO values. The objective is to provide an effective solution to streamline DevOps activities. To verify the performance of our solution, users at a large organization (Ericsson) later evaluate it. Methods: This thesis follows the design science methodology. It starts with the problem identification phase, where a rapid literature review lays the groundwork for the development of the solution. The development phase then follows the Cross-Industry Standard Process for Data Mining (CRISP-DM). In the evaluation phase, a static validation is carried out in a DevOps environment to collect user feedback on the tool’s usability and feasibility. Results: According to the results, the tool helps the DevOps team prioritize incidents and determine the resolution time and RTO. Based on the team’s feedback, 84% of participants agree that the tool is helpful, and 76% agree that the tool is easy to use and understand.
For priority estimation, the tool achieved 93% accuracy, 78% recall, and an 87% F1 score on average across all four priority levels, and BERT reached 88% accuracy in estimating the resolution time range. Hence, we can expect the tool to help make incident response more efficient and reduce resolution time. Conclusions: The tool’s validation and implementation indicate that it has the potential to increase the reliability of the system and the effectiveness of incident management in a DevOps setting. Prioritizing incidents and predicting resolution time ranges based on impact and urgency can enable the DevOps team to make well-informed decisions. Future work could investigate how to integrate the tool with third-party DevOps tools and explore guidelines for handling sensitive incident data. Another direction is to analyze the tool in a live project and obtain feedback.
