Using NLP Techniques for Log Analysis to Recommend Activities For Troubleshooting Processes

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Martin Sköld; [2020]

Abstract: Continuous Integration is the practice of building and testing software every time a code change is merged into the codebase. At each merge, the source code is compiled, dependencies are resolved, and test cases are executed. Detecting a fault at an early stage means that fewer resources need to be spent to find it, since fewer merges need to be checked for errors. In this work, we analyze a dataset from an Ericsson Continuous Integration flow that executes test cases daily. We create models to efficiently classify log events of interest in logs from failing test cases. For all models, each word in a log event is replaced with its corresponding word embedding. The embeddings come from the FastText Continuous Bag of Words and Skip-gram models, which use character n-grams for each word. For Linear Regression, Random Forest, XGBoost, Support Vector Machine, and Multi-layer Perceptron, the word embeddings of a log event are merged into a single vector by weighting each word with its term frequency-inverse document frequency (TF-IDF) computed from the dataset. The best performance was achieved with XGBoost, with a mean F1-score of 0.932 and a standard deviation of 0.034 over 100 runs of 3-fold cross-validation with different seeds. The LSTM model, which takes sequential input, achieved a mean F1-score of 0.896 and a standard deviation of 0.061. These results demonstrate the suitability of our approach for facilitating log analysis and defect detection, reducing the time and effort required from developers.
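
The TF-IDF weighting of word embeddings mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the log events, the two-dimensional embedding table, the smoothed IDF formula, and the helper names `idf_weights` and `weighted_event_vector` are all assumptions made for illustration, and real FastText embeddings would replace the toy vectors.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed inverse document frequency for each token (one common variant)."""
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_event_vector(tokens, embeddings, idf):
    """Merge the embeddings of one log event into a single TF-IDF weighted average."""
    dim = len(next(iter(embeddings.values())))
    # Only tokens with both an embedding and an IDF value contribute.
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in counts.items():
        weight = (count / len(tokens)) * idf[tok]  # term frequency * IDF
        weight_sum += weight
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [value / weight_sum for value in total]


# Hypothetical tokenized log events (not taken from the thesis dataset).
events = [
    ["test", "case", "failed", "timeout"],
    ["test", "case", "passed"],
    ["connection", "timeout", "error"],
]

# Hypothetical 2-dimensional embeddings standing in for FastText vectors.
embeddings = {
    "test": [0.1, 0.2], "case": [0.0, 0.3], "failed": [0.9, 0.1],
    "timeout": [0.8, 0.4], "passed": [0.2, 0.9],
    "connection": [0.5, 0.5], "error": [1.0, 0.0],
}

idf = idf_weights(events)
features = [weighted_event_vector(event, embeddings, idf) for event in events]
```

Each entry of `features` is a fixed-length vector, which is the kind of input the non-sequential classifiers listed in the abstract require; words that are rare across events (such as `failed`) pull the average more strongly than words that appear in most events (such as `test`).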
