Comparing machine learning algorithms for detecting behavioural anomalies

University essay from Blekinge Tekniska Högskola/Institutionen för datavetenskap

Abstract: Background. Attempted intrusions at companies, whether from an insider threat or otherwise, are increasing in frequency. The most commonly used defences are static analysis and filters that stop specific attacks. Using machine learning to detect behavioural anomalies in the access flow of an isolated system can help detect, and stop, attacks faster than previous methods.

Objectives. In this thesis, four algorithms were selected and compared against each other using three different metrics. These metrics were chosen for their importance in an isolated domain. All algorithms were trained on the same dataset, from which anomalies were created to test each model.

Methods. A dataset created for anomaly detection was preprocessed to fit the scenario that was explored. The dataset was then split per user, and only the user with the most samples was used for training the models. To test and evaluate the models, anomalies were forged from a profile built from the metadata belonging to the chosen user. These anomalies, together with a portion of the benign samples, were used to compute the F1 score of each model, and the scores were compared. The best-performing model according to the F1 score was then subjected to hyperparameter tuning to improve its performance further. Afterwards, the time needed to load a model and to predict a single sample was measured, together with the memory consumption of each action.

Results. The results showed that two algorithms performed relatively close to each other, with the preferable choice depending on how strict the memory constraints are. Local Outlier Factor, which used four times the memory (44 MB) of the other models, proved to be the better option in terms of F1 score, at 90.91% after hyperparameter tuning. However, Elliptic Envelope was a close second at 86.61% without hyperparameter tuning, while consuming less memory (11 MB) than the others. Loading the models took 26.68 ms and 2.01 ms, and predicting one sample took 1.87 ms and 0.38 ms, for the two models respectively. The initial loading time is less important since loading is only done once.

Conclusions. Although the dataset is not optimal, the results showed that Local Outlier Factor was the best-performing model, at a slightly higher memory consumption, while remaining accurate and relatively fast. However, depending on how strict the memory constraints are, Elliptic Envelope can also be applicable, given its lower memory consumption.
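The abstract does not spell out how the F1 score or the per-sample prediction time were computed. The following is a minimal Python sketch of one way such measurements could be made, not the thesis's actual code. It assumes that anomalies are treated as the positive class and that predictions use a -1 (anomaly) / +1 (benign) label convention. The function names, the label constants, and the example labels are all hypothetical.

```python
import time
from statistics import median

# Assumed label convention: -1 marks an anomaly, +1 a benign sample.
ANOMALY, BENIGN = -1, 1


def f1_for_anomalies(y_true, y_pred):
    """F1 score with the anomaly label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == ANOMALY and p == ANOMALY)
    fp = sum(1 for t, p in pairs if t == BENIGN and p == ANOMALY)
    fn = sum(1 for t, p in pairs if t == ANOMALY and p == BENIGN)
    if tp == 0:
        # No anomaly was correctly flagged, so F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def median_latency_ms(predict, sample, repeats=100):
    """Median wall-clock time of predict(sample), in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


# Hypothetical test set: 5 forged anomalies followed by 5 benign samples.
y_true = [ANOMALY] * 5 + [BENIGN] * 5
# Hypothetical model output: one missed anomaly and one false alarm.
y_pred = [ANOMALY, ANOMALY, ANOMALY, ANOMALY, BENIGN,
          ANOMALY, BENIGN, BENIGN, BENIGN, BENIGN]
score = f1_for_anomalies(y_true, y_pred)
```

In this made-up example, four of the five anomalies are caught and one benign sample is flagged, so precision and recall are both 4/5 and the F1 score is 0.8. `median_latency_ms` could be pointed at any model's prediction function; it takes the median over repeated calls because a single timing at the sub-millisecond scale reported above tends to be noisy.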
