BERT Language Modelling on Network Log Data for Generalized Unsupervised Intrusion Detection

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Intrusion detection is the most prominent topic of modern computer network security. The potential attack surface is growing exponentially every year. Given the volume of data that accrues, automated methods for detecting undesired network activity are the only feasible solution. Automated intrusion detection is a difficult problem, since no fixed set of rules identifies a single factor that defines suspicious activity. On top of that, these systems need to be highly performant because of the sheer volume of network activity. Even a small error rate could result either in critical events slipping past the defenses or in a flood of irrelevant events being presented for human review, overloading the administrator’s capacity. In this thesis, we investigate whether BERT-like (Bidirectional Encoder Representations from Transformers) models have the potential to improve on state-of-the-art performance in intrusion detection, given their remarkable results in natural language processing (NLP). We compare three variations of the BERT language model and investigate numerous ablations. We conclude that BERT-like models are not natively suitable for log-line data. The masked language modelling (MLM) training task at the core of BERT models was found to be incompatible, in its original form, with the structure of log-line data. The advantages of undirected processing achieved through MLM, which led to the astonishing results in NLP, do not carry over to log text data. Unlike natural language, log-line data is highly structured, with all data from a single source following the same format. Therefore, BERT cannot fully exploit its context sensitivity as it does on natural language data, which indicates that the random masking aspect of MLM is ill-suited to this application on the given data. A possible improvement would be to modify the masking process or to substitute the MLM task.
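To make the random masking aspect of MLM concrete, the following is a minimal sketch of the corruption step from the original BERT recipe (select roughly 15% of positions; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged), applied to a single log line. It is an illustration only and not the thesis implementation: the log line, the whitespace tokenization, the tiny vocabulary, and the name `mask_tokens` are all assumptions made for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token list in the style of BERT's MLM task.

    Returns (corrupted, labels). labels[i] holds the original token at
    every position the model would have to predict, and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Each position is selected independently with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Hypothetical firewall-style log line; the field layout is an assumption.
line = "2021-03-01 12:00:01 ALLOW TCP 10.0.0.5 192.168.1.9 51234 443"
tokens = line.split()          # whitespace split stands in for a real tokenizer
vocab = sorted(set(tokens))    # toy vocabulary built from this one line

corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

Because every line from one source has the same field layout, which positions get masked is the only thing that varies between samples. Masking that respected the field boundaries would be one way to explore the modification the abstract suggests, though that is speculation beyond what the abstract states.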
