Detecting Security Patches in Java OSS Projects Using NLP

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: The use of Open Source Software is becoming more and more popular, but it comes with the risk of importing vulnerabilities in private codebases. Security patches, providing fixes to detected vulnerabilities, are vital in protecting against cyber attacks, therefore being able to apply all the security patches as soon as they are released is key. Even though there is a public database for vulnerability fixes the majority of them remain undisclosed to the public, therefore we propose a Machine Learning algorithm using NLP to detect security patches in Java Open Source Software. To train the model we preprocessed and extract patches from the commits present in two databases provided by Debricked and a public one released by Ponta et al. [57]. Two experiments were conducted, one performing binary classification and the other trying to have higher granularity classifying the macro-type of vulnerability. The proposed models leverage the structure of the input to have a better patch representation and they are based on RNNs, Transformers and CodeBERT [22], with the best performing model being the Transformer that surprisingly outperformed CodeBERT. The results show that it is possible to classify security patches but using more relevant pre-training techniques or tree-based representation of the code might improve the performance.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)