A Framework for Using Deep Learning to Detect Software Vulnerabilities

University essay from Linköpings universitet/Institutionen för datavetenskap; Linköpings universitet/Tekniska fakulteten

Abstract: In recent years, as Internet technology has spread, software vulnerabilities have multiplied, seriously threatening the software security of enterprises and individuals. Although vulnerabilities are difficult to avoid during software development, finding and fixing them as early as possible is an effective countermeasure. Current research on static vulnerability detection falls into two groups: methods based on code similarity and methods based on patterns. Code-similarity methods mainly detect vulnerabilities caused by code cloning and have a high false-negative rate for vulnerabilities with other causes. Pattern-based methods require experts to define vulnerability characteristics manually, which costs considerable time and effort. Moreover, because defining characteristics is a subjective task, the experts' judgement affects the detection results. There is therefore an urgent need for an approach that can detect vulnerabilities arising from various causes and that depends less on experts.

Deep learning is a field of machine learning research that has received extensive attention in recent years. Its use has greatly reduced the need for manual effort, which raises two questions: whether deep learning can also be applied to vulnerability detection, and whether it can reduce the demand on expert resources.

This thesis studies a software vulnerability detection framework based on deep learning. The main research contents are as follows:

1. Collect C/C++ source code covering four types of software vulnerabilities (Function Call, Array Usage, Pointer Usage and Arithmetic Expression) as the experimental dataset. Extract the syntax characteristics of the four vulnerability types, match the dataset against these characteristics, and generate syntax-based code fragments. Then generate program slices for the syntax-based code fragments and convert them into semantic-based code fragments.

2. Process the semantic-based code fragments: replace all strings with a unified string, perform word segmentation (tokenization), and replace all user-defined variable names and user-defined function names. Finally, convert the processed fragments into vector representations.

3. According to the characteristics of software vulnerabilities, select deep learning methods suited to text analysis: Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory, Gated Recurrent Unit (GRU) and Bidirectional Gated Recurrent Unit. Design and implement the four methods so that the accuracy of software vulnerability detection is as high as possible.

4. Select reasonable metrics to evaluate the framework, and compare it with other software vulnerability detection tools to judge its effectiveness.
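The data processing in step 2 can be illustrated with a minimal sketch. It is not the thesis's implementation. The placeholder names (`"STR"`, `VAR1`, `FUN1`), the keyword and library-function lists, the rule that an identifier followed by `(` is a function, and the integer-index encoding are all assumptions made for illustration; the abstract does not say which replacement tokens or which vector representation the thesis uses. The sketch tokenizes first and replaces string literals per token, which gives the same result as replacing strings before tokenizing.

```python
import re

# Illustrative, incomplete lists (assumptions): names that must not be renamed.
KEYWORDS = {"char", "int", "long", "void", "if", "else", "for", "while",
            "return", "sizeof", "struct", "unsigned", "const"}
LIBRARY_FUNCS = {"strcpy", "strncpy", "memcpy", "malloc", "free",
                 "printf", "strlen"}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")

# Word segmentation: string/char literals, identifiers, numbers, operators.
TOKEN_RE = re.compile(r'''
    "(?:\\.|[^"\\])*"                      # string literal
  | '(?:\\.|[^'\\])*'                      # character literal
  | [A-Za-z_]\w*                           # identifier or keyword
  | \d+(?:\.\d+)?                          # numeric literal
  | ->|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|   # multi-character operators
  | \S                                     # any other single character
''', re.VERBOSE)


def normalize(fragment):
    """Tokenize a code fragment and replace strings and user-defined names."""
    tokens = TOKEN_RE.findall(fragment)
    var_names, fun_names = {}, {}
    out = []
    for i, tok in enumerate(tokens):
        if tok.startswith('"'):
            # Every string literal becomes the same unified string.
            out.append('"STR"')
        elif (IDENT_RE.fullmatch(tok)
              and tok not in KEYWORDS and tok not in LIBRARY_FUNCS):
            # Heuristic (assumption): an identifier followed by "(" is a function.
            is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
            table, prefix = (fun_names, "FUN") if is_call else (var_names, "VAR")
            if tok not in table:
                table[tok] = f"{prefix}{len(table) + 1}"
            out.append(table[tok])
        else:
            out.append(tok)
    return out


def encode(tokens, vocab):
    """Map tokens to integer ids (0 is left free for padding)."""
    return [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]


if __name__ == "__main__":
    toks = normalize('char buf[10]; strcpy(buf, get_input("name"));')
    print(toks)
    print(encode(toks, {}))
```

Renaming user-defined identifiers consistently within a fragment keeps the data flow visible (both uses of `buf` map to the same placeholder) while removing project-specific vocabulary. The integer ids would typically feed an embedding layer in front of an LSTM or GRU model.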
