Evaluation of Oversampling Methods For Artificial Neural Network Classification of Lung Cancer

University essay from KTH/Datavetenskap

Author: Alexander Söderhäll; David Cederström; [2022]

Abstract: New methods of assessing lung cancer (LC) risk are being researched. Gregory R. Hart et al. [15] developed an artificial neural network (ANN) that used many features related to LC risk. They showed, with good results, that an ANN could estimate a participant's risk of LC from answers to simple health-related questions. Their dataset was imbalanced and binary, so they faced an imbalanced binary classification problem, which commonly reduces the performance of an ANN. Since oversampling is one solution to this problem, this thesis sets out to answer one research question: What effect does oversampling have on LC risk prediction, using an artificial neural network, when trained on an imbalanced binary dataset? The dataset had a 1:796 ratio of participants classified as having LC to healthy individuals. Three oversampling methods were compared to no oversampling when used to train an artificial neural network for LC risk prediction. The results were taken from the best settings found for each oversampling method. We showed that Random Oversampling (ROS) and Synthetic Minority Oversampling Technique (SMOTE) increased performance metrics commonly used to assess classification on imbalanced binary datasets. Furthermore, the increase in AUROC score was shown to be statistically significant for these two oversampling methods compared to using no oversampling. Synthetic Minority Oversampling Technique for Nominal and Continuous (SMOTE-NC) showed no significant effect; however, a detrimental trend in common performance metrics could be seen compared to no oversampling.
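
To make the idea of Random Oversampling concrete, the following is a minimal sketch and not the thesis's implementation. It assumes a binary label list and duplicates randomly chosen minority-class rows until both classes are the same size. The function name `random_oversample`, the `seed` parameter, and the toy data are hypothetical; only the 1:796 class ratio is taken from the abstract.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Balance a binary dataset by duplicating minority-class rows.

    Illustrative only: rows are drawn with replacement from the
    minority class until its count matches the majority class.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # Identify the minority label and how many rows it is short by.
    minority, n_min = min(counts.items(), key=lambda kv: kv[1])
    n_max = max(counts.values())
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    # Sample the missing rows with replacement from the minority class.
    extra = rng.choices(minority_rows, k=n_max - n_min)
    return list(rows) + extra, list(labels) + [minority] * len(extra)


# Toy data (made up) with the 1:796 ratio stated in the abstract:
# one positive (LC) row and 796 negative (healthy) rows.
rows = [[float(i), float(i % 7)] for i in range(797)]
labels = [1] + [0] * 796

rows_bal, labels_bal = random_oversample(rows, labels)
```

In a typical workflow, oversampling of this kind would be applied to the training split only, so that duplicated minority rows do not leak into validation or test data.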
