Impact of Cell Type Selection on Binary Classification of Cervical Cancer using Convolutional Neural Networks: A Compatibility Analysis of Herlev and SIPaKMeD

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Elias Ram; Elias Stihl [2023]

Abstract: Cervical cancer is one of the most common forms of cancer today, affecting women worldwide. Machine learning classifiers could potentially aid in the diagnosis of cervical cancer, making screening more cost-effective. This thesis studies two cervical cancer datasets, Herlev and SIPaKMeD, and examines how discrepancies between them affect the performance of a binary classifier based on a Convolutional Neural Network. Models trained on Herlev were tested on SIPaKMeD and vice versa to investigate the compatibility between the two datasets. In addition, different binary aggregations were formed by varying the cell types in the training and test data to investigate how cell type selection affects performance. The models trained and tested on the same dataset achieved 92.35% and 98.93% accuracy for Herlev and SIPaKMeD respectively. In contrast to these baseline results, the model trained on Herlev yielded an accuracy of 66.51% when tested on SIPaKMeD. Similarly, the model trained on SIPaKMeD showed a performance drop when tested on Herlev, reaching only 78.69% accuracy. Only when the datasets were modified to become more similar in terms of the cell types included did the accuracies become comparable to the baseline results, reaching 90.12% for the model trained on Herlev and tested on SIPaKMeD, and 98.17% for the model trained on SIPaKMeD and tested on Herlev. These results show that the performance of a binary classifier is heavily affected by the selection of cell types in the training and test data. Thus, it is not sufficient to consider only the binary class when deciding whether to use different data sources for training and test data. Furthermore, there is a need for a more comprehensive dataset, as the two predominant public datasets, Herlev and SIPaKMeD, have been shown to be incompatible with each other.
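
To illustrate what forming a binary aggregation by cell type selection could look like, the following is a minimal Python sketch. It is not taken from the thesis: the function names `aggregate_binary` and `accuracy`, the `(item, cell_type)` sample layout, and the cell type names in the usage note are all assumptions made for illustration. The idea is that samples of unselected cell types are dropped, and the remaining ones are labeled 0 (normal) or 1 (abnormal).

```python
from typing import Iterable, List, Sequence, Tuple


def aggregate_binary(
    samples: Iterable[Tuple[str, str]],
    normal_types: Iterable[str],
    abnormal_types: Iterable[str],
) -> List[Tuple[str, int]]:
    """Build a binary dataset from (item, cell_type) pairs.

    Hypothetical helper: samples whose cell type is in neither group
    are excluded, which is how a cell type selection is expressed here.
    """
    normal, abnormal = set(normal_types), set(abnormal_types)
    if normal & abnormal:
        raise ValueError("a cell type cannot be both normal and abnormal")

    labeled = []
    for item, cell_type in samples:
        if cell_type in normal:
            labeled.append((item, 0))  # 0 = normal
        elif cell_type in abnormal:
            labeled.append((item, 1))  # 1 = abnormal
        # any other cell type is left out of this aggregation
    return labeled


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

As a usage example under the same assumptions, calling `aggregate_binary(samples, ["superficial"], ["carcinoma"])` on samples that also contain a third, unselected cell type keeps only the two selected types. Changing the two lists changes which cells the classifier sees during training or testing, which is the kind of variation the abstract reports as strongly affecting cross-dataset accuracy.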