Comparative Analysis of Transformer and CNN Based Models for 2D Brain Tumor Segmentation

University essay from Linköpings universitet/Institutionen för medicinsk teknik

Abstract: A brain tumor is an abnormal growth of cells within the brain and can be categorized as either primary or secondary. The most common primary tumors in adults are gliomas, which are further classified into high-grade gliomas (HGGs) and low-grade gliomas (LGGs). Approximately 50% of patients diagnosed with HGG die within 1-2 years, so early detection and prompt treatment of brain tumors are essential for effective management and improved patient outcomes.

Brain tumor segmentation is a task in medical image analysis that entails distinguishing brain tumors from normal brain tissue in magnetic resonance imaging (MRI) scans. Computer vision algorithms and deep learning models capable of analyzing medical images can be leveraged for this task, providing automated, reliable, and non-invasive screening for brain tumors and thereby enabling earlier and more effective treatment. For a considerable time, Convolutional Neural Networks (CNNs), including the U-Net, have served as the standard backbone architectures for computer vision. In recent years, the Transformer architecture, which has firmly established itself as the state of the art in natural language processing (NLP), has been adapted to computer vision tasks. The Vision Transformer (ViT) and the Swin Transformer are two architectures derived from the original Transformer that have been successfully employed for image analysis. The emergence of Transformer-based architectures in computer vision calls for an investigation into whether CNNs can be rivaled as the de facto architecture in this field.

This thesis compares the performance of four model architectures: the Swin Transformer, the Vision Transformer, the 2D U-Net, and a 2D U-Net implemented with the nnU-Net framework. The architectures are trained on increasing amounts of brain tumor images from the BraTS 2020 dataset and evaluated on brain tumor segmentation for HGG and LGG combined, as well as for HGG and LGG individually. They are compared on total training time, segmentation time, GPU memory usage, and the evaluation metrics Dice coefficient, Jaccard index, precision, and recall. The 2D U-Net implemented with the nnU-Net framework performs best at correctly segmenting HGG and LGG, followed by the Swin Transformer, the 2D U-Net, and the Vision Transformer. The Transformer-based architectures improve the least when going from 50% to 100% of the training data. Furthermore, when data augmentation is applied during training, the nnU-Net again outperforms the other architectures, followed by the Swin Transformer, the 2D U-Net, and the Vision Transformer. The nnU-Net benefits the least from data augmentation during training, while the Transformer-based architectures benefit the most.

In this thesis we perform a comparative analysis that effectively showcases the distinct advantages of the four model architectures under discussion. Future comparisons could train the architectures on a larger set of brain tumor images, such as the BraTS 2021 dataset. It would also be interesting to explore how Vision Transformers and Swin Transformers pre-trained on either ImageNet-21K or RadImageNet compare to the model architectures of this thesis on brain tumor segmentation.
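For reference, the four evaluation metrics named in the abstract have standard definitions for binary segmentation masks; the following is a minimal statement of those definitions (not the thesis's own evaluation code), with TP, FP, and FN denoting the counts of true-positive, false-positive, and false-negative pixels between a predicted and a ground-truth tumor mask:

    Dice      = 2*TP / (2*TP + FP + FN)
    Jaccard   = TP / (TP + FP + FN)
    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)

The Dice coefficient and Jaccard index are monotonically related (Jaccard = Dice / (2 - Dice)) and therefore rank models identically, while precision and recall distinguish over-segmentation from under-segmentation.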
