Regularizing Vision-Transformers Using Gumbel-Softmax Distributions on Echocardiography Data

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: This thesis introduces a novel approach to model regularization in Vision Transformers (ViTs), a category of deep learning models. It employs stochastic embedded feature selection within the context of echocardiography video analysis, specifically focusing on the EchoNet-Dynamic dataset. The proposed method, termed Gumbel Vision-Transformer (G-ViT), combines ViTs and Concrete Autoencoders (CAE) to enhance the generalization of models predicting left ventricular ejection fraction (LVEF). The model comprises a ViT frame encoder for spatial representation and a transformer sequence model for temporal aspects, forming a Video ViT (V-ViT) architecture that, when used without feature selection, serves as a baseline for LVEF prediction performance. The key contribution lies in the incorporation of stochastic image patch selection in video frames during training. The CAE method is adapted for this purpose, achieving approximately discrete patch selections by sampling from the Gumbel-Softmax distribution, a relaxation of the categorical distribution. The experiments conducted on EchoNet-Dynamic demonstrate a consistent and notable regularization effect. The G-ViT model, trained with learned feature selection, achieves a test R² of 0.66, outperforming both random masking baselines and the full-input V-ViT counterpart (R² of 0.63), and shows improved generalization across multiple evaluation metrics. The G-ViT is compared against recent related work applying ViTs to EchoNet-Dynamic, notably outperforming UltraSwin, an application of Swin Transformers that achieved an R² of 0.59. Moreover, the thesis explores model explainability by visualizing the selected patches, providing insights into how the G-ViT utilizes regions that human experts consider crucial for LVEF prediction. The proposed approach thus extends beyond regularization, offering a unique explainability tool for ViTs. Efficiency aspects are also considered, revealing that the G-ViT model, trained with a reduced number of input tokens, yields comparable or superior results while significantly reducing GPU memory usage and floating-point operations. This efficiency improvement holds potential for reducing energy consumption during training.
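
For readers unfamiliar with the CAE-style selection mechanism the abstract refers to, below is a minimal illustrative sketch in PyTorch of how sampling from the Gumbel-Softmax relaxation can realize approximately discrete, yet differentiable, patch selection. All names and values here (GumbelPatchSelector, n_patches, k, tau) are assumptions chosen for illustration, not the thesis' actual implementation or hyperparameters.

    # Sketch of Concrete/Gumbel-Softmax patch selection: learnable logits
    # define a categorical distribution over image patches, and relaxed
    # samples from it give approximately discrete selections that remain
    # differentiable for end-to-end training.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GumbelPatchSelector(nn.Module):
        """Selects k of n_patches patch tokens via Gumbel-Softmax samples."""

        def __init__(self, n_patches: int, k: int, tau: float = 1.0):
            super().__init__()
            # One row of selection logits per selected patch slot,
            # as in the Concrete Autoencoder formulation.
            self.logits = nn.Parameter(torch.zeros(k, n_patches))
            self.tau = tau

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, n_patches, dim) patch embeddings from the ViT.
            if self.training:
                # Relaxed one-hot samples; hard=True yields discrete picks in
                # the forward pass with a straight-through gradient estimator.
                w = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
            else:
                # At test time, deterministically take the argmax patch.
                w = F.one_hot(self.logits.argmax(-1),
                              self.logits.shape[-1]).float()
            # (k, n_patches) x (batch, n_patches, dim) -> (batch, k, dim)
            return torch.einsum("kn,bnd->bkd", w, tokens)

    # Hypothetical usage: reduce 196 patch tokens per frame to 49.
    selector = GumbelPatchSelector(n_patches=196, k=49)
    frames = torch.randn(8, 196, 768)   # stand-in ViT patch embeddings
    selected = selector(frames)         # -> shape (8, 49, 768)

Because the downstream transformer then sees only the k selected tokens, a selector of this kind also accounts for the reduced GPU memory and floating-point cost mentioned in the abstract.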
