Explainable Multimodal Fusion

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Recently, there has been considerable interest in explainable predictions, with new explainability approaches being developed for individual data modalities such as images and text. However, explainability remains poorly understood and little explored in the multimodal machine learning domain, where diverse data modalities are fused together within a single model. In this thesis project, we investigate two multimodal model architectures, namely single-stream and dual-stream, for the Visual Entailment (VE) task, which comprises image and text modalities. The models considered in this project are UNiversal Image-TExt Representation Learning (UNITER), Visual-Linguistic BERT (VLBERT), Vision-and-Language BERT (ViLBERT) and Learning Cross-Modality Encoder Representations from Transformers (LXMERT). Furthermore, we conduct three different experiments on multimodal explainability by applying the Local Interpretable Model-agnostic Explanations (LIME) technique. Our results show that UNITER achieves the best accuracy among these models on the VE task, while the explainability of all the models is similar.
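
As a rough illustration of the kind of setup the abstract describes (not taken from the thesis itself), the sketch below shows one way LIME's text explainer could be applied to the hypothesis sentence of a VE pair while the premise image is held fixed. The ve_model wrapper, its predict_proba method, and the sampling parameters are assumptions made for illustration only; the thesis's own three experiments may be wired up differently.

    # Minimal sketch: LIME on the text modality of a Visual Entailment model.
    # Assumption: `ve_model.predict_proba(image_features, hypothesis)` is a
    # hypothetical wrapper around a fine-tuned model (e.g. UNITER) returning
    # probabilities over (entailment, neutral, contradiction).

    import numpy as np
    from lime.lime_text import LimeTextExplainer

    CLASS_NAMES = ["entailment", "neutral", "contradiction"]

    def make_classifier_fn(ve_model, image_features):
        """Map a list of (perturbed) hypothesis strings to an
        (n_samples, 3) array of class probabilities, as LIME expects."""
        def classifier_fn(hypotheses):
            probs = [ve_model.predict_proba(image_features, h) for h in hypotheses]
            return np.asarray(probs)
        return classifier_fn

    def explain_hypothesis(ve_model, image_features, hypothesis):
        """Perturb the hypothesis text and fit LIME's local surrogate to see
        which words drive the VE prediction for this fixed image."""
        explainer = LimeTextExplainer(class_names=CLASS_NAMES)
        exp = explainer.explain_instance(
            hypothesis,
            make_classifier_fn(ve_model, image_features),
            num_features=6,    # top words to report per class
            num_samples=500,   # perturbed texts sampled around the instance
            labels=(0, 1, 2),
        )
        for label, name in enumerate(CLASS_NAMES):
            print(name, exp.as_list(label=label))
        return exp

A complementary experiment could apply LIME's image explainer to superpixels of the premise image while holding the hypothesis fixed, which would cover the visual side of the fused input in the same perturbation-based fashion.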
