Multi-modal Models for Product Similarity: Comparative evaluation of unimodal and multi-modal architectures for product similarity prediction and product retrieval

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: With the rapid growth of e-commerce, effective product recommendation systems and better product search for shoppers play a crucial role in driving customer satisfaction. Traditional product retrieval approaches have mainly relied on unimodal models that focus on text data. However, to capture auxiliary context and improve the accuracy of similarity predictions, it is crucial to explore architectures that can leverage additional sources of information, such as images. This thesis compares the performance of multi-modal and unimodal methods for product similarity prediction and product retrieval. Both approaches are applied to two e-commerce datasets, one containing English and the other Swedish product descriptions. A pre-trained multi-modal model called CLIP is used as a feature extractor. Different models are trained on CLIP embeddings using either text-only, image-only or image-text inputs. An extension of triplet loss with margins is tested, along with various training setups. Given the lack of similarity labels between products, product similarity prediction is studied by measuring the performance of a K-Nearest Neighbour classifier applied to features extracted by the trained models. The results demonstrate that multi-modal architectures outperform unimodal models in predicting product similarity. The same is true for product retrieval. Combining textual and visual information seems to lead to more accurate predictions than relying on only one modality. The findings have considerable implications for e-commerce platforms and recommendation systems, providing insights into the effectiveness of multi-modal models for product-related tasks. Overall, the study contributes to the existing body of knowledge by highlighting the advantages of leveraging multiple sources of information for deep learning. It also presents recommendations for designing and implementing effective multi-modal architectures.
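The pipeline described in the abstract (embeddings per modality, a margin-based triplet objective, and K-Nearest Neighbour evaluation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say how text and image embeddings are combined, so concatenation of L2-normalised vectors is used as a stand-in; the loss shown is the standard triplet loss with a single margin, not the thesis's extension; and the short lists stand in for real CLIP embeddings. The function names `fuse`, `triplet_loss` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(text_emb, image_emb):
    """Assumed fusion: concatenate the normalised text and image embeddings."""
    return l2_normalize(text_emb) + l2_normalize(image_emb)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: the positive should be closer to the anchor
    than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def knn_predict(query, gallery, labels, k=3):
    """Majority vote over the labels of the k gallery items nearest to the query."""
    order = sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for CLIP embeddings (two dimensions per modality).
shoe_a = fuse([1.0, 0.0], [0.0, 1.0])
shoe_b = fuse([0.9, 0.1], [0.1, 0.9])
lamp = fuse([0.0, 1.0], [1.0, 0.0])

loss_good = triplet_loss(shoe_a, shoe_b, lamp)  # similar pair already closer
loss_bad = triplet_loss(shoe_a, lamp, shoe_b)   # dissimilar item used as positive
predicted = knn_predict(shoe_a, [shoe_b, shoe_a, lamp], ["shoe", "shoe", "lamp"], k=3)
```

In this toy setup the well-ordered triplet incurs no loss while the swapped one is penalised, and the K-Nearest Neighbour vote stands in for the similarity evaluation used when direct similarity labels are unavailable. A text-only or image-only baseline would pass a single modality's embedding instead of the output of `fuse`.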
