DistilLaBSE: Task-agnostic distillation of multilingual sentence embeddings: Exploring deep self-attention distillation with switch transformers

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: The recent development of massive multilingual transformer networks has resulted in drastic improvements in model performance. These models, however, are so large that they suffer from high inference latency and consume vast computing resources. Such characteristics hinder widespread adoption of the models in industry and in some academic settings. There is therefore growing research into reducing their parameter count and increasing their inference speed, with significant interest in knowledge distillation techniques. This thesis uses the existing approach of deep self-attention distillation to develop a task-agnostic distillation of the Language-agnostic BERT Sentence Embedding (LaBSE) model. It also explores the use of the Switch Transformer architecture in distillation contexts. The result is DistilLaBSE, a task-agnostic distillation of LaBSE that is 10 times faster than LaBSE while retaining over 99% cosine similarity to LaBSE's sentence embeddings on a holdout test set from the same domain as the training samples, namely the OpenSubtitles dataset. It is also shown that DistilLaBSE achieves similar scores when embedding data from two other domains, namely English tweets and customer support banking data. This faster version of LaBSE allows industry practitioners and resource-limited academic groups to apply a more convenient version of LaBSE to their various applications and research tasks.
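The cosine-similarity figure above can be made concrete with a minimal sketch. It assumes the score is the mean cosine similarity between teacher (LaBSE) and student (DistilLaBSE) embeddings of the same sentences; the abstract does not spell out the exact evaluation procedure. The function names and the toy vectors below are hypothetical and are not taken from the thesis.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(
    teacher: Sequence[Sequence[float]],
    student: Sequence[Sequence[float]],
) -> float:
    """Average pairwise similarity; row i of each input embeds sentence i."""
    if len(teacher) != len(student) or len(teacher) == 0:
        raise ValueError("need the same non-zero number of embeddings")
    total = sum(cosine_similarity(t, s) for t, s in zip(teacher, student))
    return total / len(teacher)


# Toy 3-dimensional vectors standing in for real sentence embeddings
# (illustrative values only, not results from the thesis).
teacher_embeddings = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
score = mean_cosine_similarity(teacher_embeddings, student_embeddings)
```

In practice the two lists would hold the teacher and student embeddings of the holdout sentences, and a score close to 1 would indicate that the student reproduces the teacher's embedding directions.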

