Self-Supervised Fine-Tuning of sentence embedding models using a Smooth Inverse Frequency model : Automatic creation of labels with Smooth Inverse Frequency model

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Sentence embedding models play a key role in the field of Natural Language Processing. They can be exploited for the resolution of several tasks like sentence paraphrasing, sentence similarity, and sentence clustering. Fine-tuning pre-trained models for sentence embedding extraction is a common practice that allows it to reach state-of-the-art performance on downstream tasks. Nevertheless, this practice usually requires labeled data sets. This thesis project aims to overcome this issue by introducing a novel technique for the automatic creation of a target set for fine-tuning sentence embedding models for a specific downstream task. The technique is evaluated on three distinct tasks: sentence paraphrasing, sentence similarity, and sentence clustering. The results demonstrate a significant improvement in sentence embedding models when employing the Smooth Inverse Frequency technique for automatic extraction and labeling of sentence pairs. In the paraphrasing task, the proposed technique yields a noteworthy enhancement of 2.3% in terms of F1-score compared to the baseline results. Moreover, it showcases a 0.2% improvement in F1-score when compared to the ideal scenario where real labels are utilized. For the sentence similarity task, the proposed method achieves a Pearson score of 0.71, surpassing the baseline model’s score of 0.476. However, it falls short of the ideal model trained with human annotations, which attains a Pearson score of 0.845. Regarding the clustering task, from a quantitative standpoint, the best model achieves a harmonic mean (calculated using DBCV and cophenetic score) of 0.693, outperforming the baseline score of 0.671. Nevertheless, the qualitative assessment did not demonstrate a substantial improvement for the clustering task, highlighting the need for exploring alternative techniques to enhance performance in this area.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)