Low-Resource Domain Adaptation for Jihadi Discourse: Tackling Low-Resource Domain Adaptation for Neural Machine Translation Using Real and Synthetic Data

University essay from Uppsala universitet/Institutionen för lingvistik och filologi

Abstract: In this thesis, I explore the problem of low-resource domain adaptation for jihadi discourse. Because annotated parallel data is scarce, developing accurate and effective translation models for this domain is challenging. To address this, I propose a method that combines a small, manually created in-domain corpus with a synthetic corpus generated from monolingual data by back-translation. I evaluate the approach by fine-tuning a pre-trained language model on different proportions of real and synthetic data and measuring its performance on a held-out test set. My experiments show that fine-tuning a model on one-fifth real parallel data and synthetic parallel data effectively reduces over-translation and improves the model's ability to translate in-domain terminology. My findings suggest that synthetic data can be a valuable resource for low-resource domain adaptation, especially when real parallel data is difficult to obtain. The proposed method can be extended to other low-resource domains where annotated data is scarce, potentially leading to more accurate models and better translation in those domains.
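
As a rough illustration of the data setup the abstract describes, the following Python sketch builds a synthetic parallel corpus by back-translation and mixes it with real pairs at a chosen proportion. It is not the thesis's implementation: the names `back_translate`, `mix_corpora`, `reverse_translate`, and `real_fraction` are hypothetical, the reverse translation model is left as a caller-supplied function, and reading "one-fifth" as the share of real pairs in the mix is an assumption.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each monolingual target sentence with a machine-generated source.

    `reverse_translate` is a hypothetical target-to-source translation
    function supplied by the caller (for example, a wrapper around an
    existing translation model).
    """
    return [(reverse_translate(t), t) for t in target_sentences]


def mix_corpora(
    real: List[Pair],
    synthetic: List[Pair],
    real_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Pair]:
    """Sample `size` training pairs, `real_fraction` of them from real data."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(size * real_fraction)
    n_synthetic = size - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough pairs for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)  # avoid presenting all real pairs first
    return mixed
```

Under these assumptions, calling `mix_corpora(real, synthetic, real_fraction=0.2, size=10)` would return ten shuffled pairs, two of them drawn from the real corpus, which could then be passed to whatever fine-tuning routine is in use.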
