Fine-Tuning Pre-Trained Language Models for CEFR-Level and Keyword Conditioned Text Generation : Acomparison between Google’s T5 and OpenAI’s GPT-2

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: This thesis investigates the possibilities of conditionally generating English sentences based on keywords-framing content and different difficulty levels of vocabulary. It aims to contribute to the field of Conditional Text Generation (CTG), a type of Natural Language Generation (NLG), where the process of creating text is based on a set of conditions. These conditions include words, topics, content or perceived sentiments. Specifically, it compares the performances of two well-known model architectures: Sequence-toSequence (Seq2Seq) and Autoregressive (AR). These are applied to two different tasks, individual and combined. The Common European Framework of Reference (CEFR) is used to assess the vocabulary level of the texts. In the absence of openly available CEFR-labelled datasets, the author has developed a new methodology with the host company to generate suitable datasets. The generated texts are evaluated on accuracy of the vocabulary levels and readability using readily available formulas. The analysis combines four established readability metrics, and assesses classification accuracy. Both models show a high degree of accuracy when classifying texts into different CEFR-levels. However, the same models are weaker when generating sentences based on a desired CEFR-level. This study contributes empirical evidence suggesting that: (1) Seq2Seq models have a higher accuracy than AR models in generating English sentences based on a desired CEFR-level and keywords; (2) combining Multi-Task Learning (MTL) with instructiontuning is an effective way to fine-tune models on text-classification tasks; and (3) it is difficult to assess the quality of computer generated language using only readability metrics.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)