Context-aware Swedish Lexical Simplification: Using pre-trained language models to propose contextually fitting synonyms

University essay from Linköpings universitet/Institutionen för datavetenskap

Abstract: This thesis presents the development and evaluation of context-aware Lexical Simplification (LS) systems for the Swedish language. In total, three LS models, LäsBERT, LäsBERT-baseline, and LäsGPT, were created and evaluated on a newly constructed Swedish LS evaluation dataset. The LS systems demonstrated promising potential in aiding audiences with reading difficulties by providing context-aware word replacements. While there were areas for improvement, particularly in complex word identification, the systems showed agreement with human annotators on word replacements. The effects of fine-tuning a BERT model for substitution generation on easy-to-read texts were explored, indicating no significant difference in the number of replacements between the fine-tuned and non-fine-tuned versions. Both versions performed similarly in terms of synonymous and simplifying replacements, although the fine-tuned version exhibited slightly reduced performance compared to the baseline model. An important contribution of this thesis is the creation of an evaluation dataset for Lexical Simplification in Swedish. The dataset was automatically collected and manually annotated. Evaluators assessed the quality, coverage, and complexity of the dataset. Results showed that the dataset was of high quality and was perceived to have good coverage. Although the complexity of the complex words was perceived to be low, the dataset provides a valuable resource for evaluating LS systems and advancing research in Swedish Lexical Simplification. Finally, a more transparent and reader-empowering approach to Lexical Simplification is proposed. This new approach embraces the challenges of contextual synonymy and reduces the number of failure points in the conventional LS pipeline, increasing the chances of developing a fully meaning-preserving LS system.
Links to different parts of the project can be found here:
The Lexical Simplification dataset: https://github.com/emilgraichen/SwedishLSdataset
The lexical simplification algorithm: https://github.com/emilgraichen/SwedishLexicalSimplifier
