A compact language model for Swedish text anonymization

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Victor Wiklund; [2020]


Abstract: The General Data Protection Regulation (GDPR), which came into effect in 2018, states that personal information must be anonymized before it can be freely used for research and statistics. To properly anonymize a text, one needs to identify the words that carry personally identifying information, such as names, locations and organizations. Named Entity Recognition (NER) is the task of detecting these kinds of words, and a lot of progress has been made on it in the last decade. This progress can largely be attributed to machine learning, in particular the development of language models that are trained on vast amounts of textual data in the target language. These models are powerful but very computationally demanding to run, which limits their accessibility. ALBERT is a recently developed language model that manages to provide almost the same level of performance at only a fraction of the size. In this thesis we explore the use of ALBERT as a component in Swedish anonymization by combining the model with a one-layer BiLSTM classifier and testing it on the Stockholm-Umeå corpus. The results show that the system can separate personally identifying words from ordinary words 79.4% of the time and that the model performs best at detecting names, with an F1-score of 87.7%. Averaged across eight categories, we obtain an F1-score of 77.8% with five-fold cross-validation and 77.0 ± 0.2% on the test set with 95% confidence. We find that the system as-is could be used for the anonymization of some types of information, but would perhaps be better suited as an aid for a human controller. We discuss ways to enhance the performance of the system and conclude that ALBERT can be a useful component in Swedish anonymization, provided that it is optimized further.
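The abstract reports a per-category F1-score and an average across eight categories. As a rough illustration of how such figures can be computed, the sketch below scores token-level predictions per label and then takes the unweighted (macro) average. It assumes token-level scoring with an outside label "O"; the thesis may use a different scheme (for example entity-level scoring), and the function names, label names and toy data here are hypothetical, not taken from the thesis.

```python
from collections import Counter


def per_label_f1(gold, pred, outside="O"):
    """Token-level F1 per label; the outside label is not scored (assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != outside:
                tp[g] += 1          # correct entity token
        else:
            if p != outside:
                fp[p] += 1          # predicted a label that is not in gold
            if g != outside:
                fn[g] += 1          # missed a gold label
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy data with hypothetical label names, for illustration only.
gold = ["person", "O", "place", "O", "person", "inst"]
pred = ["person", "O", "O", "place", "person", "person"]
scores = per_label_f1(gold, pred)
average = macro_f1(scores)
```

A macro average weights every category equally, so rare categories with low scores pull the average down as much as frequent ones; this is one plausible reason an average can sit well below the score for the best category.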
