Swedish NLP Solutions for Email Classification

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: John Robert Castronuovo; [2020]


Abstract: Assigning categories to text communications is a common task in Natural Language Processing (NLP). In 2018, a new deep learning language representation model, Bidirectional Encoder Representations from Transformers (BERT), was developed which can make inferences from text without task-specific architecture. This research investigated whether a version of this new model could classify emails as accurately as, or better than, a classical machine learning model such as a Support Vector Machine (SVM). In this thesis project, a BERT model pre-trained solely on Swedish text (svBERT) was developed, and it was investigated whether it could surpass the performance of a multilingual BERT (mBERT) model on a Swedish email classification task. Specifically, BERT was used to classify customer emails into fourteen categories defined by the client; all emails were in Swedish. Three different SVMs and four different BERT models were created for this task, and the best F1 score was determined across the classical machine learning models (standard or hybrid) and the deep learning models. The best machine learning model was a hybrid SVM using fastText, with an F1 score of 84.33%. The best deep learning model, mPreBERT, achieved an F1 score of 85.16%. These results show that deep learning models can improve upon the accuracy of classical machine learning models and suggest that more extensive pre-training on a Swedish text corpus would markedly improve accuracy.
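The abstract does not include implementation details, but a minimal sketch of the kind of fine-tuning pipeline it describes might look like the following. It assumes the HuggingFace transformers library and the public bert-base-multilingual-cased checkpoint as a stand-in for mBERT; the example emails, category indices, and hyperparameters are placeholders, not the thesis's actual data or settings.

```python
# Minimal sketch (not the thesis code): fine-tuning a multilingual BERT
# checkpoint for a 14-category Swedish email classification task.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 14  # email categories defined by the client

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=NUM_CLASSES
)

# Placeholder data: (Swedish email text, hypothetical category index).
# The thesis used a private customer-email dataset.
texts = [
    "Hej, jag vill säga upp mitt abonnemang.",   # "I want to cancel my subscription."
    "Min faktura verkar vara felaktig.",          # "My invoice seems to be incorrect."
]
labels = [3, 7]  # hypothetical category indices

enc = tokenizer(texts, padding=True, truncation=True,
                max_length=256, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Standard BERT fine-tuning loop with a small learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    labels=y)  # returns cross-entropy loss and logits
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: predict a category index for a new email.
model.eval()
with torch.no_grad():
    test = tokenizer(["Tack för snabbt svar!"], return_tensors="pt")
    pred = model(**test).logits.argmax(dim=-1).item()
print(f"Predicted category index: {pred}")
```

A comparable classical baseline in the spirit of the thesis's hybrid SVM would replace the model above with fastText sentence vectors fed into an SVM classifier, which is the pipeline the reported 84.33% F1 score refers to.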
