Neural Cleaning of Swedish Textual Data: Using BERT-based Methods for Token Classification of Running and Non-Running Text

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Modern natural language processing methods require large textual datasets to function well. A common approach is to scrape the internet for the needed data. This does, however, come with the issue that some of the data may be unwanted – for instance, spam websites. As a consequence, the datasets become larger and training costs increase. This thesis defines text written by humans as running text, and automatically generated text as non-running text. The goal of the thesis was to fine-tune KB-BERT, a BERT model pre-trained on Swedish textual data, to classify tokens as either running or non-running text. To this end, texts from the Swedish C4 corpus were manually annotated; in total, 1000 texts were annotated and used for the fine-tuning phase. As the annotated data was somewhat skewed in favour of running text, it was also tested how balancing the training data with class weights affected the end results. Without class weights, the BERT-based method achieved a precision and recall for non-running text of 95.13% and 78.84%, and for running text a precision and recall of 83.87% and 96.46%. With class weights, the precision and recall for non-running text were 90.08% and 87.40%, and for running text 89.36% and 92.40%. These results show that class weights can be used to adjust how strict the model is, depending on one's needs – for instance, the purpose and the amount of available textual data. The manually annotated dataset is too small to draw a definite conclusion from, but this thesis shows that a BERT-based method has the potential to handle problems such as these, as it produced much better results than a simpler baseline method. Further research in this area of natural language processing is therefore encouraged.
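To make the described setup concrete, the following is a minimal sketch of binary token classification with class-weighted cross-entropy, using the publicly available KB-BERT checkpoint (Hugging Face id "KB/bert-base-swedish-cased"). The label names, the weight values, and the helper function are illustrative assumptions, not the thesis's exact training configuration.

# A minimal sketch, assuming the Hugging Face Transformers library and
# the "KB/bert-base-swedish-cased" checkpoint; label names and class
# weights below are hypothetical, chosen only to illustrate the idea.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["running", "non-running"]  # assumed label set

tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "KB/bert-base-swedish-cased", num_labels=len(LABELS))

# Up-weight the rarer non-running class to counter the skew toward
# running text; -100 marks positions to ignore (special tokens,
# padding), following the usual Hugging Face convention.
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]),
                                    ignore_index=-100)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(text: str, labels: torch.Tensor) -> float:
    """One weighted training step; `labels` holds one id per token."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
    loss = loss_fn(logits.view(-1, len(LABELS)), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: all tokens labelled as running text (id 0), with -100 on
# the [CLS] and [SEP] special tokens so they are excluded from the loss.
text = "Detta aer en vanlig mening skriven av en maenniska."
n = tokenizer(text, return_tensors="pt")["input_ids"].shape[1]
labels = torch.full((1, n), 0)
labels[0, 0] = -100
labels[0, -1] = -100
print(train_step(text, labels))

Raising the weight of the non-running class trades precision for recall on that class, which matches the abstract's observation that class weights let one tune how strict the model is.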
