Unsupervised Normalisation of Historical Spelling: A Multilingual Evaluation

University essay from Uppsala universitet/Institutionen för lingvistik och filologi

Abstract: Historical texts are an important resource for researchers in the humanities. However, standard NLP tools typically perform poorly on them, mainly because of the spelling variation present in such texts. One possible solution is to normalise the variant spellings to their equivalent contemporary word forms before applying standard tools. Weighted edit distance has previously been used for such normalisation, improving on the results of algorithms based on standard edit distance. Extracting the weights requires aligned training data, but such data is scarce. An unsupervised method for extracting edit distance weights is therefore desirable. This thesis presents a multilingual evaluation of an unsupervised method for extracting edit distance weights for the normalisation of historical spelling variation. The model is evaluated on English, German, Hungarian, Icelandic and Swedish. The results are mixed and vary considerably between data sets. The method generally performs better than normalisation based on standard edit distance but, as expected, does not quite match the results of a model trained on aligned data. Compared to standard edit distance normalisation, normalisation accuracy increases for all languages except German, where it is slightly lower, and Swedish, where the results are similar.
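To illustrate the general idea of normalisation with weighted edit distance, the following is a minimal sketch in Python, not the implementation evaluated in the thesis. The function names, the weight table format and the example weight for replacing "v" with "u" are assumptions made for illustration only. The sketch does not cover the unsupervised extraction of weights, which is the subject of the thesis.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical weight table: (historical char, modern char) -> cost.
# ("a", "") denotes deleting "a"; ("", "b") denotes inserting "b".
Weights = Dict[Tuple[str, str], float]


def weighted_edit_distance(source: str, target: str, weights: Weights,
                           default_cost: float = 1.0) -> float:
    """Levenshtein distance where each edit may carry its own cost.

    Edits missing from `weights` cost `default_cost`, so an empty
    table gives the standard edit distance.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + weights.get((source[i - 1], ""), default_cost)
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + weights.get(("", target[j - 1]), default_cost)
    for i in range(1, rows):
        for j in range(1, cols):
            s, t = source[i - 1], target[j - 1]
            substitution = 0.0 if s == t else weights.get((s, t), default_cost)
            dist[i][j] = min(
                dist[i - 1][j - 1] + substitution,
                dist[i - 1][j] + weights.get((s, ""), default_cost),
                dist[i][j - 1] + weights.get(("", t), default_cost),
            )
    return dist[-1][-1]


def normalise(word: str, lexicon: Sequence[str], weights: Weights) -> str:
    """Return the lexicon entry closest to `word` (first entry wins ties)."""
    return min(lexicon, key=lambda cand: weighted_edit_distance(word, cand, weights))


if __name__ == "__main__":
    lexicon = ["into", "unto"]       # toy modern lexicon (assumption)
    learned = {("v", "u"): 0.1}      # illustrative weight, not from the thesis
    print(normalise("vnto", lexicon, {}))
    print(normalise("vnto", lexicon, learned))
```

With an empty weight table both candidates are one substitution away from "vnto", so the choice depends only on lexicon order. Once the v-to-u substitution is made cheap, "unto" becomes the clear best match, which is the effect that weights extracted from data are meant to achieve.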
