Identifying Linguistic Variants in Middle Low German : Using tools from Data Science and Natural Language Processing to automate the analysis of Middle Low German texts

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Lovisa Eriksson; William Brenham Hooper; [2023]

Abstract: During the 15th century, the printing press was introduced to Germany and to Europe as a whole. It has long been believed that standardization of the written language came as a direct consequence of this invention, but the studies supporting this theory are small in scale. To make a larger study more feasible, we investigate ways to automate parts of the research process, providing tools and methods to create frequency lists and lists of collocations, to summarize and analyze the distribution of abbreviations, and to streamline further linguistic analysis. To create the frequency lists, open compounds and words broken off at line breaks are handled by a novel method that uses rules and probability estimates based on frequencies of words and word sequences in the Reference Corpus Middle Low German/Low Rhenish (1200-1650). The method correctly classifies 99.5% of word pairs, with inaccuracy primarily stemming from unique spelling variants at line breaks. For identifying collocations, several measures of association are applied. Variants of these measures for three-word collocations are presented, as are variants with smoothing applied; the latter address the overconfidence of the established measures in short texts. The proposed methods show promising results, but they cannot be fully evaluated due to the lack of test sets. Abbreviation usage in the texts is summarized using histograms and other plots resembling the pages, giving an easily interpretable result. The graphs show that abbreviations mainly occur towards the right margin, but in some texts they are also denser at the left or the bottom of the page. To cluster pages and quires based on abbreviation usage, k-means and HDBSCAN are applied, but neither gives clear, logical clusters, suggesting that abbreviation distribution does not depend significantly on the typesetter or on the page's position in a quire.
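As an illustration of the collocation step described above, the following is a minimal sketch of one established association measure, pointwise mutual information (PMI), with simple add-k smoothing. The abstract does not say which measures or which smoothing scheme the essay uses, so the choice of PMI, the add-k scheme, the function name smoothed_pmi and the toy token list are all assumptions made here for illustration, not the authors' method.

```python
import math
from collections import Counter


def smoothed_pmi(tokens, k=1.0):
    """Score every adjacent word pair with add-k smoothed PMI (log base 2).

    Hypothetical helper for illustration only. With k=0 this reduces to
    plain, unsmoothed PMI.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    vocab = len(unigrams)

    scores = {}
    for (w1, w2), c12 in bigrams.items():
        # Add-k smoothing: k pseudo-counts for every possible word / word pair.
        p12 = (c12 + k) / (n_bi + k * vocab * vocab)
        p1 = (unigrams[w1] + k) / (n_uni + k * vocab)
        p2 = (unigrams[w2] + k) / (n_uni + k * vocab)
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores


# Toy input (placeholder tokens, not real corpus data).
toy_tokens = ["a", "b", "a", "b", "c", "d"]
raw = smoothed_pmi(toy_tokens, k=0.0)
smooth = smoothed_pmi(toy_tokens, k=1.0)
for pair in sorted(raw):
    print(pair, round(raw[pair], 3), round(smooth[pair], 3))
```

In a short text, unsmoothed PMI gives a pair of words that each occur only once a very high score; the pseudo-counts pull such scores down, which is one way to temper the overconfidence mentioned in the abstract.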
To streamline the general research process of Variational Linguistics, a new tool is presented. It has clear similarities with current tools, but brings several key improvements: allowing multiple variables to be examined simultaneously, adding further special search characters to increase efficiency, introducing an additional examination step to remove false positives in bulk, and directly showing the page numbers where the found variants occur.
