A graphotactic language metric

University essay from KTH/Matematik (Inst.)

Author: Joar Bagge; [2013]

Abstract: In this bachelor’s thesis, we try to classify and identify written human languages by studying the ordering of letters in text. Automatic language identification is of interest in areas such as text indexing, machine translation and natural language parsing. Eleven written languages that use the Latin alphabet are considered, and each is modelled as a Markov chain at the letter level. Texts from the New Testament and Wikipedia are used as training data. The distances between the languages are then measured by applying a matrix-based metric to the transition matrices, and are visualized in a dendrogram. A probability-based distance measure is also used. The matrix-based metric is then applied to language identification by creating a transition matrix for the text whose language is to be identified and measuring the distance from this matrix to the transition matrix of each known language; the shortest distance indicates the language of the text. This is compared with maximum-likelihood classification. We compare metrics based on different matrix norms, and also study how the order of the Markov chains and the size of the training data and of the sample texts for language identification influence the results. The results indicate that the choice of matrix norm is important and that the Frobenius norm and the 1-norm are the best norms for language classification and language identification. Using these, it is possible to generate satisfactory dendrograms and to accurately identify the language of reasonably large texts. On the other hand, the 2-norm cannot be recommended in this context; an explanation is given for its bad performance. Some languages are easier to classify correctly than others: the Scandinavian languages are easy to group together, as are Spanish, Portuguese and Italian. However, English, French, German and Finnish are harder to classify correctly.
Keywords: Written human languages, Language classification, Language identification, Markov chain model, Matrix norms, Statistical analysis of text.
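
The identification procedure described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the 27-symbol alphabet (a to z plus space), the first-order chain, the function names and the toy training strings are all assumptions made for illustration. A realistic setup would extend the alphabet with accented letters and train on far larger texts.

```python
import math
import string

# Assumption: 26 unaccented letters plus space; other characters become spaces.
ALPHABET = string.ascii_lowercase + " "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text):
    """Lowercase the text, map unknown characters to spaces, collapse spaces."""
    chars = [ch if ch in INDEX else " " for ch in text.lower()]
    return " ".join("".join(chars).split())


def transition_matrix(text):
    """Estimate a first-order letter transition matrix from a text."""
    n = len(ALPHABET)
    counts = [[0.0] * n for _ in range(n)]
    clean = normalize(text)
    for a, b in zip(clean, clean[1:]):
        counts[INDEX[a]][INDEX[b]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for letters that never occur are left as all zeros.
        matrix.append([c / total for c in row] if total > 0 else row)
    return matrix


def frobenius_distance(p, q):
    """Frobenius norm of the difference p - q."""
    return math.sqrt(
        sum((a - b) ** 2 for rp, rq in zip(p, q) for a, b in zip(rp, rq))
    )


def one_norm_distance(p, q):
    """Induced 1-norm of p - q: the maximum absolute column sum."""
    n = len(p)
    return max(sum(abs(p[i][j] - q[i][j]) for i in range(n)) for j in range(n))


def identify(sample, models, distance=frobenius_distance):
    """Return the language whose transition matrix is closest to the sample's."""
    m = transition_matrix(sample)
    return min(models, key=lambda lang: distance(m, models[lang]))


# Toy usage; the training strings are placeholders, not the thesis data.
models = {
    "en": transition_matrix("the quick brown fox jumps over the lazy dog and the cat"),
    "sv": transition_matrix("den snabba bruna raven hoppar over den lata hunden och katten"),
}
print(identify("the dog and the fox", models))
```

With training texts this short the result is not reliable, which is consistent with the abstract's remark that accurate identification requires reasonably large texts. Passing `distance=one_norm_distance` to `identify` swaps in the other norm that the abstract reports as performing well.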
