A comparative study of the grammatical gender systems of languages by means of analysing word embeddings
Abstract: The creation of word embeddings is one of the key breakthroughs in natural language processing. Word embeddings allow for words to be represented semantically, opening the way to many new deep learning methods. Understanding what information is in word embeddings will help understanding the behaviour of embeddings in natural language processing tasks, but also allows for the quantitative study of the linguistic features such as grammatical gender. This thesis attempts to explore how grammatical gender is encoded in word embeddings, through analysing the performance of a neural network classifier on the classification of nouns by gender. This analysis is done in three experiments: an analysis of contextualized embeddings, an analysis of embeddings learned from modified corpora and an analysis of aligned embeddings in many languages. The contextualized word embedding model ELMo has multiple output layers with a gradual increasing presence of semantic information in the embedding. This differing presence of semantic information was used to test the classifier's reliance on semantic information. Swedish, German, Spanish and Russian embeddings were classified at all layers of a three layered ELMo model. The word representation layer without any contextualization was found to produce the best accuracy, indicating the noise introduced by the contextualization was more impactful than any potential extra semantic information. Swedish embeddings were learned from a corpus stripped of articles and a stemmed corpus. Both sets of embeddings showed an drop of about 6% in accuracy in comparison with the embeddings from a non-augmented corpus, indicating agreement plays a large role in the classification. Aligned multilingual embeddings were used to measure the accuracy of a grammatical gender classifier in 24 languages. The classifier models were applied to data of other languages to determine the similarity of the encoding of grammatical gender in these embeddings. Correcting the results with a random guessing baseline shows that transferred models can be highly accurate in certain language combinations and in some cases almost approach the accuracy of the model on its source data. A comparison between transfer accuracy and phylogenetic distance showed that the model transferability follows a pattern that resembles the phylogenetic distance.
AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)