Exploring Cross-lingual Sublanguage Classification with Multi-lingual Word Embeddings

University essay from Linköpings universitet/Statistik och maskininlärning

Author: Min-chun Shih; [2020]

Abstract: Cross-lingual text classification is an important task because of globalization and the increased availability of multilingual data. This thesis explores a method for cross-lingual classification on Swedish and English medical corpora. Specifically, it uses a simple convolutional neural network (CNN) with MUSE pre-trained word embeddings for binary classification of sublanguages (“lay” and “specialized”), transferring from Swedish healthcare texts to English healthcare texts. MUSE is a library that provides state-of-the-art multilingual word embeddings and large-scale, high-quality bilingual dictionaries. The thesis presents experiments with imbalanced and balanced class distributions in the training and test data to examine the effect of class distribution, and also examines the influence of clean versus noisy test datasets. The results show that training on data with a balanced class distribution yields significantly better performance than training on data with an imbalanced class distribution, and that clean test data benefits the transfer of labels from one language to another. The thesis also compares the performance of the simple CNN model with a Naive Bayes baseline. The results show that, on this task, a simple Naive Bayes classifier based on bag-of-words features translated using the MUSE English-Swedish dictionary outperforms the simple CNN model based on MUSE pre-trained word embeddings in several experimental settings.
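
To make the baseline idea concrete, the following is a minimal sketch of a dictionary-translated bag-of-words Naive Bayes classifier. It is not the thesis implementation. The toy lexicon `EN_SV` stands in for a MUSE English-Swedish dictionary, the four training sentences are invented, the translation direction (English test tokens mapped into Swedish) is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy English->Swedish lexicon standing in for a MUSE dictionary.
EN_SV = {
    "chest": "bröstet", "pain": "ont", "acute": "akut",
    "myocardial": "myokardiell", "infarction": "infarkt", "angina": "angina",
}

# Invented Swedish training examples with sublanguage labels.
TRAIN = [
    ("ont i bröstet", "lay"),
    ("jag har ont i magen", "lay"),
    ("akut myokardiell infarkt", "specialized"),
    ("infarkt med angina", "specialized"),
]

def train_nb(docs):
    """Collect per-class document counts and word counts (bag-of-words)."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def translate(tokens, lexicon):
    """Map English tokens to Swedish, dropping words missing from the lexicon."""
    return [lexicon[t] for t in tokens if t in lexicon]

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)
        for tok in tokens:
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb(TRAIN)
lay_guess = predict_nb(model, translate("chest pain".split(), EN_SV))
spec_guess = predict_nb(model, translate("acute myocardial infarction".split(), EN_SV))
print(lay_guess, spec_guess)
```

The classifier is trained only on Swedish text, and English input reaches it solely through the word-by-word dictionary lookup, which illustrates how labels can transfer across languages without any English training data.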
