Analysis of Short Text Classification Strategies Using Out-of-domain Vocabularies

University essay from KTH / School of Electrical Engineering and Computer Science (EECS)

Author: Diego Roa (2018)


Abstract: Short text classification has become an important task for the Natural Language Processing (NLP) community due to the rapidly growing volume of tweets, search queries, short reviews and descriptions in contexts such as e-commerce, social media and internal Enterprise Resource Planning (ERP) systems. The brevity and sparsity of such text data make it challenging to build accurate classification models. To overcome these challenges, most approaches proposed in the existing literature rely on implicit or explicit text representations, but the effect of these strategies on data with uncommon vocabulary has not been analyzed. In this work, we conduct a series of analyses to understand the performance, contribution and effect of implicit, explicit and hybrid representations on the final classification results when the datasets contain a high percentage of unknown or rare vocabulary. The results show that classic approaches such as Bag of Words (BoW), used as input to a Logistic Regression classifier, can be more suitable for this kind of task than approaches that add semantics from external knowledge bases as part of the classification pipeline. Additionally, pretrained word embeddings trained on external sources, used as a strategy to obtain implicit text representations, clearly outperform embeddings trained on local sources, whereas the improvement from explicit representations depends heavily on the quality of the concepts that can be obtained from external knowledge bases.
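To illustrate the classic baseline mentioned in the abstract, the following is a minimal sketch of a Bag of Words representation feeding a Logistic Regression classifier. It assumes a scikit-learn pipeline and uses hypothetical placeholder texts and labels; it is not the author's exact experimental setup.

    # Minimal sketch: BoW features as input to a Logistic Regression classifier.
    # Assumes scikit-learn; texts and labels below are hypothetical placeholders.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical short texts (e.g., product descriptions) with category labels.
    texts = [
        "wireless noise cancelling headphones",
        "stainless steel kitchen knife set",
        "bluetooth portable speaker",
        "cast iron frying pan",
    ]
    labels = ["electronics", "kitchen", "electronics", "kitchen"]

    # CountVectorizer builds the Bag of Words representation; Logistic Regression classifies it.
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)

    print(model.predict(["usb rechargeable speaker"]))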
