Comparing Feature Extraction Methods and Effects of Pre-Processing Methods for Multi-Label Classification of Textual Data

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: This thesis investigates how different feature extraction methods applied to textual data affect the results of multi-label classification. Multi-label classification is useful for categorizing items, such as pieces of music or news articles, that may belong to multiple classes or topics. Two Bag of Words extraction methods are used, the Count Vector and TF-IDF approaches, along with a word embedding method, GloVe. The effect of pre-processing methods, such as N-grams, stop-word elimination, and stemming, is also investigated. Two classifiers, an SVM and an ANN, perform the multi-label classification using a Binary Relevance approach. The results indicate that the choice of extraction method has a meaningful impact on the resulting classifications, but that no single method consistently outperforms the others. Instead, the GloVe extraction method performs best on the recall metrics, while the Bag of Words methods perform best on the precision metrics.
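
To make the terms in the abstract concrete, the following is a minimal pure-Python sketch of Count Vector and TF-IDF extraction, with stop-word elimination and N-grams as pre-processing, plus the label decomposition behind Binary Relevance. It is an illustration only and not the thesis's implementation: the function names, the toy documents and labels, and the idf formula ln(N / df) are all assumptions chosen for brevity. Stemming, GloVe embeddings, and the SVM and ANN classifiers are left out.

```python
import math
from collections import Counter

def tokenize(text, stop_words=frozenset(), n=1):
    """Lowercase, split on whitespace, drop stop words, emit word n-grams."""
    words = [w for w in text.lower().split() if w not in stop_words]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_vectors(docs, **kw):
    """Return (vocabulary, one raw term-count vector per document)."""
    tokenized = [tokenize(d, **kw) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors

def tfidf_vectors(docs, **kw):
    """Weight counts by idf = ln(N / df). Assumed variant; others exist."""
    vocab, counts = count_vectors(docs, **kw)
    n_docs = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]

def binary_relevance_targets(label_sets):
    """Split a multi-label problem into one binary target vector per label."""
    labels = sorted({l for s in label_sets for l in s})
    return {l: [1 if l in s else 0 for s in label_sets] for l in labels}

# Hypothetical toy data: three documents, each with one or two labels.
docs = ["the band plays jazz", "the market news", "jazz festival in the news"]
label_sets = [{"music"}, {"news"}, {"music", "news"}]
stop = {"the", "in"}

vocab, counts = count_vectors(docs, stop_words=stop)
_, tfidf = tfidf_vectors(docs, stop_words=stop)
targets = binary_relevance_targets(label_sets)
bigrams = tokenize("the band plays jazz", stop_words=stop, n=2)
```

Under Binary Relevance, each entry of `targets` would be used to train a separate binary classifier on the same feature vectors, and the predicted label set for a document is the union of the labels whose classifier answers positively.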
