Domain Knowledge and Representation Learning for Centroid Initialization in Text Clustering with k-Means: An exploratory study

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Text clustering is the problem of partitioning texts into homogeneous clusters, for example by their sentiment value. Two techniques used to address the problem are representation learning, in particular language representation models, and clustering algorithms. State-of-the-art language models are based on neural networks, in particular the Transformer architecture, and are used to transform a text into a point in a high-dimensional vector space. The resulting points are then grouped using a clustering algorithm; a well-established partitional clustering algorithm is k-Means. Its goal is to find centroids that represent the clusters (partitions) by minimizing a distance measure. Two influential parameters of k-Means are the number of clusters and the initial centroids, and multiple heuristics exist for selecting them. Domain knowledge is commonly used as a heuristic when it is available, e.g., the number of clusters is set to the number of dataset labels. This project explores that idea further. The main contribution of the thesis is an investigation of domain knowledge and representation learning as a heuristic for centroid initialization in k-Means. Initial centroids were obtained by applying a representation learning technique to the dataset labels. The project analyzed a Swedish dataset with views on different aspects of Swedish immigration and a movie review dataset translated into Swedish, using six Swedish-compatible language models and two versions of k-Means. Clustering quality was evaluated using eight metrics related to cohesion, separation, external entropy and accuracy. The results show that the proposed heuristic had a positive impact on the metrics: six out of eight metrics improved compared to the baseline. The improvements were observed using six language models and k-Means on two datasets.
Additionally, the evaluation metrics indicated that the proposed heuristic leaves room for further improvement.
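
The heuristic described in the abstract, seeding k-Means with embeddings of the dataset labels, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `embed` function is a toy bag-of-words stand-in for the Transformer language models used in the thesis, the English example texts and label descriptions are invented, and the `kmeans` function is a plain Lloyd-style loop rather than the thesis's own implementation.

```python
from collections import Counter
import math


def embed(text, vocab):
    """Toy stand-in for a language model: L2-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, max_iter=100):
    """Lloyd-style k-Means that starts from the given initial centroids."""
    centroids = [list(c) for c in init_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical example data (not from the thesis datasets).
texts = [
    "a good and great movie",
    "great acting good story",
    "a bad and terrible movie",
    "terrible plot bad acting",
]
# Hypothetical textual descriptions of the two dataset labels.
labels = ["good great", "bad terrible"]

vocab = sorted({w for t in texts + labels for w in t.lower().split()})
points = [embed(t, vocab) for t in texts]
# Domain knowledge: one initial centroid per label, obtained by embedding the label.
init = [embed(label, vocab) for label in labels]

assignment, centroids = kmeans(points, init)
print(assignment)  # cluster index per text, in the order of `labels`
```

Because the number of clusters and the starting centroids both come from the labels, cluster index j can be read directly as label j, which is what makes accuracy-style external metrics straightforward to compute in this setup.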
