Exploring Short Text Clustering for Transactional Data

University essay from Uppsala universitet/Institutionen för informationsteknologi

Abstract: The digital revolution has led to an increase in digitization of transactional information. Due to the large amount of data, the transactions must be categorized so that an overview of spending can be obtained. To aid the process of manually classifying transactions, we consider clustering short text transactional data as a pre-processing step. If clusters have high homogeneity, then entire clusters, and hence multiple transactions, can be classified at once. We explore two short text clustering methods and evaluate them on real-world data in terms of execution time and clustering performance as judged by domain experts. In the evaluation results, the clusterings exhibit poor intra-cluster similarity (i.e. homogeneity) and are deemed unusable. One of the algorithms is extremely slow, but this is likely due to insufficient memory capacity of the evaluation environment. We conclude that the chosen methods are unsuitable for our purposes and discuss the properties that other clustering techniques should have in order to be suitable. We also discuss non-clustering approaches that may be suitable.
