Investigating Machine Learning Clustering Methods to Replicate the Human Idea of Structure to Documents

University essay from Lunds universitet/Matematik LTH

Abstract: Anyone trying to maintain a set of text documents in an information retrieval system will run into problems keeping it relevant and up to date as the amount of data increases. This thesis investigates how a collection of documents can be clustered in a way that resembles how a human would organize it. It also assesses how difficult it is to integrate such clustering into an existing information retrieval system with current programming libraries, and in what practical ways this can be useful. The text data in this project is represented by a TF-IDF model. A K-Means clustering algorithm generates one clustering, and a Support Vector Machine is trained with minimal user data to provide another clustering. These two are then evaluated and compared using a set of metrics. This project takes a practical approach to the problem, focusing on what can be implemented using existing programming libraries and what will actually work in a production environment. Software for visualizing the corpus and finding similar documents is implemented as well. The supervised method, SVM, greatly surpasses the unsupervised method, K-Means, in replicating the given ground truth, but both models are useful in their own right. With a relatively simple understanding of machine learning, any company could set up a similar system. It does, however, take some deeper mathematical knowledge and fine-tuning to get the most out of it and tailor it to the dataset.
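The pipeline described in the abstract (TF-IDF representation, one K-Means clustering, one SVM trained on a few labelled documents, then a metric-based comparison) can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the use of scikit-learn, the toy documents, the two hypothetical labels "billing" and "ops", the choice of one labelled example per group, and the adjusted Rand index as the comparison metric are all assumptions made here, since the abstract does not name the library, the data, or the metrics.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score
from sklearn.svm import SVC

# Toy corpus (assumption): two human-defined groups of documents.
docs = [
    "invoice payment overdue reminder",
    "payment received for invoice",
    "overdue invoice sent to customer",
    "server outage restart logs",
    "restart the server after outage",
    "logs show server error",
]
# Hypothetical ground truth, standing in for the human-made structure.
truth = ["billing", "billing", "billing", "ops", "ops", "ops"]

# Represent every document as a TF-IDF vector.
X = TfidfVectorizer().fit_transform(docs)

# Unsupervised: K-Means sees no labels, only the number of clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: a linear SVM trained on minimal user data,
# here one labelled document per group (an assumption).
train_idx = [0, 3]
svm = SVC(kernel="linear").fit(X[train_idx], [truth[i] for i in train_idx])
svm_labels = svm.predict(X)

# Compare both groupings against the ground truth.
ari_kmeans = adjusted_rand_score(truth, kmeans_labels)
ari_svm = adjusted_rand_score(truth, svm_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}, SVM ARI: {ari_svm:.2f}")
```

The adjusted Rand index compares two partitions regardless of how the clusters are named, which is why the integer cluster IDs from K-Means can be scored against string labels. On a real corpus the SVM predictions should be scored on held-out documents only, and the number of clusters for K-Means has to be chosen or tuned.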
