Analysis of Remarks Using Clustering and Keyword Extraction : Clustering Remarks on Electrical Installations and Identifying the Clusters by Extracting Keywords

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Companies today commonly collect and hold large amounts of data related to their business. This data is often too large to analyze by hand, so it is becoming increasingly common to automate the analysis, e.g. by applying machine learning methods. In this project we attempt to analyze an unstructured dataset consisting of remarks on electrical installations made by inspectors. We first cluster the dataset, with the goal that each cluster represents a specific type of error found in the field, and then extract ten keywords from each cluster. We investigate whether these keywords can represent the clusters’ contents in a way that could be useful for a future end-user application. The solution was evaluated using a form in which respondents were shown example remarks from a random subset of clusters and rated both how well the extracted keywords matched the examples and to what degree the example remarks from the same cluster represented the same kind of error. We received a total of 22 responses, 8 from professional inspectors and 14 from laymen. Our results show that the extracted keywords make sense in relation to the example remarks in the form and that the keywords show promise in describing the content of a cluster. In addition, for a majority of the clusters there is a clear consensus among the respondents on which keywords they considered relevant. However, the average number of keywords that the respondents considered relevant for each remark (1.40) was deemed too low for us to recommend the solution. The clustering quality follows the same pattern: it shows promise but does not quite give satisfactory results in this study.
For future work, a larger study should be conducted in which several combinations of clustering and keyword extraction methods are evaluated more thoroughly, in order to draw more decisive conclusions.
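
The two-step pipeline described in the abstract (cluster the remarks, then extract ten keywords per cluster) can be illustrated with a minimal Python sketch. The abstract does not name the clustering or keyword extraction methods used in the essay, so everything below is an assumption made for illustration: TF-IDF weighting, a simple cosine-similarity k-means, keywords ranked by summed TF-IDF weight, the function names, and the made-up English example remarks.

```python
import math
import random
import re
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    tokenised = [re.findall(r"\w+", d.lower()) for d in docs]
    df = Counter(term for toks in tokenised for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # Smoothed IDF; an assumed weighting, not taken from the essay.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vecs):
    """L2-normalised sum of sparse vectors, used as a cluster centroid."""
    total = defaultdict(float)
    for v in vecs:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, iterations=20, seed=0):
    """Cosine-similarity k-means run for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each remark to its most similar centroid.
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors]
        # Recompute centroids; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


def cluster_keywords(vectors, labels, k, top_n=10):
    """For each cluster, return up to top_n terms ranked by summed TF-IDF weight."""
    keywords = []
    for c in range(k):
        weights = Counter()
        for v, l in zip(vectors, labels):
            if l == c:
                weights.update(v)  # adds the per-term weights
        keywords.append([t for t, _ in weights.most_common(top_n)])
    return keywords


# Made-up example remarks, not taken from the essay's dataset.
remarks = [
    "Missing cover on junction box",
    "Junction box cover is broken",
    "Residual current device does not trip",
    "Residual current device missing in bathroom",
    "Earth conductor not connected in socket",
    "Socket lacks earth conductor connection",
]
vectors = tfidf_vectors(remarks)
labels = kmeans(vectors, k=3)
keywords = cluster_keywords(vectors, labels, k=3)
for c, kw in enumerate(keywords):
    print(c, kw)
```

The number of clusters and the random seed are free parameters in this sketch, and k-means can settle on a poor grouping for an unlucky initialisation, which is one reason a human evaluation like the one described above is useful.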
