Text Curation for Clustering of Free-text Survey Responses

University essay from Linköpings universitet/Institutionen för datavetenskap

Abstract: When issuing surveys, offering free-text answer fields is only feasible when the number of respondents is small, as the work required to summarize the answers becomes unmanageable with a large number of responses. Using NLP techniques to cluster and summarize these answers would allow a greater range of survey creators to incorporate free-text questions in their surveys without making their workload too large. Academic work in this domain is sparse, especially for smaller languages such as Swedish. The Swedish company iMatrics is regularly hired to do this kind of summarizing, specifically for workplace-related surveys. Their clustering method has been semi-automatic, requiring both manual preprocessing and manual postprocessing. This thesis explores whether more advanced, unsupervised NLP text representation methods, namely SentenceBERT and Sent2Vec, can improve upon these results and reduce the manual work needed for the task. Specifically, three questions are to be answered. Firstly, do the methods show good results? Secondly, can they remove the time-consuming postprocessing step of combining a large number of clusters into a smaller number? Lastly, can a model be found for which unsupervised learning metrics correlate with its real-world usability, indicating that these metrics can be used to optimize the model for new data? To answer these questions, several models (Sent2Vec, SentenceBERT, and traditional baseline models) are trained and then compared using both internal and external metrics. A manual evaluation procedure is performed to assess the real-world usability of the clusterings, both to gauge how well the models perform and to see whether this result correlates with the internal clustering metrics. The results indicate that improving the text representation step alone is not sufficient to fully automate this task. Some of the models show promise in the human evaluation, but given the unsupervised nature of the problem and the large variance between models, it is difficult to predict performance on new data. Thus, the models can serve as an improvement to the workflow, but the need for manual work remains.
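
To illustrate the kind of pipeline the abstract describes (embedding free-text answers, clustering the embeddings, and scoring the clustering with an internal metric), a minimal Python sketch follows. It assumes the sentence-transformers and scikit-learn libraries, an illustrative multilingual SentenceBERT checkpoint, placeholder Swedish answers, and an arbitrary cluster count; it is not the thesis's actual models, data, or configuration.

    # Minimal sketch: embed free-text survey answers with a SentenceBERT model,
    # cluster the embeddings, and score the result with an internal metric.
    # Model name, cluster count, and example answers are illustrative assumptions.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Hypothetical Swedish free-text survey responses (placeholder data).
    answers = [
        "Mer flexibla arbetstider skulle uppskattas.",
        "Jag vill kunna jobba hemifran oftare.",
        "Lonen borde ses over.",
        "Battre kommunikation fran ledningen behovs.",
    ]

    # A multilingual SentenceBERT checkpoint; the thesis's exact models may differ.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode(answers)

    # Cluster the sentence embeddings; k is chosen arbitrarily here.
    k = 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

    # Internal metric (needs no ground-truth labels), the kind of score the thesis
    # compares against human judgments of how usable the clusters are in practice.
    print("Silhouette score:", silhouette_score(embeddings, labels))

In practice, such an internal score would be computed per model and cluster count and then compared against the manual evaluation, which is the correlation question the thesis investigates.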
