Using unsupervised classification with multiple LDA derived models for text generation based on noisy and sensitive data

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Lucas Ljungberg; [2019]

Keywords: ;

Abstract: Creating models to generate contextual responses to input queries is a difficult problem. It is even more difficult when available data contains noise and sensitive data. Finding models or methods to handle such issues is important in order to use data for productive means.This thesis proposes a model based on a cooperating pair of Topic Models of differing tasks (LDA and GSDMM) in order to alleviate the problematic properties of data. The model is tested on a real-world dataset with these difficulties as well as a dataset without them. The goal is to 1) look at the behaviour of the different topic models to see if their topical representation of the data is of use as input or output to other models and 2) find out what properties can be alleviated as a result.The results show that topic modeling can represent the semantic information of documents well enough to produce well-behaved input data for other models, which can also deal well with large vocabularies and noisy data. The topical clustering of the response data is sufficient enough for a classification model to predict the context of the response, from which valid responses can be created.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)