A Document Recommender Based on Word Embedding

University essay from KTH/Skolan för elektro- och systemteknik (EES)


With the rapid development of information technology, text is no longer confined to paper; it also exists in digital form and is distributed all over the internet. The massive amount of information online gives us many options, but it also makes it hard to find exactly the information we need. Media monitoring aims to address this problem. Meltwater group, a media monitoring company, provides enterprises with a service for tracking and sorting information and helps them achieve their business goals. These goals may include finding the best time or place to run a business campaign and keeping up with competitors' activities.

Meltwater has a recommender system. When a query is submitted, the corresponding documents retrieved from the database are presented. The problem with the system is that some of the documents turn out to be misclassified, and the rate of correct recommendations is not very high. To help solve this problem and improve the search, this paper introduces a new algorithm based on a word embedding approach and users' supervision. The paper begins with background on Meltwater group and the existing framework of its recommender system. This is followed by a survey of background methods, including LSA (Latent Semantic Analysis), Random Indexing and Word2vec. The necessary tools, such as t-SNE, K-means clustering and hierarchical clustering, are also covered in this part.

The data sets used in this paper are described after the background methods, including an introduction to the data and a detailed account of how it is processed. The algorithm is then described step by step in the middle of the paper, followed by the evaluation. The algorithm is evaluated on several different data sets, with the confusion matrix used as the means of measurement. Finally, the paper ends with a summary of the method and suggestions for future work.
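
As a rough illustration of how a confusion matrix can serve as a measurement for a recommender, the following minimal Python sketch counts the four cells for binary relevance labels and derives precision and recall from them. The labels, the function names and the choice of precision and recall are assumptions made for this example only; they are not taken from the essay, which may define its evaluation differently.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels.

    Assumed convention: 1 = document is relevant to the query, 0 = not relevant.
    """
    pairs = Counter(zip(y_true, y_pred))
    return {
        "tp": pairs[(1, 1)],  # relevant and recommended
        "fp": pairs[(0, 1)],  # not relevant but recommended
        "fn": pairs[(1, 0)],  # relevant but not recommended
        "tn": pairs[(0, 0)],  # not relevant and not recommended
    }


def precision_recall(counts):
    """Derive precision and recall, returning 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for eight documents returned for one query.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

Here a false positive corresponds to a document that was recommended although it does not match the query, which is the kind of misclassification the abstract describes.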
