Text Analysis - Exploring latent semantic models for information retrieval, topic modeling and sentiment detection

University essay from Chalmers tekniska högskola/Institutionen för data- och informationsteknik

Authors: Adam Luotonen; Erik Jalsborn [2011]


Abstract: With the increasing use of the Internet and social media, the amount of available data has exploded. As most of this data is natural language text, there is a need for efficient text analysis techniques which enable extraction of useful data. This process is called text mining, and in this thesis some of these techniques are evaluated for the purpose of integrating them into the visual data mining software TIBCO Spotfire®.

In total, five analysis models with different running time, memory use and performance have been analyzed, implemented and evaluated. The tf-idf vector space model was used as a baseline. It can be extended using Latent Semantic Analysis and random projection to find latent semantic relationships between documents. Finally, Latent Dirichlet Allocation (LDA), the Joint Sentiment/Topic model (JST) and Sentiment Latent Dirichlet Allocation (SLDA) are used to extract topics. The latter two are extensions to LDA which also detect positive and negative sentiment.
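
To make the baseline concrete, the following is a minimal sketch of a tf-idf vector space model with cosine similarity. It is not taken from the thesis: the weighting variant (raw term frequency times log(N/df)), the function names `tfidf_vectors` and `cosine`, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two camera reviews and one hotel review.
docs = [
    "the camera takes sharp photos".split(),
    "sharp photos from a small camera".split(),
    "the hotel room was noisy".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # two camera reviews
sim_unrelated = cosine(vecs[0], vecs[2])  # camera review vs. hotel review
```

In this toy corpus the two camera reviews share several terms and therefore score higher than the camera and hotel pair. Latent semantic models aim to capture such relationships even when the documents share no literal terms.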

Evaluation was done using the perplexity measure for topic modeling, average precision for searching and classification accuracy of positive and negative reviews for the sentiment models. It was concluded that for searching, a vector space model with tf-idf weighting performed similarly to the latent semantic models on the test corpus used. Topic modeling was shown to provide useful output, though at the expense of running time. The JST and SLDA sentiment detectors showed a small improvement compared to a baseline word counting classifier, especially for a multiple domain dataset. Finally, it was shown that their sentiment classification accuracy varied from run to run, indicating that further investigation is warranted.
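
As an illustration of the search evaluation measure, the sketch below computes average precision for a single query under the usual definition: the mean of the precision values at each rank where a relevant document appears, divided by the total number of relevant documents. The function name `average_precision` and the document identifiers are hypothetical, and the thesis may use a different variant of the measure.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical ranking: relevant documents appear at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Here the relevant documents are found at ranks 1 and 3, so the score is (1/1 + 2/3) / 2. A ranking that places all relevant documents first would score 1.0.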
