Topic modeling on a classical Swedish text corpus of prose fiction: Hyperparameters’ effect on theme composition and identification of writing style

University essay from Uppsala universitet/Statistiska institutionen

Abstract: A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus to investigate how changes in hyperparameter values affect the topics, in terms of the distribution of words within topics and of topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see whether they are more likely to occur in texts by a particular author and whether the model could be used to identify writing style. With this new approach, the ability of the smoothed LDA model as a writing style identifier is explored. Although the texts analyzed in this thesis are unusually long, as they are unsegmented prose fiction, the effect of the hyperparameters on model performance was found to be similar to that reported in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA model was shown to be a good model for identification of writing style.

Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.
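As an illustration of where the two hyperparameters enter a smoothed LDA model, the following is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is not the implementation used in the thesis. The function name `lda_gibbs`, its parameters, and the choice of symmetric priors (`alpha` on the topic proportions of each document, `beta` on the word distribution of each topic) are assumptions made for this example only.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha, beta, iters=200, seed=0):
    """Collapsed Gibbs sampler for smoothed LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    alpha: symmetric Dirichlet prior on document-topic proportions (assumed).
    beta: symmetric Dirichlet prior on topic-word distributions (assumed).
    Returns (theta, phi): estimated document-topic and topic-word probabilities.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # token counts per topic
    assign = []                                                 # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(n + beta) / (topic_total[k] + vocab_size * beta) for n in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi
```

In this sketch, `alpha` is added to the per-document topic counts and `beta` to the per-topic word counts, so varying them changes how topics are spread within a document and how words are spread within a topic. Those are the two distributions the first part of the thesis examines.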
