Authorship classification using the Vector Space Model and kernel methods

University essay from Uppsala universitet/Statistiska institutionen

Abstract: Authorship identification is the field of classifying a given text by its author based on the assumption that authors exhibit unique writing styles. This thesis investigates the semantic shortcomings of the vector space model by constructing a semantic kernel created from WordNet which is evaluated on the problem of authorship attribution. A multiclass SVM classifier is constructed using the one-versus-all strategy and evaluated in terms of precision, recall, accuracy and F1 scores. Results show that the use of the semantic scores from WordNet degrades the performance compared to using a linear kernel. Experiments are run to identify the best feature engineering configurations, showing that removing stopwords has a positive effect on the financial dataset Reuters while the Kaggle dataset consisting of short extracts of horror stories benefit from keeping the stopwords.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)