Comparison of VADER and Pre-Trained RoBERTa: A Sentiment Analysis Application

University essay from Lunds universitet/Statistiska institutionen

Abstract:

Purpose: This study examines how the overall sentiment scores from VADER and a pre-trained RoBERTa model differ. It investigates potential differences in the median and shape of the two score distributions.

Data: The sample consists of the sustainability reports of 50 independently and randomly selected companies. Six of these were non-responses, so the reports of 44 companies are included in the study. The investigated sample contains 320 paragraphs in total, and the number of words per paragraph ranges from 16 to 234.

Methods: VADER is a dictionary- and rule-based sentiment analyzer built on a combination of five heuristics and a dictionary that maps lexical features to sentiment intensity. The model is accessed through the NLTK library in Python. The algorithm returns four scores: positive, neutral, negative and compound. RoBERTa is a variation of the BERT model, which is based on transformers and uses self-attention to relate each word to the other words in a text and thereby capture context. A pre-trained version of the model is used in this study. The model returns three values: positive, neutral and negative. A fourth, overall sentiment score (the RoBERTa polarity score) is computed from these for comparison with VADER’s compound score.

Results: A two-sample Kolmogorov-Smirnov test indicates that the two scores are drawn from different distributions. Furthermore, a Wilcoxon signed-rank test indicates that the median of the differences between the VADER compound score and the RoBERTa polarity score is not zero. In other words, there is a difference in location. The general conclusion is that the scores differ both when location and shape are considered together and when only location is considered.
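The comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the essay's actual code: the abstract does not say how the overall RoBERTa score is computed, so `polarity_score` (positive minus negative probability) is a hypothetical choice, and all numbers below are made-up toy values rather than data from the study. The two helper functions compute only the test statistics; in practice a library routine such as `scipy.stats.ks_2samp` or `scipy.stats.wilcoxon` would also supply p-values.

```python
from bisect import bisect_right


def polarity_score(p_pos, p_neg):
    """Hypothetical overall score in [-1, 1]: P(positive) - P(negative).

    Assumption for illustration only; the abstract does not give the formula.
    """
    return p_pos - p_neg


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDFs of xs and ys, evaluated at every observed value."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    n, m = len(xs_sorted), len(ys_sorted)
    d = 0.0
    for v in xs_sorted + ys_sorted:
        fx = bisect_right(xs_sorted, v) / n  # share of xs <= v
        fy = bisect_right(ys_sorted, v) / m  # share of ys <= v
        d = max(d, abs(fx - fy))
    return d


def wilcoxon_w_plus(xs, ys):
    """Wilcoxon signed-rank W+: sum of the ranks of the positive paired
    differences. Zero differences are dropped; tied |differences| share
    their mean rank."""
    diffs = sorted((x - y for x, y in zip(xs, ys) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = mean_rank
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)


# Toy values (NOT from the study): one VADER compound score and one
# (positive, neutral, negative) RoBERTa triple per paragraph.
vader_compound = [0.5, 0.75, 0.25, 0.875]
roberta_probs = [
    (0.5, 0.25, 0.25),
    (0.75, 0.125, 0.125),
    (0.25, 0.25, 0.5),
    (0.5, 0.375, 0.125),
]
roberta_polarity = [polarity_score(pos, neg) for pos, _neu, neg in roberta_probs]

print("KS statistic:", ks_statistic(vader_compound, roberta_polarity))
print("Wilcoxon W+:", wilcoxon_w_plus(vader_compound, roberta_polarity))
```

The two statistics answer different questions, which mirrors the abstract's two conclusions: the Kolmogorov-Smirnov statistic reacts to any difference between the two distributions (location and shape combined), whereas the signed-rank statistic uses the paragraph-level pairing and reacts only to a shift in location.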
