A comparative study of word embedding methods for early risk prediction on the Internet

University essay from Uppsala universitet/Institutionen för lingvistik och filologi

Abstract: We built a system to participate in the eRisk 2019 T1 Shared Task. The aim of the task was to evaluate systems for early risk prediction on the internet, in particular to identify users suffering from eating disorders as accurately and quickly as possible given their history of Reddit posts in chronological order. In the controlled settings of this task, we also evaluated the performance of three different word representation methods: random indexing, GloVe, and ELMo. We discuss our system's performance, including in light of the scores obtained by other teams in the shared task. Our results show that our two-step learning approach was quite successful, and we obtained good scores on the early risk prediction metric ERDE across the board. Contrary to our expectations, we did not observe a clear-cut advantage of contextualized ELMo vectors over the commonly used and much more lightweight GloVe vectors. Our best model in terms of F1 score turned out to be a model with GloVe vectors as input to the text classifier and a multi-layer perceptron as user classifier. The best ERDE scores were obtained by the model with ELMo vectors and a multi-layer perceptron. The model with random indexing vectors struck a good balance between precision and recall in the early processing stages but was eventually surpassed by the models with GloVe and ELMo vectors. We put forward some possible explanations for the observed results and propose some improvements to our system.
