EVALUATION OF MACHINE LEARNING ALGORITHMS FOR SMS SPAM FILTERING

University essay from Umeå universitet/Institutionen för datavetenskap

Author: David Bäckman; [2019]

Abstract: The purpose of this thesis is to evaluate different machine learning algorithms and text representation methods in order to determine which are best suited to distinguishing spam SMS from legitimate SMS. A data set containing 5573 real SMS messages has been used to train the algorithms K-Nearest Neighbor, Support Vector Machine, Naive Bayes and Logistic Regression. The methods used to represent text are Bag of Words, Bigram and Word2Vec. In particular, it has been investigated whether semantic text representations can improve classification performance. A total of 12 combinations (four algorithms, each with three text representations) have been evaluated using the metrics accuracy and F1-score. The results show that Logistic Regression together with Bag of Words reaches the highest accuracy and F1-score. Bigram as a text representation seems to work worse than the other methods. Word2Vec can increase the performance of K-Nearest Neighbor but not of the other algorithms.
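
To make the terms in the abstract concrete, the following is a minimal Python sketch of two of the text representations (Bag of Words and Bigram) and the two metrics (accuracy and F1-score). It is an illustration only, not the thesis implementation: the whitespace tokenizer, the function names, the "spam"/"ham" labels and the example messages are all assumptions made here, and the abstract does not state which language or libraries were used.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def bigrams(text):
    """Count adjacent word pairs, which preserves some local word order."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive="spam"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical message and labels, for illustration only.
    print(bag_of_words("Free prize free entry"))
    print(bigrams("win a free prize"))
    y_true = ["spam", "ham", "ham", "spam"]
    y_pred = ["spam", "ham", "spam", "ham"]
    print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

Accuracy alone can look good on a data set where most messages are legitimate, which is one common reason to report F1-score for the spam class alongside it.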
