Language IndependentDetector for Auto GeneratedTweets

University essay from Linnéuniversitetet/Institutionen för datavetenskap (DV)

Abstract: The cross-disciplinary Nordic Tweet Stream (NTS) is a project aiming at creating a multilingual text corpus consisting of tweets published in the five Nordic countries. The NTS linguists are explicitly interested in tweets having a text formulated by a human where each tweet is a personal statement, not in Tweets generated by bots and other programs or apps since they might skew the results. NTS consists of multiple parts and the part we are responsible for is a language-independent approach, using supervised machine learning, to classify every single tweet as auto-generated (AGT) or human-generated (HGT). The objective of this study is to increase data accuracy in sociolinguistic studies that utilize Twitter by reducing skewed sampling and inaccuracies in linguistic data. We define an AGT as a tweet where all or parts of the natural language content are generated automatically by a bot or other type of program. In other words, while AGT/HGT refers to an individual message, the term bot refers to nonpersonal and automated accounts that post content to online social networks. Our approach classifies a tweet using only metadata that comes with every tweet, and we utilize those metadata parameters that are both language and country independent. The empirical part shows that our results show poor success rates when it comes to unseen data. Using a bilingual training set of two languages tweets, we correctly classified only about 60-70% of all tweets in a test set using a third new language, which is still better than nothing, but probably not good enough to be used (as is) in a real-world scenario to identify AGTs in a given set of multilingual tweets.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)