Comparing Text Classification Libraries in Scala and Python : A comparison of precision and recall

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: In today’s internet era, more text than ever is being uploaded online. The text comes in many forms, such as social media posts, business reviews, and many more. For various reasons, there is an interest in analyzing the uploaded text. For instance, an airline business could ask their customers to review the service they have received. The feedback would be collected by asking the customer to leave a review and a score. A common scenario is a review with a good score that contains negative aspects. It is preferable to avoid a situation where the entirety of the review is regarded as positive because of the score if there are negative aspects mentioned. A solution to this would be to analyze each sentence of a review and classify it by negative, neutral or, positive depending on how the sentence is perceived.  With the amount of text uploaded today, it is not feasible to manually analyze text. To automatically classify text by a set of criteria is called text classification. The process of specifically classifying text by how it is perceived is a subcategory of text classification known as sentiment analysis. Positive, neutral and, negative would be the sentiments to classify.  The most popular frameworks associated with the implementation of sentiment analyzers are developed in the programming language Python. However, over the years, text classification has had an increase in popularity. The increase in popularity has caused new frameworks to be developed in new programming languages. Scala is one of the programming languages that has had new frameworks developed to work with sentiment analysis. However, in comparison to Python, it has fewer available resources. Python has more available libraries to work with, available documentation, and community support online. There are even fewer resources regarding sentiment analysis in a less common language such as Swedish. The problem is no one has compared a sentiment analyzer for Swedish text implemented using Scala and compared it to Python. The purpose of this thesis is to compare recall and precision of a sentiment analyzer implemented in Scala to Python. The goal of this thesis is to increase the knowledge regarding the state of text classification for less common natural languages in Scala.  To conduct the study, a qualitative approach with the support of quantitative data was used. Two kinds of sentiment analyzers were implemented in Scala and Python. The first classified text as either positive or negative (binary sentiment analysis), the second sentiment analyzer would also classify text as neutral (multiclass sentiment analysis). To perform the comparative study, the implemented analyzers would perform classification on text with known sentiments. The quality of the classifications was measured using their F1-score.  The results showed that Python had better recall and quality for both tasks. In the binary task, there was not as large of a difference between the two implementations. The resources from Python were more specialized for Swedish and did not seem to be as affected by the small dataset used as the resources in Scala. Scala had an F1-score of 0.78 for binary sentiment analysis and 0.65 for multiclass sentiment analysis. Python had an F1-score of 0.83 for binary sentiment analysis and 0.78 for multiclass sentiment analysis. 

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)