A Comparison of Resampling Techniques to Handle the Class Imbalance Problem in Machine Learning : Conversion prediction of Spotify Users - A Case Study

University essay from KTH/Skolan för datavetenskap och kommunikation (CSC)

Author: Michelle Jagelid; Maria Movin; [2017]

Abstract: Spotify uses a freemium business model, meaning that it has two main products: a limited free tier and a premium tier for paying customers. In this study we investigated the ability of machine learning models, given user activity data, to predict conversion from free to premium. Predicting which users convert from free to premium was a class-imbalanced problem, meaning that the ratio of converters to non-converters was skewed. Three classification methods were investigated: logistic regression, decision trees, and gradient boosting trees. We also studied whether different resampling methods, which balance the training datasets, can improve the classification performance of the models. We showed that machine learning models are able to find patterns in user data that can be used to predict conversion. Additionally, for all the classification methods we investigated, we showed that resampling increased model performance. The best-performing methods in our study were logistic regression and gradient boosting trees, each trained on data oversampled until the numbers of converters and non-converters were equal.
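
The oversampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which oversampling variant was used, so plain random duplication is assumed here, and the function name `oversample_to_balance`, the toy data, and the 5-to-95 class split are hypothetical. The sketch duplicates randomly chosen minority-class rows until every class is as large as the largest one.

```python
import random
from collections import Counter


def oversample_to_balance(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes
    have as many rows as the largest class (random oversampling)."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the existing ones.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)

    # Shuffle so that the classes are interleaved.
    order = list(range(len(out_rows)))
    rng.shuffle(order)
    return [out_rows[i] for i in order], [out_labels[i] for i in order]


# Hypothetical toy training set: 5 converters (label 1), 95 non-converters (label 0).
rows = [[i] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

balanced_rows, balanced_labels = oversample_to_balance(rows, labels)
counts = Counter(balanced_labels)
print(counts[0], counts[1])
```

As in the abstract, resampling of this kind is applied to the training data only; a classifier such as logistic regression would then be fitted on `balanced_rows` and `balanced_labels`.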