Predicting user churn using temporal information : Early detection of churning users with machine learning using log-level data from a MedTech application

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: User retention is a critical aspect of any business or service. Churn is the continuous loss of active users. A low churn rate enables companies to focus more resources on providing better services rather than on recruiting new users. Current published research on predicting user churn disregards the time of day and the temporal variability of events and actions, either through feature selection or through data preprocessing. This thesis empirically investigates the practical benefits of including accurate temporal information for binary prediction of user churn by training a set of Machine Learning (ML) classifiers on differently prepared data. One data preparation approach was based on temporally sorted logs (log-level data set), and the other on stacked aggregations (aggregated data set) with additional engineered temporal features. The additional temporal features included information about relative time, time of day, and temporal variability. The inclusion of the temporal information was evaluated by training and evaluating the classifiers with the different features on a real-world data set from a MedTech application. Artificial Neural Networks (ANNs), Random Forests (RFs), Decision Trees (DTs) and naïve approaches were applied and benchmarked. The classifiers were compared using, among other metrics, the Area Under the Receiver Operating Characteristic Curve (AUC), the Positive Predictive Value (PPV, a.k.a. precision) and the True Positive Rate (TPR, a.k.a. recall). The PPV is the proportion of positive predictions that are correct, the TPR is the proportion of the actual positive class that is recognized, and the AUC is a metric of general performance. The results demonstrate a statistically significant benefit of including temporal variation features overall and, in particular, that the classifiers performed better on the log-level data set. An ANN trained on temporally sorted logs performs best, followed by an RF on the same data set.
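
The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made purely for illustration. The AUC is computed here through its rank interpretation, that is, the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def ppv_tpr(y_true, y_pred):
    """Return (PPV, TPR), i.e. (precision, recall), for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return ppv, tpr


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 is the positive class, 0 the negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2]

# Assumed decision threshold of 0.5 to turn scores into hard predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

ppv, tpr = ppv_tpr(y_true, y_pred)
roc_auc = auc(y_true, scores)
print(f"PPV={ppv:.3f} TPR={tpr:.3f} AUC={roc_auc:.3f}")
```

PPV and TPR depend on the chosen threshold, whereas the AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is why it serves as the general performance metric.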
