High-risk Consumer Credit Scoring using Machine Learning Classification

University essay from Lunds universitet/Matematisk statistik

Abstract: The use of statistical models in credit rating and application scorecard modelling is a thoroughly explored field within the financial sector and a central component of a credit institution’s underlying business model. The aim of this report was to apply and compare six different machine learning models for predicting credit defaults on high-risk consumer credits, using a data set provided by a Swedish consumer credit institute. The selected models include those most frequently used for scorecard modelling across the banking industry, as well as some less commonly used ones that could potentially add valuable insights. Each model is briefly introduced and its most important concepts are explained, along with how white-boxing methods can address the lack of transparency in complex models. Appropriate metrics for evaluating prediction performance on imbalanced and insufficient data sets are discussed, as well as how different oversampling techniques might increase model performance. All available information about the loan applicants was then exhaustively examined, and a carefully refined set of input features was constructed with the aim of optimal predictive power and generalizability. After tuning and testing, logistic regression, support vector machine, neural network and a soft voting ensemble performed similarly when given the same input feature configurations. Attempts to create synthesized samples to handle the imbalance problem showed no effect and were therefore not used. The white-boxing method SHAP showed a promising ability to explain the underlying decision basis of complex models such as neural networks. However, considering the limited data set at hand, the recommended model is logistic regression, given its simplicity and its performance on a par with the other models. With larger amounts of data available, on the other hand, the more complex models such as neural networks and support vector machines could have a potential advantage.
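
To illustrate the soft voting ensemble mentioned in the abstract, the following is a minimal sketch, not the essay's implementation. It assumes that soft voting here means averaging each model's predicted default probability and thresholding the mean. The function name `soft_vote`, the 0.5 threshold and all probability values are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def soft_vote(prob_by_model, threshold=0.5):
    """Average per-model default probabilities and threshold the mean.

    prob_by_model maps a model name to a list of predicted default
    probabilities, one per applicant, in the same applicant order.
    """
    n_samples = len(next(iter(prob_by_model.values())))
    averaged = [
        fmean(probs[i] for probs in prob_by_model.values())
        for i in range(n_samples)
    ]
    # Label 1 means "predicted default"; the threshold is an assumption.
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Hypothetical probabilities for three applicants (not data from the essay).
probabilities = {
    "logistic_regression": [0.10, 0.60, 0.35],
    "support_vector_machine": [0.20, 0.70, 0.30],
    "neural_network": [0.15, 0.80, 0.55],
}

averaged, labels = soft_vote(probabilities)
print(averaged, labels)
```

In this made-up example the neural network alone would flag the third applicant, but the averaged probability stays below the threshold, so the ensemble does not. On an imbalanced data set the threshold would typically be tuned rather than fixed at 0.5.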
