Prediction models for soccer sports analytics

University essay from Linköpings universitet/Databas och informationsteknik

Author: Edward Nsolo; [2018]

Keywords: ;

Abstract: In recent times there has been a substantial increase in research interest of soccer due to an increase of availability of soccer statistics data. With the help of data provider firms, access to historical soccer data becomes more simple and as a result data scientists started researching in the field. In this thesis, we develop prediction models that could be applied by data scientists and other soccer stakeholders. As a case study, we run several machine learning algorithms on historical data from five major European leagues and make a comparison. The study is built upon the idea of investigating different approaches that could be used to simplify the models while maintaining the correctness and the robustness of the models. Such approaches include feature selection and conversion of regression prediction problems to binary classification problems. Furthermore, a literature review study did not reveal research attempts about the use of a generalization of binary classification predictions that applies different target class upper boundaries other than 50% frequency binning. Thus, this thesis investigated the effects of such generalization against simplicity and performance of such models. We aimed to extend the traditional discretization of classes with equal frequency binning function which is standard for converting regression problems into the binary classification in many applications. Furthermore, we ought to establish important players’ features in individual leagues that could help team managers to have cost-efficient transferring strategies. The approach of selecting those features was achieved successfully by the application of wrapper and filter algorithms. Both methods turned out to be useful algorithms as the time taken to build the models was minimal, and the models were able to make good predictions. Furthermore, we noticed different features matter for different leagues. Therefore, in accessing the performance of players, such consideration should be kept in mind. Different machine learning algorithms were found to behave differently under different conditions. How-ever, Naïve Bayes was determined to be the best-fit in most cases. Moreover, the results suggest that it is possible to generalize binary classification problems and maintain the performance to a reasonable extent. But, it should be observed that the early stages of generalization of binary classification models involve a tedious work of training datasets, and that fact should be a tradeoff when thinking to use this approach.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)