Comparative Study of tools for Statistical Learning: Comparing the performance of data science languages and frameworks

University essay from KTH/Datavetenskap

Author: Alexander Westman; Adam Henriksson; [2022]


Abstract: Statistical learning is a sub-field of computer science that studies algorithms for solving the statistical inference problem of finding a predictive function from data. It is used in fields such as finance, computer vision, and bioinformatics. Because the field is both relatively new and highly attractive, there is value in comparative research on topics that affect new developers entering it. There is currently a lack of research comparing the usability and performance of entry-level tools for statistical learning, especially on the ARM hardware architecture. This study compares the execution time and scalability of popular tools for statistical learning: Scikit-learn for Python, JuliaSTAT for Julia, and R-stats for R. Each framework’s implementation of the statistical method "general strategy for regression analysis" was compared on execution time across a varying number of observations and regressors, using data with varying properties. While Julia was not the best performer in all cases, it handled growth in input size best, by a significant margin, as the number of observations and regressors increased. R with R-stats and Python with Scikit-learn performed significantly worse. On the largest measured dataset, the R-stats implementation took 10x as long to execute as JuliaSTAT, and the Scikit-learn implementation took 100x as long.
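
The measurement described in the abstract, timing a regression fit while the number of observations and regressors varies, can be illustrated with a minimal sketch. This is not the benchmark used in the essay: the function names (make_data, fit_ols, time_fit), the synthetic Gaussian data, and the best-of-three timing rule are all assumptions made for illustration, and a plain-Python least-squares solver stands in for the frameworks being compared.

```python
import random
import time


def make_data(n_obs, n_reg, seed=0):
    """Generate synthetic regressors X and a response y = X * beta + noise."""
    rng = random.Random(seed)
    beta = [rng.uniform(-1.0, 1.0) for _ in range(n_reg)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_reg)] for _ in range(n_obs)]
    y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 0.01)
         for row in X]
    return X, y, beta


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Hypothetical stand-in for a framework's regression routine.
    """
    p = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        s = sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (A[i][p] - s) / A[i][i]
    return b


def time_fit(fit, n_obs, n_reg, repeats=3):
    """Return the best wall-clock time in seconds of `fit` over `repeats` runs."""
    X, y, _ = make_data(n_obs, n_reg)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Sweep a small grid of observation and regressor counts.
    for n_obs in (100, 1000):
        for n_reg in (2, 8):
            print(n_obs, n_reg, f"{time_fit(fit_ols, n_obs, n_reg):.4f}s")
```

To compare real tools, the fit argument could be replaced with a call into the framework under test, keeping data generation outside the timed region so that only the fitting step is measured. Taking the minimum over several runs is one common way to reduce noise from other processes; the essay may have used a different rule.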
