Acceleration of Machine-Learning Pipeline Using Parallel Computing

University essay from Uppsala universitet/Signaler och system

Abstract: Researchers in Lund have studied the classification of images into three categories (faces, landmarks and objects) from EEG data [1]. They used support vector machines (SVMs) to distinguish between the three categories [2, 3]. The scripts written for these computations were highly parallelizable and could potentially be optimized to run much faster. They were originally written in MATLAB, which is proprietary software and not the most popular language for machine learning. The aim of this project is to translate the MATLAB code from the aforementioned Lund project to Python and to optimize and parallelize it in order to reduce the execution time. Since much other data science work is also moving to Python, a key part of this project was to understand the differences between MATLAB and Python and how to translate MATLAB code to Python. With the exception of the preprocessing scripts, all the original MATLAB scripts were translated to Python. The translated Python scripts were optimized for speed and then parallelized to decrease the execution time even further. Two major parallel implementations of the Python scripts were made. One used the Ray framework to compute in the cloud [4]. The other used the Accelerator, a framework for computing with local threads [5]. After translation, the code was tested against the original results and profiled to find major inefficiencies, for example functions that took unnecessarily long to execute. After optimization, the single-threaded script was twelve times faster than the original MATLAB script. The final execution times were around 12–15 minutes, which is about 200 times faster than the 48-hour benchmark. The benchmark of the original code used fewer iterations than the researchers did, which reduced its computation time from a week to 48 hours.
The results of the project highlight the importance of learning and teaching basic profiling of slow code. Although it was not fully covered in this project, complexity analysis of code is important as well. Future work includes a deeper complexity analysis at both a high and a low level, since a high-level language such as Python relies heavily on modules written in low-level code. Future work also includes an in-depth analysis of the NumPy source code, as the current code relies heavily on NumPy, which has been shown to be a bottleneck in this project.
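
The kind of basic profiling mentioned above can be illustrated with a minimal sketch using Python's standard `cProfile` and `pstats` modules. This is not code from the essay: the functions `slow_pairwise_sum`, `fast_pairwise_sum` and `profile_call` are hypothetical stand-ins, chosen only to show how a profiler report points at a function that takes unnecessarily long and how a lower-complexity rewrite can give the same result.

```python
import cProfile
import io
import pstats


def slow_pairwise_sum(values):
    """Hypothetical slow step: sums a*b over all pairs in quadratic time."""
    total = 0.0
    for a in values:
        for b in values:
            total += a * b
    return total


def fast_pairwise_sum(values):
    """Same quantity in linear time: sum_i sum_j a_i*a_j == (sum_i a_i)**2."""
    s = sum(values)
    return s * s


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive calls are listed first.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    data = [float(i) for i in range(300)]
    result, report = profile_call(slow_pairwise_sum, data)
    print(report)
```

Reading the report for the slow version and then comparing it with the linear-time rewrite is a small-scale example of combining profiling with complexity analysis.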
