Big Data and Analytics with Driving Data: Implementation and Analysis of Data Pipeline and Data Processing Resources

University essay from Uppsala universitet/Institutionen för informationsteknologi

Authors: Ivar Blohm; Erik Jarvis [2023]


Abstract: This thesis project was conducted in cooperation with Zenseact to investigate whether Google BigQuery can be used to store large time-series data and provide insights into it. An end-to-end data pipeline was built to move data from Zenseact's local servers and ingest it into BigQuery. Due to the large size and distribution of the data, the pipeline was parallelized across multiple servers to handle local data transformation and cleansing before upload. To provide a baseline for BigQuery's performance, an alternative on-premises solution was constructed using Apache Spark. The two options were compared in terms of time and computational cost, financial cost, query performance, and user experience, with test queries run on both platforms. BigQuery ran shorter queries faster and at a relatively low cost, while providing compatibility with useful Google Cloud Platform tools. Spark, however, ran larger queries faster, albeit with a large dedication of local resources and a less user-friendly experience. Additionally, two different machine learning models were applied to exemplify the capabilities that arise when the complete time-series dataset can be queried efficiently. The models were shown to provide interesting insights for specific scenarios with a reasonable workload.
