Lakehouse architecture for simplifying data science pipelines

University essay from Uppsala universitet/Institutionen för informationsteknologi

Author: Nicolas Martin; [2023]


Abstract: Data management and pre-processing often consume the majority of time spent by data scientists. The data architecture and the configuration of data pipelines significantly influence the efficiency of this work. An emerging 'Lakehouse' architecture combines the features of both a Data Lake and a Data Warehouse, eliminating the need to manage a two-tier system. This allows for the storage and processing of raw, structured, and semi-structured data on a unified platform, offering higher performance and decoupling computing from storage. The capabilities of this architecture are explored within Trase.earth, a leading initiative in commodity supply chain transparency that focuses on agricultural products driving deforestation. This thesis demonstrates that the Lakehouse architecture can simplify intricate data pipelines while enabling new functionalities. It also shows that this transition can be made backwards-compatible, rely on open standards, and reduce costs. The enhancements analyzed include data ingestion from heterogeneous sources, data discoverability, metadata management, data sharing, and pipeline management with the integration of data quality expectations. As an additional case study, graph data mining techniques are applied to the beef supply chain in the state of Pará, Brazil, using a dataset of sanitary records for animal transportation. Various methods for deriving and analyzing paths of indirect sourcing are employed, facilitating the identification and characterization of the most frequently traveled routes, trade communities, and node centrality. The code related to this thesis can be found at: https://github.com/nmartinbekier/ds_de_thesis
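The graph analyses mentioned in the abstract (indirect sourcing paths, trade communities, node centrality) could be illustrated with a minimal sketch like the one below. It is not taken from the thesis repository; the toy records, node names, and weights are invented for illustration, and networkx is assumed as the graph library.

```python
# Hypothetical sketch of supply-chain graph mining on toy animal transportation
# records. Data and node names are invented; this is not the thesis code.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy records: (origin property, destination property, number of shipments)
records = [
    ("farm_A", "farm_B", 12),
    ("farm_B", "slaughterhouse_X", 20),
    ("farm_A", "slaughterhouse_X", 3),
    ("farm_C", "farm_B", 7),
    ("farm_C", "slaughterhouse_Y", 5),
]

# Build a weighted directed graph of cattle movements
G = nx.DiGraph()
for origin, destination, shipments in records:
    G.add_edge(origin, destination, weight=shipments)

# Indirect sourcing: all simple paths from a farm to a slaughterhouse
indirect_paths = list(nx.all_simple_paths(G, "farm_A", "slaughterhouse_X"))

# Node centrality: betweenness highlights intermediary properties
centrality = nx.betweenness_centrality(G, weight="weight")

# Trade communities on the undirected projection of the movement graph
communities = greedy_modularity_communities(G.to_undirected(), weight="weight")

print(indirect_paths)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
print([sorted(c) for c in communities])
```

On real sanitary records, the same pattern would scale by aggregating shipment counts into edge weights before ranking the most frequently traveled routes.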
