Scaling Apache Hudi by boosting query performance with RonDB as a Global Index: Adopting a LATS data store for indexing

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Storing and using voluminous data are difficult problems, and solving them has become more pressing with the exponential growth of information. Lakehouses are a relatively new approach that tries to address these problems while hiding the complexity from the user. They provide capabilities similar to those of a standard database while operating on top of low-cost storage and open file formats. An example of such a system is Hudi, which internally uses indexing to improve the performance of data management in tabular format. This study investigates whether execution times can be decreased by introducing a new engine option for indexing in Hudi. To that end, the thesis proposes using RonDB as a global index and further investigates the viability of the different connectors available for communicating with it. The research was conducted using both practical experiments and a study of the relevant literature. The analysis involved observations made over multiple workloads to document how well the solutions adapt to changes in requirements and types of actions. The results were recorded, visualized for the reader's convenience, and made available in a public repository. The conclusions did not coincide with the author's hypothesis that RonDB would provide the fastest indexing solution for all scenarios. Nonetheless, it was observed to be the most consistent approach, potentially making it the best general-purpose solution. For example, RonDB was observed to handle both read-heavy and write-heavy workloads while consistently providing low query latency independent of the file count.
