Column-based storage for analysis of high-frequency stock trading data

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Abdallah Hassan; [2019]


Abstract: This study investigated the efficiency of the available open-source column-based storage formats that support semi-flexible data, in combination with query engines that can query these formats. Two formats were identified, Parquet and ORC, and both were tested in two modes: uncompressed and compressed with the Snappy compression algorithm. They were tested by running two queries on the host company’s data after converting it to each format: one simple averaging query and one more complicated query with counts and filtering. The queries were run with two different query engines, Spark and Drill. They were also run on two datasets of different sizes to test scalability. The query execution time was recorded for each tested alternative. The results show that the Snappy-compressed formats always outperformed their uncompressed counterparts, and that Parquet was always faster than ORC. Drill was faster on the simple query, while Spark was faster on the complex query. Drill also showed the smallest increase in query execution time on both queries when the size of the dataset increased. The conclusion is that Parquet with Snappy is the storage format that gives the fastest execution times. However, both Spark and Drill have their own advantages as query engines.
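
The test matrix described in the abstract (two formats, two compression modes, two engines, two queries, with execution time recorded for each alternative) can be sketched as a small timing harness. This is only an illustrative sketch and not the code used in the study: the names `time_query`, `benchmark` and `dummy_runner` are hypothetical, the use of the median over three repeats is an assumption, and the dummy runner works on an in-memory list instead of submitting a query to Spark or Drill.

```python
import itertools
import statistics
import time

# Dimensions taken from the abstract.
FORMATS = ["parquet", "orc"]
COMPRESSIONS = ["uncompressed", "snappy"]
ENGINES = ["spark", "drill"]
QUERIES = ["simple_average", "count_with_filter"]


def time_query(run_query, repeats=3):
    """Return the median wall-clock time in seconds over `repeats` runs.

    The repeat count and the use of the median are assumptions; the
    abstract only states that execution time was recorded.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def benchmark(make_runner, repeats=3):
    """Time every (format, compression, engine, query) combination."""
    results = {}
    combos = itertools.product(FORMATS, COMPRESSIONS, ENGINES, QUERIES)
    for fmt, comp, engine, query in combos:
        runner = make_runner(fmt, comp, engine, query)
        results[(fmt, comp, engine, query)] = time_query(runner, repeats)
    return results


def dummy_runner(fmt, comp, engine, query):
    """Hypothetical stand-in for a real query submission.

    A real runner would send `query` to `engine` against files stored
    as `fmt` with compression `comp`; here both queries just scan a list.
    """
    prices = list(range(10_000))
    if query == "simple_average":
        return lambda: sum(prices) / len(prices)
    return lambda: sum(1 for p in prices if p % 2 == 0)


results = benchmark(dummy_runner)
fastest = min(results, key=results.get)
```

Running the same harness on a second, larger dataset and comparing the two result tables would correspond to the scalability test mentioned in the abstract; `fastest` holds the combination with the shortest measured time.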
