Hive, Spark, Presto for Interactive Queries on Big Data

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Traditional relational database systems cannot be used efficiently to analyze data of large volume and varied formats, i.e. big data. Apache Hadoop is one of the first open-source tools to provide a distributed data storage system and a resource manager. The field of big data processing has grown quickly in recent years, and many technologies have been introduced into the big data ecosystem to address the problem of processing large volumes of data. Some of the early tools have become widely adopted, Apache Hive being one of them. However, with recent advances in technology, other tools such as Apache Spark and Presto are better suited for interactive analytics of big data. In this thesis these technologies are examined and benchmarked to determine their performance on interactive business intelligence queries. The benchmark is representative of such queries and uses a star-shaped schema. The performance of Hive on Tez, Hive LLAP, Spark SQL, and Presto is examined with text, ORC, and Parquet data at different data volumes and concurrency levels. A short analysis and conclusions are presented, with reasoning about the choice of framework and data format for a system that would run interactive queries on big data.
