MDX on Hadoop: A case study on OLAP for Big Data

University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT)

Author: Jakob Stengård; [2015]


Abstract: Online Analytical Processing (OLAP) is a method for analyzing data in business intelligence and data mining using n-dimensional hypercubes. These cubes store aggregates over multiple dimensions of the data, and are traditionally computed from a dimensional relational model in an SQL database, known as a star schema. Multidimensional Expressions (MDX) is a query language commonly used by BI tools to query OLAP cubes. This thesis investigates ways to run online, OLAP-like queries against a dimensional relational model hosted in a Hadoop cluster. The evaluation compares Hive-on-Spark and Hive-on-Tez across various storage formats. The most significant conclusions are that Hive-on-Tez delivers better performance than Hive-on-Spark, and that ORC appears to be the best-performing format. It could not be demonstrated that response times under 20 seconds are achievable for all queries with the given setup and dataset, or that the order of the input data significantly affects the performance of the ORC format. Scaling seems fairly linear for a cluster of 3 nodes. It also could not be demonstrated that Hive indexes or bucketing improve performance.
