Integrating Pig and Stratosphere

University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT)

Author: Vasiliki Kalavri; [2012]


Abstract: MapReduce is a widespread programming model for processing large amounts of data in parallel. PACT is a generalization of MapReduce, based on the concept of Parallelization Contracts (PACTs). Writing efficient applications in MapReduce or PACT requires strong programming skills and an in-depth understanding of the systems’ architectures. Several high-level languages have been developed to make the power of these systems accessible to non-experts, save development time, and make application code easier to understand and maintain. One of the most popular high-level dataflow systems is Apache Pig. Pig overcomes Hadoop’s one-input and two-stage dataflow limitations, allowing the developer to write SQL-like scripts. However, Hadoop’s limitations are still present in the backend system and add a notable overhead to the execution time. Pig is currently implemented on top of Hadoop; however, it has been designed to be modular and independent of the execution engine. In this thesis project, we propose the integration of Pig with another framework for parallel data processing, Stratosphere. We show that Stratosphere has desirable properties that significantly improve Pig’s performance. We present an algorithm that translates Pig Latin scripts into PACT programs that can be executed on the Nephele execution engine. We also present a prototype system that we have developed, and we provide measurements on a set of basic Pig scripts and their native MapReduce and PACT implementations. We show that the Pig-Stratosphere integration is very promising and can lead to Pig scripts executing even more efficiently than native MapReduce applications.
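
To make the one-input, two-stage limitation mentioned in the abstract concrete, the following is a minimal in-memory sketch in Python. It is not taken from the thesis and does not use the Hadoop, Pig, or Stratosphere APIs; the function names `map_reduce` and `match`, and the sample data, are hypothetical. The first function models a single-input map-then-reduce dataflow; the second models a two-input contract in the spirit of PACT's Match, which hands user code every pair of records with equal keys.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Single-input, two-stage dataflow: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def match(left, right, join_fn):
    """Two-input contract (illustrative): call join_fn once for every
    pair of (key, value) records from left and right with equal keys."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    results = []
    for key, left_value in left:
        for right_value in right_by_key.get(key, []):
            results.append(join_fn(key, left_value, right_value))
    return results


# Word count fits the single-input model directly.
counts = map_reduce(
    ["a b a", "b c"],
    lambda line: [(word, 1) for word in line.split()],
    lambda key, values: sum(values),
)

# A join needs two inputs; with a two-input contract it is one operator.
users = [(1, "ann"), (2, "bob")]      # hypothetical sample data
visits = [(1, "/home"), (1, "/docs"), (3, "/faq")]
joined = match(users, visits, lambda key, name, url: (name, url))
```

In a strictly single-input model, the join would typically be expressed by tagging records from both datasets, merging them into one input, and separating them again inside the reduce function; a two-input contract avoids that encoding.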
