Big Data Workflows: DSL-based Specification and Software Containers for Scalable Execution

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Big Data workflows are composed of multiple orchestration steps that perform different data analytics tasks. These tasks process heterogeneous data using various computing and storage resources. Due to the diversity of application domains, the technologies involved, and the complexity of data sets, the design and implementation of Big Data workflows require the collaboration of domain experts and technical experts. However, existing tools are too technical and do not easily allow domain experts to participate in defining and executing Big Data workflows. Moreover, the majority of existing tools are designed for specific applications such as bioinformatics, computational chemistry, and genomics. They are also based on specific technology stacks that do not provide flexible means of code reuse and maintenance. This thesis presents the design and implementation of a Big Data workflow solution that uses a domain-specific language (DSL) to hide complex technical details, enabling domain experts to participate in the definition of workflows. The workflow solution uses a combination of software container technologies and message-oriented middleware (MOM) to enable highly scalable workflow execution. The applicability of the solution is demonstrated by implementing a prototype based on a real-world data workflow. The evaluations performed indicate that the proposed workflow solution provides efficient workflow definition and scalable execution. Furthermore, the results of a set of experiments are presented, comparing the performance of the proposed approach with Argo Workflows, one of the most promising tools in the area of Big Data workflows.
