DataOps : Towards Understanding and Defining Data Analytics Approach

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Data collection and analysis approaches have changed drastically in the past few years. The reason behind adopting different approach is improved data availability and continuous change in analysis requirements. Data have been always there, but data management is vital nowadays due to rapid generation and availability of various formats. Big data has opened the possibility of dealing with potentially infinite amounts of data with numerous formats in a short time. The data analytics is becoming complex due to data characteristics, sophisticated tools and technologies, changing business needs, varied interests among stakeholders, and lack of a standardized process. DataOps is an emerging approach advocated by data practitioners to cater to the challenges in data analytics projects. Data analytics projects differ from software engineering in many aspects. DevOps is proven to be an efficient and practical approach to deliver the project in the Software Industry. However, DataOps is still in its infancy, being recognized as an independent and essential task data analytics. In this thesis paper, we uncover DataOps as a methodology to implement data pipelines by conducting a systematic search of research papers. As a result, we define DataOps outlining ambiguities and challenges. We also explore the coverage of DataOps to different stages of the data lifecycle. We created comparison matrixes of different tools and technologies categorizing them in different functional groups to demonstrate their usage in data lifecycle management. We followed DataOps implementation guidelines to implement data pipeline using Apache Airflow as workflow orchestrator inside Docker and compared with simple manual execution of a data analytics project. As per evaluation, the data pipeline with DataOps provided automation in task execution, orchestration in execution environment, testing and monitoring, communication and collaboration, and reduced end-to-end product delivery cycle time along with the reduction in pipeline execution time. 

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)