Comparison of Popular Data Processing Systems
Abstract: Data processing is generally defined as the collection and transformation of data to extract meaningful information. Data processing involves a multitude of processes such as validation, sorting summarization, aggregation to name a few. Many analytics engines exit today for largescale data processing, namely Apache Spark, Apache Flink and Apache Beam. Each one of these engines have their own advantages and drawbacks. In this thesis report, we used all three of these engines to process data from the Carbon Monoxide Daily Summary Dataset to determine the emission levels per area and unit of time. Then, we compared the performance of these 3 engines using different metrics. The results showed that Apache Beam, while offered greater convenience when writing programs, was slower than Apache Flink and Apache Spark. Spark Runner in Beam was the fastest runner and Apache Spark was the fastest data processing framework overall.
AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)