Performance Tuning of Big Data Platform : Cassandra Case Study

University essay from Blekinge Tekniska Högskola/Institutionen för kommunikationssystem

Abstract: Usage of cloud-based storage systems gained a lot of prominence in fast few years. Every day millions of files are uploaded and downloaded from cloud storage. This data that cannot be handled by traditional databases and this is considered to be Big Data. New powerful platforms have been developed to store and organize big and unstructured data. These platforms are called Big Data systems. Some of the most popular big data platform are Mongo, Hadoop, and Cassandra. In this, we used Cassandra database management system because it is an open source platform that is developed in java. Cassandra has a masterless ring architecture. The data is replicated among all the nodes for fault tolerance. Unlike MySQL, Cassandra uses per-column basis technique to store data. Cassandra is a NoSQL database system, which can handle unstructured data. Most of Cassandra parameters are scalable and are easy to configure. Amazon provides cloud computing platform that helps a user to perform heavy computing tasks over remote hardware systems. This cloud computing platform is known as Amazon Web Services. AWS services also include database deployment and network management services, that have a non-complex user experience. In this document, a detailed explanation on Cassandra database deployment on AWS platform is explained followed by Cassandra performance tuning.    In this study impact on read and write performance with change Cassandra parameters when deployed on Elastic Cloud Computing platform are investigated. The performance evaluation of a three node Cassandra cluster is done. With the knowledge of configuration parameters a three node, Cassandra database is performance tuned and a draft model is proposed.             A cloud environment suitable for the experiment is created on AWS. A three node Cassandra database management system is deployed in cloud environment created. The performance of this three node architecture is evaluated and is tested with different configuration parameters. The configuration parameters are selected based on the Cassandra metrics behavior with the change in parameters. Selected parameters are changed and the performance difference is observed and analyzed. Using this analysis, a draft model is developed after performance tuning selected parameters. This draft model is tested with different workloads and compared with default Cassandra model. The change in the key cache memory and memTable parameters showed improvement in performance metrics. With increases of key cache size and save time period, read performance improved. This also showed effect on system metrics like increasing CPU load and disk through put, decreasing operation time and The change in memTable parameters showed the effect on write performance and disk space utilization. With increase in threshold value of memTable flush writer, disk through put increased and operation time decreased. The draft derived from performance evaluation has better write and read performance.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)