S3-HopsFS: A Scalable Cloud-native Distributed File System

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Joel Stenkvist; [2019]

Keywords: ;

Abstract: Data has been regarded as the new oil in today’s modern world. Data is generated everywhere from how you do online shopping to where you travel. Companies rely on analyzing this data to make informed business decisions and improve their products and services. However, storing this massive amount of data can be very expensive. Current distributed file systems rely on commodity hardware to provide strongly consistent data storage for big data analytics applications, such as Hadoop and Spark. Running these storage clusters can be very costly; it is estimated that storing 100 TB in an HDFS cluster with AWS EC2 costs $47,000 per month. On the other hand, using cloud storage such as Amazon’s S3 to store 100 TB only costs about $3,000 per month however S3 is not sufficient due to eventual consistency and low performance. Therefore, combining these two solutions is optimal for a cheap, consistent, and fast file system.This thesis outlines and builds a new class of distributed file system that utilizes cloud native block storage as the data-layer, such as Amazon’s S3. AWS recently increased the bandwidth from S3 to EC2 from 5 Gbps to 25Gbps, sparking new interest in this area. The new system is built on top of HopsFS; a hierarchical, distributed file system with a scale-out metadata layer utilizing an in-memory, distributed database called NDB which dramatically increases the scalability of the file system. In combination with native cloud storage, this new file system reduces the price of deployment by up to 15 times, but at a performance cost of 25% of the original HopsFS system (four times slower). However, tests in this research shows that S3-HopsFS can be improved towards 38% of the original performance by comparing it with only using S3 by itself. In addition to the new HopsFS version, S3Guard was developed to use NDB instead of Amazon’s DynamoDB to store the file tree hierarchy metadata. S3Guard is a tool that allows big data analytics applications such as Hive to utilize S3 as a direct input and output source for queries. The eventual consistency problems of S3 have been solved and tests show a 36% performance boost when listing and deleting files and directories. S3Guard is sufficient to support some big data analytic applications like Hive, but we lose all the benefits of HopsFS like the performance, scalability, and extended metadata -therefore we need a new file system combining both solutions.

  AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)