Heterogeneous Storage in HopsFS
Abstract: In recent years, the Apache Hadoop Distributed File System (HDFS) has become increasingly popular for storing large data sets. Both the volume of the data and the variety of applications are unprecedented. This variety of tasks, each with its own access pattern and demands, calls for a file system that supports specialized storage for different tasks. This thesis describes the implementation of heterogeneous storage in HopsFS, a highly available, highly scalable version of HDFS. The implementation makes the cluster aware of different storage types (e.g. hard disks and solid-state drives) and allows users to specify preferred storage types for their data. By introducing new storage types, we build in support for storage technologies such as SSDs and RAID. The latter is of particular interest, since it increases both the bandwidth and the reliability of the storage on individual nodes while continuing to use commodity hardware. Since network bandwidth is increasing orders of magnitude faster than disk bandwidth, increasing disk throughput is vital to prevent local storage from becoming a bottleneck. The heterogeneous storage Application Programming Interface (API) described in this thesis offers HDFS administrators more control over their data while remaining compatible with the HDFS framework. Users can choose whether they want files stored on traditional disks, on SSDs, or on more complex configurations using RAID and erasure coding.