Load Balancing in a Distributed Storage System for Big and Small Data

University essay from KTH/Skolan för informations- och kommunikationsteknik (ICT)

Author: Lars Kroll; [2013]

Abstract: Distributed storage services form the backbone of modern large-scale applications and data processing solutions. In this integral role they have to provide a scalable, reliable and performant service. One of the major challenges any distributed storage system has to address is skew in the data load, which can arise either in the distribution of data items or in the distribution of data accesses over the nodes in the system. One widespread approach to dealing with skewed load is to assign data using uniform consistent hashing. However, there is an opposing desire to optimise for and exploit data locality; that is, it is advantageous to collocate items that are typically accessed together. Often this locality property can be achieved by storing keys in an ordered fashion and using application-level knowledge to construct keys in such a way that items accessed together end up very close together in the key space. It is easy to see, however, that this behaviour exacerbates the load skew issue. A different approach to load balancing is to partition the data into small subsets that can be relocated independently. These subsets may be known as partitions, tablets or virtual nodes, for example. In this thesis we present the design of CaracalDB, a distributed key-value store that provides automatic load balancing and data locality, as well as fast re-replication after node failures, while remaining flexible enough to support a choice of consistency levels. We also evaluate an early prototype of the system and show that the approach is viable.
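
The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis and does not show CaracalDB's actual design: the class names `HashRing` and `RangePartitions`, the node names, the split keys and the `user42/post/...` key scheme are all hypothetical and chosen only for illustration. The sketch contrasts placement by consistent hashing, which ignores key order, with an ordered key space that is split into partitions which can be reassigned independently.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    """Map a string to a 64-bit integer position on the hash ring."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hashing (illustrative): each node owns several ring points."""

    def __init__(self, nodes, points_per_node=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect_right(self._positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]


class RangePartitions:
    """Ordered key space (illustrative) split at boundary keys.

    Partition i holds the keys between split key i-1 (inclusive) and
    split key i (exclusive); each partition is assigned to one node.
    """

    def __init__(self, split_keys, owners):
        if len(owners) != len(split_keys) + 1:
            raise ValueError("need exactly one owner per partition")
        self._split_keys = sorted(split_keys)
        self._owners = list(owners)

    def partition_of(self, key: str) -> int:
        return bisect.bisect_right(self._split_keys, key)

    def owner(self, key: str) -> str:
        return self._owners[self.partition_of(key)]

    def move(self, partition: int, new_owner: str) -> None:
        # Relocate one partition without touching any other partition.
        self._owners[partition] = new_owner


# Hypothetical cluster and key scheme, assumed for this example only.
NODES = ["node-a", "node-b", "node-c", "node-d"]
ring = HashRing(NODES)
ranges = RangePartitions(split_keys=["g", "n", "u"], owners=NODES)

# Keys built so that items accessed together are adjacent in key order.
keys = [f"user42/post/{i:04d}" for i in range(100)]

hashed_owners = {ring.owner(k) for k in keys}     # typically several nodes
ordered_owners = {ranges.owner(k) for k in keys}  # a single node
```

With ordered placement all one hundred keys fall into one partition on one node, which gives locality but concentrates the load there; the hash ring usually scatters the same keys over several nodes. Calling `ranges.move(ranges.partition_of(keys[0]), "node-a")` then relocates that one partition independently, which is the kind of operation a partition-based load balancer relies on.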
