Distributed Robust Learning

University essay from KTH, School of Electrical Engineering and Computer Science (EECS)

Abstract: Training deep learning models on large amounts of data yields high accuracy; however, training on such large datasets on a single node is often infeasible. For example, the entire dataset may not fit in the memory of a single node, and training times can grow substantially. To avoid these problems, decentralized training is used. In decentralized training with data parallelism, multiple worker nodes each keep a local copy of the model and train it on a partition of the dataset. These locally trained models are aggregated at some point to obtain the final trained model. Architectures such as Parameter Server, All-reduce and Gossip each use their own network topology to implement decentralized training. However, this decentralized setting has a vulnerability: any of the worker nodes may behave arbitrarily and fail. This type of failure is called a Byzantine failure. Here, arbitrary means that a worker node may send incorrect parameters to the others, which can degrade the accuracy of the global model or, in some cases, cause the entire system to fail. To tolerate such arbitrary failures, aggregation rules have been devised and tested using the Parameter Server architecture. In this thesis we analyse the fault tolerance of the Ring All-reduce architecture to Byzantine gradients using aggregation rules such as Krum, Brute and Bulyan. We also inject adversaries during model training to observe which of the aforementioned aggregation rules provides better resilience to Byzantine gradients.
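
As a rough illustration of the kind of Byzantine-resilient aggregation rule the abstract names, the following is a minimal NumPy sketch of Krum-style selection: each candidate gradient is scored by the sum of squared distances to its n - f - 2 nearest neighbours, and the lowest-scoring candidate is kept. The function name, toy gradients and adversarial values are illustrative assumptions, not taken from the essay itself.

```python
import numpy as np

def krum(gradients, f):
    """Krum-style selection: score each gradient by the sum of squared
    distances to its n - f - 2 closest neighbours and return the gradient
    with the lowest score (assumed sketch, not the thesis implementation)."""
    n = len(gradients)
    assert n > 2 * f + 2, "Krum requires n > 2f + 2 workers"
    scores = []
    for i, g_i in enumerate(gradients):
        # Squared Euclidean distances from g_i to every other gradient.
        dists = sorted(
            np.sum((g_i - g_j) ** 2) for j, g_j in enumerate(gradients) if j != i
        )
        # Sum over the n - f - 2 nearest neighbours only.
        scores.append(sum(dists[: n - f - 2]))
    return gradients[int(np.argmin(scores))]

# Toy example: six well-behaved gradients plus two Byzantine outliers.
rng = np.random.default_rng(0)
honest = [rng.normal(1.0, 0.1, size=4) for _ in range(6)]
byzantine = [np.full(4, 100.0), np.full(4, -100.0)]
print(krum(honest + byzantine, f=2))
```

Because the score counts only the closest neighbours, gradients sent by Byzantine workers that lie far from the honest majority receive large scores and are never selected, which is the resilience property the thesis evaluates under different architectures.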
