Impact of model architecture and data distribution on self-supervised federated learning

University essay from Lunds universitet/Matematik LTH

Abstract: Data is a crucial resource for machine learning, but in many settings, such as healthcare or mobile devices, obstacles make it difficult to use the data that is available. The data is often distributed across many clients and private, so storing it centrally is inadvisable. Further, image data is often unlabeled, and external labeling is impossible due to its private nature. This project trains and examines a self-supervised representation encoder on distributed, unlabeled image data. We use federated averaging to create a federated implementation of the contrastive learning framework SimCLR and compare its performance to a non-federated implementation trained on the same data. Within the SimCLR framework, we test two encoder architectures, ResNet-18 and AlexNet. The encoders are trained in two federated settings: i.i.d., where all clients draw data from the same distribution, and non-i.i.d., where the client data distributions are completely disjoint. The quality of the learned representations is measured by the accuracy of a linear classifier trained on a small, labeled data set. We find that the best federated encoder reaches an average classifier accuracy of 67.0 % in the i.i.d. setting, only a small drop from the non-federated implementation, which reaches 69.0 %. However, the encoders trained in the non-i.i.d. setting have a lower average accuracy of 62.3 %. So while a federated model can perform on the level of a central one, unbalanced data distributions may pose a challenge in real-world federated applications.
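To make the method described above concrete, the following is a minimal PyTorch sketch of how federated averaging (FedAvg) can be combined with SimCLR's contrastive objective. It is not the essay's implementation: the function names (`nt_xent_loss`, `fed_avg`, `federated_simclr_round`), the optimizer, and all hyperparameters are illustrative assumptions, and `model` stands for the encoder together with SimCLR's projection head.

```python
# Illustrative sketch only, assuming PyTorch; not the essay's code.
import copy
import torch
import torch.nn.functional as F


def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent loss. z1, z2: projections of two augmented
    views of the same batch of N images, each of shape (N, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D)
    sim = z @ z.t() / temperature                        # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                # exclude self-pairs
    # The positive for sample i is at index i+n, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def fed_avg(state_dicts, weights):
    """FedAvg aggregation: weighted average of client parameters,
    with weights typically proportional to client data set sizes."""
    total = float(sum(weights))
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        stacked = torch.stack([sd[key].float() * (w / total)
                               for sd, w in zip(state_dicts, weights)])
        avg[key] = stacked.sum(dim=0).to(avg[key].dtype)
    return avg


def federated_simclr_round(global_model, client_loaders,
                           local_epochs=1, lr=1e-3):
    """One communication round: each client trains its own copy of the
    model locally with the SimCLR loss, then the server averages."""
    client_states, client_sizes = [], []
    for loader in client_loaders:
        model = copy.deepcopy(global_model)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for view1, view2 in loader:  # two augmentations per image
                loss = nt_xent_loss(model(view1), model(view2))
                opt.zero_grad()
                loss.backward()
                opt.step()
        client_states.append(model.state_dict())
        client_sizes.append(len(loader.dataset))
    global_model.load_state_dict(fed_avg(client_states, client_sizes))
    return global_model
```

Weighting the average by client data set size is the standard FedAvg choice; in the essay's non-i.i.d. setting, where client distributions are disjoint, it is exactly this averaging step that becomes harder, since the clients' local updates pull the shared encoder in different directions.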
