Instability of a bi-directional TiFGAN in unsupervised speech representation learning

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Matthaios Stylianidis; [2021]


Abstract: A major challenge in applying machine learning to the speech domain is the scarcity of annotated data. Supervised machine learning techniques depend heavily on the amount of labelled data and the quality of the labels. Unsupervised training methods, on the other hand, do not require labels and hence allow the use of much larger unlabelled datasets. In this thesis we investigate an unsupervised training method for learning representations of speech data. More specifically, we extend an existing Wasserstein Generative Adversarial Network (WGAN) architecture called the Time-Frequency GAN (TiFGAN), originally designed for unconditional speech generation, into a bi-directional architecture capable of learning representations. We assess how well our proposed bi-directional architecture (BiTiFGAN) learns speech representations by evaluating them on the supervised task of keyword detection using the Speech Commands dataset. We observe that training of our model is unstable, and in an attempt to stabilize it we try several different configurations of the architecture and training parameters. Mode collapse in the encoder is a common problem across our experiments; it degrades the performance obtained with the learned representations and makes training unstable. Nonetheless, by increasing the capacity of the BiTiFGAN discriminator we learn representations that are competitive with our baseline representations, such as Mel-frequency cepstral coefficients (MFCC) or filter bank energy (FBANK) features.
