Representing Voices Using Convolutional Neural Network Embeddings

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Niklas Embretsén; [2019]


Abstract: In today’s society, services centered around voices are gaining popularity. Being able to provide users with voices they like, to obtain and sustain their attention, is important for enhancing the overall experience of the service. Finding an efficient way of representing voices such that similarity comparisons can be performed is therefore of great use. In the field of Natural Language Processing, great progress has been made using embeddings from Deep Learning models to represent words in an unsupervised fashion. These representations managed to capture the semantics of the words. This thesis sets out to explore whether such embeddings can be found for audio data as well, more specifically for voices of audiobook narrators, such that they capture similarities between different voices. For this purpose, two different Convolutional Neural Networks are developed and evaluated, both trained on spectrogram representations of the voices. One performs regular classification, while the other uses pairwise relationships and a Kullback–Leibler divergence based loss function, in an attempt to minimize the difference between the outputs for similar pairs of samples and maximize it for dissimilar pairs. From these models, the embeddings used to represent each sample are extracted from the different layers of the fully connected part of the network during evaluation. Both an objective and a subjective evaluation are performed. During the objective evaluation of the models, it is first investigated whether the found embeddings are distinct for the different narrators, and whether the embeddings encode information about gender. The regular classification model is then further evaluated through a user test, as it achieved an order of magnitude better results during the objective evaluation. The user test sets out to evaluate whether the found embeddings capture information based on perceived similarity.
It is concluded that the proposed approach has the potential to be used for representing voices in a way such that similarity is encoded, although more extensive testing, research and evaluation are needed to confirm this. For future work, it is proposed to perform more sophisticated pre-processing of the data and also to collect and include data about relationships between voices during the training of the models.
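
The pairwise objective mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: the function names `kl_divergence` and `pairwise_kl_loss`, the one-directional divergence, the hinge formulation for dissimilar pairs, and the `margin` value are all hypothetical and not taken from the thesis.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # eps guards against log(0) and division by zero; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def pairwise_kl_loss(p, q, similar, margin=2.0):
    """Loss for one pair of network outputs p and q (e.g. softmax distributions).

    Similar pairs are penalised by their divergence, pulling the outputs together.
    Dissimilar pairs are penalised only while their divergence is below `margin`
    (an assumed hyperparameter), pushing the outputs apart.
    """
    d = kl_divergence(p, q)
    if similar:
        return d
    return max(0.0, margin - d)

# Hypothetical outputs for three samples.
a = [0.98, 0.01, 0.01]
b = [0.01, 0.01, 0.98]

loss_same_similar = pairwise_kl_loss(a, a, similar=True)       # identical outputs, similar pair
loss_same_dissimilar = pairwise_kl_loss(a, a, similar=False)   # identical outputs, dissimilar pair
loss_far_dissimilar = pairwise_kl_loss(a, b, similar=False)    # well-separated outputs, dissimilar pair
```

In this sketch, identical outputs cost nothing when the pair is labelled similar and cost the full margin when it is labelled dissimilar, while outputs that are already far apart incur no penalty for a dissimilar pair.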
