Machine Learning for Text-Independent Speaker Verification: How to Teach a Machine to Recognize Human Voices

University essay from KTH/Skolan för elektro- och systemteknik (EES)

Author: Stefano Imoscopi; [2016]


Abstract: The aim of speaker recognition and verification is to identify people from the characteristics of their voices (voice biometrics). Traditionally this technology has been employed mostly for security or authentication purposes, identification of employees/customers and criminal investigations. During the last decade, the increasing popularity of hands-free and voice-controlled systems and the massive growth of media content generated on the internet have increased the need for techniques that automatically and accurately analyse speech signals. Speaker recognition is thus becoming a fundamental building block for the smart analysis of speech in video and audio content, along with other technologies like speech recognition and diarization. Examples of useful applications of these technologies are query-by-voice, automatic subtitling and automatic metadata generation for movies and television. In this thesis we evaluate different state-of-the-art techniques for text-independent speaker verification on a large database of read English speech (the LibriSpeech ASR corpus). The different techniques are compared in terms of classification accuracy, scalability and robustness to noise. A classification approach based on discriminatively trained Artificial Neural Networks (ANNs) is presented, showing superior classification performance to traditional generative models like Gaussian Mixture Models (GMMs) and i-vectors. The core contribution of the thesis is a novel hybrid generative/discriminative method, using ANNs and a GMM-Universal Background Model (UBM) to obtain state-of-the-art speaker recognition results. The advantage of the new system is that ANNs can be used while maintaining complete scalability: an arbitrary number of new speakers can be added to the system without retraining the speaker models. At the same time the system achieves very good performance, with only 0.23% Equal Error Rate (EER) in verification mode and 99.6% classification accuracy on a dataset of 2483 speakers, both male and female.
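The abstract reports results in terms of the Equal Error Rate (EER), the operating point at which the false-acceptance rate equals the false-rejection rate on verification trials. The sketch below is a minimal, generic illustration of how an EER can be estimated from genuine (target) and impostor (non-target) trial scores; it is not the thesis's evaluation code, and the function name and synthetic scores are purely illustrative.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Approximate the Equal Error Rate: the point where the
    false-acceptance rate (FAR) equals the false-rejection rate (FRR)."""
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.unique(np.concatenate([genuine_scores,
                                                   impostor_scores])))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    # The EER lies where the two error curves cross; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy usage with synthetic trial scores (illustrative only).
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # scores for target-speaker trials
impostor = rng.normal(0.0, 1.0, 1000)  # scores for non-target trials
print(f"EER ~ {100 * compute_eer(genuine, impostor):.2f}%")
```

With well-separated score distributions the crossing point, and hence the EER, approaches zero, which is why the reported 0.23% EER indicates very strong verification performance.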
