Experiments in speaker diarization using speaker vectors

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Ming Cui; [2021]

Keywords: Speaker Diarization; Embedding Extraction Module; Deep Learning; Supervised method; Unsupervised method

Abstract: Speaker diarization is the task of determining ‘who spoke when?’ in an audio or video recording that contains an unknown amount of speech and an unknown number of speakers. It has emerged as an increasingly important, dedicated domain of speech research. It was initially proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. In recent years, however, it has become a key technology for many tasks, such as navigation, retrieval, and higher-level inference on audio data. Our research focuses on existing speaker diarization algorithms, and in particular on the differences between supervised and unsupervised methods. The aim of this thesis is to survey the state-of-the-art algorithms and analyze which one is most suitable for our application scenarios. Its main contributions are (1) an empirical study of speaker diarization algorithms; (2) appropriate pre-processing of the corpus data; (3) an audio embedding network for creating d-vectors; (4) experiments on, and a comparison of, different algorithms and corpora; (5) a recommendation suited to our requirements. The empirical study shows that, for the embedding extraction module, diarization performance can be significantly improved by replacing i-vectors with d-vectors, because the neural networks that produce d-vectors can be trained on large datasets. Moreover, the differences between supervised and unsupervised methods lie mostly in the clustering module. The thesis uses only d-vectors as input to the diarization network and selects two main algorithms for comparison: Spectral Clustering represents the unsupervised methods, and the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) represents the supervised methods.
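
To make the unsupervised branch concrete, the following is a minimal sketch of spectral clustering applied to per-segment speaker embeddings. It is not the thesis's implementation, which the abstract does not describe. It assumes exactly two speakers, a cosine-similarity affinity clipped at zero, and a split by the sign of the second eigenvector of the normalised affinity matrix (found here by power iteration), instead of the more common k-means step over several eigenvectors. The function names and the 2-dimensional toy vectors standing in for d-vectors are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def spectral_bipartition(embeddings, iterations=200, seed=0):
    """Split segments into two speakers; the first segment always gets label 0.

    Assumes every segment has positive similarity to at least one other
    segment, so that all degrees are non-zero.
    """
    n = len(embeddings)
    # Affinity matrix: cosine similarity clipped at zero, no self-loops.
    affinity = [[max(0.0, cosine(embeddings[i], embeddings[j])) if i != j else 0.0
                 for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in affinity]
    # Symmetrically normalised affinity S = D^(-1/2) A D^(-1/2).
    s = [[affinity[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of S is proportional to sqrt(degree); it carries
    # no cluster information, so it is projected out at every step.
    top_norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / top_norm for d in degree]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Power iteration on S + I (the shift makes all eigenvalues non-negative).
        w = [v[i] + sum(s[i][j] * v[j] for j in range(n)) for i in range(n)]
        proj = sum(a * b for a, b in zip(w, top))
        w = [a - proj * b for a, b in zip(w, top)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    # The sign pattern of the second eigenvector gives the two clusters.
    first_positive = v[0] > 0
    return [0 if (x > 0) == first_positive else 1 for x in v]


# Toy stand-ins for d-vectors: one embedding per speech segment, in time order.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0], [0.0, 1.0]]
print(spectral_bipartition(segments))
```

The returned list assigns one speaker label per segment, which together with the segment timestamps answers ‘who spoke when?’. A supervised method such as UIS-RNN would replace only this clustering step and consume the same embeddings.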
