Speaker verification: Advantages and limitations of a biologically inspired feature extractor

University essay from Lunds universitet/Institutionen för elektro- och informationsteknik

Abstract: Speaker verification is the process of verifying a person's identity from their voice. The process usually comprises two steps: a feature extractor maps the speech signal into features, and a post processor then classifies these features. The most common features in speaker verification today are the short-time Fourier transform (STFT), mel filter bank energies (MFBs), and mel-frequency cepstral coefficients (MFCCs), which are different spectral representations of the speech signal. Recently, a biologically inspired feature extractor called the cuneate nucleus (CN) model, which outputs CN features, was developed. The main goal of this Master's thesis is to find an optimal artificial neural network (ANN) post processor for the CN features. Testing different models on both conventional features and CN features showed that a convolutional neural network (CNN) model and a long short-term memory (LSTM) model were the most suitable. The performance results showed that the CN features and the STFT performed well on noisy data but worse on clean data than the MFCCs and MFBs. A statistical analysis of the features was conducted using cross correlation, average activity, and entropy. The analysis concluded that the inherent dynamical properties of the CN features and the STFT make training an ANN difficult, and that performance on clean data is therefore poor. On the other hand, these same dynamical properties are what allow the features to perform well on noisy data. In comparison, the MFCCs and MFBs have the opposite inherent properties, which give them state-of-the-art performance on clean data but poor performance on noisy data. This in turn means that a conventional ANN post processor can provide only limited performance for CN features, and that other post-processing methods need to be developed to go beyond that limit.
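The abstract describes the STFT, MFBs, and MFCCs as different spectral representations of the same signal. The following sketch is not taken from the thesis. It shows one standard way each representation can be derived from the previous one. First, a windowed power spectrum is computed per frame (STFT). Next, the power in triangular, mel-spaced filters is summed and log-compressed (MFBs). Finally, a DCT-II is applied to those log energies (MFCCs). The frame length, hop size, Hamming window, HTK-style mel formula, 20 filters, and 13 coefficients are illustrative assumptions, since the abstract does not state the thesis's settings. All function names are hypothetical. A naive DFT keeps the example dependent on the standard library only.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_power(signal, frame_len, hop):
    """STFT power spectrum: |X[k]|^2 for k = 0..frame_len//2 of each windowed frame.

    Uses a naive O(N^2) DFT for clarity; real systems use an FFT.
    """
    win = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + i] * win[i] for i in range(frame_len)]
        spec = []
        for k in range(frame_len // 2 + 1):
            x = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            spec.append(abs(x) ** 2)
        frames.append(spec)
    return frames


def hz_to_mel(f):
    """HTK-style mel scale (an assumed choice; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, frame_len, sample_rate):
    """Triangular filters with centers evenly spaced on the mel scale.

    Each row holds one filter's weights over the frame_len//2 + 1 DFT bins.
    """
    n_bins = frame_len // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    mel_points = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((frame_len + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * n_bins
        for k in range(left, center):   # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):  # falling edge, peak 1.0 at center
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def log_mel(power_frames, bank, eps=1e-10):
    """MFBs: log of the power collected by each mel filter, per frame."""
    return [[math.log(sum(w * p for w, p in zip(row, frame)) + eps) for row in bank]
            for frame in power_frames]


def dct2(x, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
            for k in range(n_coeffs)]


def mfcc(mfb_frames, n_coeffs):
    """MFCCs: DCT-II of each frame's log mel filter bank energies."""
    return [dct2(frame, n_coeffs) for frame in mfb_frames]
```

For example, a 1 kHz tone sampled at 8 kHz and split into 256-sample frames with a hop of 128 yields a power spectrum whose peak lies in bin 1000 · 256 / 8000 = 32. The stacking also shows that the MFBs and MFCCs discard fine spectral detail the STFT keeps, because each stage compresses the previous one.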
