Voice Feature Extraction Using Siamese Neural Networks for Detecting Impersonators

University essay from Lunds universitet/Matematisk statistik

Abstract: Voice impersonation is a technique that has often been used by criminals whose goal is to avoid being identified while committing a crime. There are, however, other interesting cases where the police confront a suspect with an incriminating recording, and the suspect denies being the speaker, claiming that the recording was made by an expert impersonator. In both of these cases, it would be very helpful for the police to be able to predict with high probability whether a recording belongs to the true speaker or an impersonator. This thesis aims to use neural networks to extract the most significant features for recognizing a unique voice, and then use them to classify whether a recording belongs to a true speaker or somebody impersonating them. To achieve this, we first extract the raw audio features that are commonly used in speech recognition, the majority of which are spectral features, and then feed these features to a Siamese neural network to generate an encoding that best represents a recording of a person's voice. The structure of a Siamese neural network is determined by the type of loss function being used. In this project, we compare the performance of different network structures as well as different classifiers used in classifying the speech from the encoding. We present our approach and results on a dataset consisting of recordings of prominent American political figures, their impersonators, and several other individuals.
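To illustrate how the loss function can shape a Siamese network, the following is a minimal NumPy sketch and not the implementation used in the thesis. The abstract does not name the losses that were compared; contrastive loss (two inputs per example) and triplet loss (three inputs per example) are assumed here as two common choices. The encoder `encode`, the weight matrix `W`, the 40-dimensional feature vectors, the 16-dimensional encoding and the margin values are all hypothetical.

```python
import numpy as np

def encode(x, W):
    """Hypothetical shared encoder: one linear layer, then L2 normalization.
    Every branch of the Siamese network uses the same weights W."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def contrastive_loss(z1, z2, same_speaker, margin=1.0):
    """Pair-based loss: pull same-speaker pairs together and push
    different-speaker pairs at least `margin` apart."""
    d = np.linalg.norm(z1 - z2, axis=-1)
    pos = same_speaker * d ** 2
    neg = (1 - same_speaker) * np.maximum(0.0, margin - d) ** 2
    return float(np.mean(pos + neg))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet-based loss: the anchor should be closer to the positive
    than to the negative by at least `margin` (squared distances)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Random stand-ins for per-recording feature vectors (assumed 40-dimensional).
rng = np.random.default_rng(0)
W = rng.normal(size=(40, 16))                  # assumed 16-dimensional encoding
x_a, x_p, x_n = rng.normal(size=(3, 8, 40))    # anchor, positive, negative batches

z_a, z_p, z_n = encode(x_a, W), encode(x_p, W), encode(x_n, W)

pair_loss = contrastive_loss(z_a, z_p, np.ones(8))   # two-branch structure
trip_loss = triplet_loss(z_a, z_p, z_n)              # three-branch structure
```

Under these assumptions, a contrastive loss leads to a two-branch network trained on labelled pairs, while a triplet loss leads to a three-branch network trained on (anchor, positive, negative) recordings. In both cases the branches share weights, and the resulting encoding is what a downstream classifier would consume.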
