Speaker Recognition using Biology-Inspired Feature Extraction

University essay from Lunds universitet/Institutionen för elektro- och informationsteknik

Abstract: Distinguishing between people's voices is something the human brain does naturally, using only the frequencies picked up by the inner ear. The field of speaker recognition is concerned with making machines do the same using digitally sampled speech and data processing. The processing extracts relevant information from the high-dimensional acoustic data, helping the machine determine which speaker a speech sample belongs to. Several methods exist for this problem, most of which model a sample as a sequence of time frames, each representing the frequency characteristics of the sound input at that moment. A common choice of frequency characteristics is Mel-Frequency Cepstral Coefficients (MFCC), which represent the overall shape of the frequency spectrum of the input during each time frame. This thesis presents a different approach, inspired by findings on how the human brain processes tactile sensory input, in which an unsupervised learning model picks out important combinations of frequencies from the signal. These combinations emerge because the frequencies exhibit a spatiotemporal relationship across multiple data samples and speakers: their intensities correlate in time. Extracting such spatiotemporal patterns between input frequencies as features, instead of the overall spectrum shape, may lead to new, more robust ways of encoding auditory data.
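The conventional MFCC baseline that the thesis contrasts itself against can be sketched in plain NumPy as a minimal illustration: frame the signal, window it, take the power spectrum, pool it through a mel-spaced filterbank, take logs, and decorrelate with a DCT-II. This is a generic textbook pipeline, not the thesis's biology-inspired method, and all parameter values below (frame length, hop, filter and coefficient counts) are conventional defaults assumed here.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Textbook MFCC pipeline: frame -> window -> power spectrum
    -> mel filterbank -> log -> DCT-II. Parameter defaults are
    conventional values, not taken from the thesis."""
    signal = np.asarray(signal, dtype=float)

    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)

    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular filters spaced evenly on the mel scale.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)

    # Log mel-band energies, then a DCT-II to decorrelate them into
    # cepstral coefficients describing the overall spectrum shape.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ basis.T  # shape: (n_frames, n_ceps)
```

Each row of the result summarises the spectral envelope of one time frame; the thesis's approach instead learns features from correlations in time between individual frequency bands.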
