Deep Learning for Speech Enhancement : A Study on WaveNet, GANs and General CNN-RNN Architectures
Abstract: Clarity and intelligiblity are important aspects of speech, especially in a time of misinformation and mistrust. The breakthrough in generative models for audio files has brought massive improvements for speech enhancement. Google’s WaveNet architecture has been modified for noise reduction in a model called WaveNet denoising and has proven to be state-of-the-art. Another competitor on the market would be the Speech Enhancement Generative Adversarial Network (SEGAN) which adapts the GAN architecture into applications on speech. While most older models focus on feature extraction and spectrogram analysis, these two models attempt to skip those steps and become end-to-end models completely. While end-to-end is good, data preprocessing is still a valuable asset to consider. A network designed by Microsoft Research called EHNet uses the spectrogram data as input instead of the mere 1D waveforms to capture more relations between datapoints as a higher dimension can enable more information. This thesis aims to explore the speech enhancement field of study from a deep learning perspective and focus on the three mentioned architectures in theory dissection and results from new datasets. There is also an implementation of the Wiener filter as a benchmark. We arrive at the conclusion that all three networks are viable in the task of enhancing speech, however SEGAN performed better on our dataset and was more robust to new data in comparison. For future work one could improve the evaluation methods, change datasets and implement hyperparameter optimization for further comparative analysis.
AT THIS PAGE YOU CAN DOWNLOAD THE WHOLE ESSAY. (follow the link to the next page)