Swedish Language End-to-End Automatic Speech Recognition for Media Monitoring using Deep Learning

University essay from Luleå tekniska universitet/Institutionen för system- och rymdteknik

Abstract: In order to extract relevant information from speech recordings, the general approach is to first convert the audio into transcribed text. The text can then be analysed using well-researched methods. NewsMachine AB provides customers with an overview of how they are represented in media by analysing articles in text form. Their plans to scale up their monitoring of publicly available speech recordings were the basis for this thesis. In this thesis I compare three end-to-end Automatic Speech Recognition (ASR) models in order to find the model that currently works best for transcribing Swedish-language radio recordings, considering accuracy and inference speed (computational complexity). The results show that the QuartzNet architecture is the fastest, but the wav2vec models pre-trained on Swedish speech and provided by KBLab have by far the best accuracy. The KBLab model was then fine-tuned further on subsets of radio recordings with varying amounts of training data. The results show that further fine-tuning the KBLab models on low-resource Swedish speech domains achieves impressive accuracy: with just 5 hours of training data, the result is 11.5% Word Error Rate and 3.8% Character Error Rate. A final model was fine-tuned on all 35 hours of the radio domain dataset, resulting in a model achieving 10.4% Word Error Rate and 3.5% Character Error Rate. The thesis presents a complete pipeline able to convert audio of any length into a transcription. Segmentation is performed as a pre-processing step, splitting the audio at silences, which are taken to mark where one sentence stops and a new one begins. The audio segments are passed to the final fine-tuned ASR model, and the resulting transcriptions are concatenated into the complete punctuated transcript. This implementation allows for punctuation and for timestamping when sentences occur in the audio. The results show that the complete pipeline performs well on high-quality audio recordings, but more work is needed to achieve optimal performance on noisy and disrupted audio.
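
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Word Error Rate and Character Error Rate are commonly computed: the Levenshtein edit distance between a reference and a hypothesis transcript, divided by the reference length. It is not taken from the thesis. The function names are hypothetical, and the choices to split words on whitespace and to count spaces as characters are assumptions; the thesis may normalise text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()  # assumption: words are whitespace-separated
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Assumption: spaces count as characters.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    reference = "det var en gång"
    hypothesis = "det är en gång"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because the error count is divided by the reference length, both rates can exceed 100% when the hypothesis contains many insertions.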
