Mispronunciation Detection with SpeechBlender Data Augmentation Pipeline

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: The rise of multilingualism has fueled the demand for computer-assisted pronunciation training (CAPT) systems for language learning. CAPT systems make use of advances in speech technology and offer features such as learner assessment and curriculum management. Mispronunciation detection (MD) is a crucial aspect of CAPT, aimed at identifying and correcting mispronunciations in second language learners’ speech. One of the significant challenges in developing MD models is the limited availability of labeled second-language speech data. To overcome this, the thesis introduces SpeechBlender, a fine-grained data augmentation pipeline designed to generate mispronunciations. SpeechBlender targets different regions of a phonetic unit and blends raw speech signals through linear interpolation, resulting in erroneous pronunciation instances. This method generates more effective samples than traditional cut/paste methods. The thesis also explores the use of pre-trained automatic speech recognition (ASR) systems for MD, and examines various phone-level features that can be extracted from pre-trained ASR models and used for MD tasks. A deep neural model is proposed that enhances the representations of the extracted acoustic features by combining them with positional phoneme embeddings. The efficacy of the augmentation technique is demonstrated through a phone-level pronunciation quality assessment task using only well-pronounced non-native speech data. Our proposed technique achieves state-of-the-art results on the Speechocean762 dataset [54] among ASR-dependent MD models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state-of-the-art [17]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline.
In this thesis, we also developed AraVoiceL2, the first pronunciation learning corpus for Arabic, to demonstrate the generality of our proposed model and augmentation technique. We used the corpus to evaluate how well our approach improves mispronunciation detection for non-native learners of Arabic. Our experiments showed promising results, with a 4.6% increase in F1-score on the AraVoiceL2 test set.
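
The blending step named in the abstract (linear interpolation of raw speech signals over a region of a phonetic unit) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `blend_region`, the mixing weight `lam`, the sample-index region boundaries, and the nearest-sample resizing of the donor segment are all assumptions made for illustration only.

```python
def blend_region(good, donor, start, end, lam=0.5):
    """Blend `donor` into `good` over the sample range [start, end).

    Hypothetical helper: each sample in the region becomes
    lam * good + (1 - lam) * donor; samples outside it are unchanged.
    """
    if not 0 <= start <= end <= len(good):
        raise ValueError("region must lie inside the good segment")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if not donor:
        raise ValueError("donor segment must not be empty")

    out = list(good)
    n = end - start
    for i in range(n):
        # Assumption: map the region onto the donor by nearest-sample
        # lookup, so the donor may have a different length.
        j = round(i * (len(donor) - 1) / (n - 1)) if n > 1 else 0
        # Linear interpolation between the two raw signals.
        out[start + i] = lam * good[start + i] + (1 - lam) * donor[j]
    return out


# Toy example: blend a silent donor into the middle of a constant signal.
good = [1.0] * 8
donor = [0.0] * 4
blended = blend_region(good, donor, start=2, end=6, lam=0.25)
```

Here `lam` close to 1 keeps the region near the original pronunciation, while `lam` close to 0 replaces it with the donor, which approaches a plain cut/paste.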
