Text and Speech Alignment Methods for Speech Translation Corpora Creation: Augmenting English LibriVox Recordings with Italian Textual Translations

University essay from Uppsala universitet/Institutionen för lingvistik och filologi

Abstract: The recent rise of end-to-end speech translation models requires a new generation of parallel corpora, composed of large numbers of source-language speech utterances aligned with their target-language textual translations. We present a pipeline and a set of methods to collect hundreds of hours of English audiobook recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists of three main stages: text collection, bilingual text alignment, and forced alignment. For the text collection task, we show how to automatically find e-book titles in a target language by using machine translation, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a larger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenations of consecutive sentences. Our findings suggest that the consecutive-sentence-embedding similarity approach improves the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, and we suggest and use a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline. The result of our experiments is a publicly available multi-modal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split into 60,561 triplets of English audio, English transcript, and Italian textual translation. We also post-processed the corpus to extract 40 MFCC features from the audio segments and released them as a dataset.
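
To make the dynamic time warping step of the TTS-DTW approach more concrete, the following is a minimal illustrative sketch, not the implementation used in the thesis. It assumes that both the synthesized (TTS) audio and the real recording have already been reduced to feature sequences; for simplicity the sequences here are one-dimensional numbers rather than MFCC vectors, and the function name `dtw_path` is hypothetical.

```python
def dtw_path(ref, query):
    """Align two 1-D feature sequences with classic dynamic time warping.

    Returns (total_cost, path), where path is a list of (ref_index,
    query_index) pairs from the start to the end of both sequences.
    Assumption: absolute difference is used as the local distance.
    """
    n, m = len(ref), len(query)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning ref[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(ref[i - 1] - query[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in ref only
                cost[i][j - 1],      # advance in query only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    synthesized = [0, 1, 2, 3]   # stand-in for TTS audio features
    recorded = [0, 0, 1, 2, 3]   # stand-in for the real recording's features
    total, path = dtw_path(synthesized, recorded)
    print(total, path)
```

In a forced alignment setting, the time stamps of text units are known in the synthesized audio, so a warping path of this kind could be used to map them onto the recording. A practical system would operate on multi-dimensional features and typically constrain the search for efficiency.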
