Diffusion-based Vocoding for Real-Time Text-To-Speech

University essay from Lunds universitet / Matematisk statistik (Mathematical Statistics)

Abstract: The emergence of machine-learning-based text-to-speech systems has made fully automated customer-service voice calls, spoken personal assistants, and the creation of synthetic voices seem well within reach. However, many technical challenges remain in building such a system so that it generates audio quickly and at sufficiently high quality. One critical component of the typical text-to-speech pipeline is the vocoder, which produces the final waveform. This thesis investigates solving the vocoder problem using a statistical framework called diffusion, in which a neural network is taught to sequentially transform noise into recorded speech. Experiments extend the framework with three theoretical improvements and evaluate a range of diffusion-based vocoders that use these improvements, with respect to inference speed and audio quality. In addition, a new variant of one such improvement, called a "variance schedule", is proposed and shown to perform on par with previously adopted methods. Greater training stability is also achieved via methods inspired by diffusion models for image generation. The extensions of the framework are found to have a mostly positive effect on model performance, and audio can be generated at a quality equal to that of current state-of-the-art vocoders based on Generative Adversarial Networks, though not at the same speed. Furthermore, we find that a diffusion-based vocoder can achieve a 12-fold speed-up while retaining comparable audio quality, and we are convinced that further speed-ups are possible. Inference for a real-time text-to-speech application is thought to be viable on a graphics processing unit, but not on a central processing unit.
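
To illustrate the process the abstract describes, namely sequentially transforming noise into recorded speech conditioned on an acoustic representation such as a mel spectrogram, a minimal DDPM-style sampling loop is sketched below in PyTorch. The `model` interface, the linear beta ("variance") schedule and its endpoints, and the 256-sample hop size are all illustrative assumptions, not details taken from the thesis.

```python
import torch

def linear_beta_schedule(num_steps: int,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.05) -> torch.Tensor:
    # A generic linear variance ("beta") schedule; the variant proposed
    # in the thesis is not specified in the abstract.
    return torch.linspace(beta_start, beta_end, num_steps)

@torch.no_grad()
def diffusion_vocode(model, mel: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Sequentially transform Gaussian noise into a waveform,
    conditioned on a mel spectrogram (the vocoder problem).

    Assumes `model(x_t, t, mel)` predicts the noise component of x_t,
    as in DDPM; `mel` has shape (batch, n_mels, frames).
    """
    betas = linear_beta_schedule(num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Waveform length tied to the mel frame count via an assumed
    # hop size of 256 samples per frame.
    x = torch.randn(mel.shape[0], mel.shape[-1] * 256)

    for t in reversed(range(num_steps)):
        eps = model(x, torch.full((x.shape[0],), t), mel)
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # Add scheduled noise on all but the final denoising step.
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```

Fewer sampling steps trade audio quality for inference speed, which is the axis along which the abstract reports its 12-fold speed-up.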
