A Data-Driven Approach For Automatic Visual Speech In Swedish Speech Synthesis Applications
Abstract: This project investigates the use of artificial neural networks (ANNs) for visual speech synthesis. The objective was to produce a framework for animated chat bots in Swedish. A literature survey showed that the state-of-the-art approach used ANNs with either audio or phoneme sequences as input. Three subjective surveys were conducted, in the context of the final product as well as in a more neutral context with less post-processing. They compared the ground truth, captured using the depth-sensing camera of the iPhone X, against both the ANN model and a baseline model. The statistical analysis used mixed effects models to test for statistically significant differences. The temporal dynamics and the error were also analyzed. The results show that a relatively simple ANN was capable of learning a mapping from phoneme sequences to blend shape weight sequences with satisfactory results, except that the requirements of certain consonants were not fulfilled. The issues with these consonants were also observed, to some extent, in the ground truth. Post-processing with consonant-specific overlays made the ANN's animations indistinguishable from the ground truth, and the subjects perceived them as more realistic than the baseline model's animations. The ANN model proved useful for learning the temporal dynamics and coarticulation effects of vowels, but may have needed more data to properly satisfy the requirements of certain consonants. For the purposes of the intended product, these requirements can be satisfied using consonant-specific overlays.
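The abstract does not specify how the consonant-specific overlays were implemented. The following is only a minimal sketch of one plausible reading, in which an overlay enforces a lower bound on a blend shape weight during frames belonging to a given consonant. The function `apply_overlays`, the table `OVERLAYS`, the blend shape name and the threshold value are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Hypothetical overlay table: phoneme -> (blend shape name, minimum weight).
# Assumption: bilabial consonants require lip closure. The shape name and
# the 0.9 threshold are illustrative placeholders, not values from the thesis.
OVERLAYS: Dict[str, Tuple[str, float]] = {
    "p": ("mouthClose", 0.9),
    "b": ("mouthClose", 0.9),
    "m": ("mouthClose", 0.9),
}


def apply_overlays(
    frames: List[Dict[str, float]],
    phoneme_per_frame: List[str],
    overlays: Dict[str, Tuple[str, float]] = OVERLAYS,
) -> List[Dict[str, float]]:
    """Return a copy of the predicted frames with consonant requirements
    enforced as lower bounds on the relevant blend shape weight."""
    result = []
    for weights, phoneme in zip(frames, phoneme_per_frame):
        adjusted = dict(weights)  # copy so the model output is left untouched
        if phoneme in overlays:
            shape, minimum = overlays[phoneme]
            # Raise the weight only if the prediction falls short of the bound.
            adjusted[shape] = max(adjusted.get(shape, 0.0), minimum)
        result.append(adjusted)
    return result


# Example: three frames of (hypothetical) model output, aligned with phonemes.
predicted = [
    {"mouthClose": 0.2, "jawOpen": 0.5},
    {"mouthClose": 0.95},
    {"jawOpen": 0.4},
]
corrected = apply_overlays(predicted, ["m", "p", "a"])
```

In this sketch the vowel frame is passed through unchanged, so the learned temporal dynamics and coarticulation for vowels are preserved, and only frames aligned with a listed consonant are adjusted.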