Generating Facial Animation With Emotions In A Neural Text-To-Speech Pipeline

University essay from Linköpings universitet/Medie- och Informationsteknik; Linköpings universitet/Tekniska högskolan

Abstract: This thesis presents the work of incorporating facial animation with emotions into a neural text-to-speech pipeline. The project aims to allow a digital human to utter sentences given only text, removing the need for video input. Our solution consists of a neural network that generates blend shape weights from speech and is placed in a neural text-to-speech pipeline. We build on ideas from previous work and implement a recurrent neural network using four LSTM layers, and later extend this implementation by incorporating emotions. The emotions are learned by the network itself via the emotion layer and are used at inference to produce the desired emotion. While using LSTMs for speech-driven facial animation is not a new idea, it has not yet been combined with the idea of using emotional states that are learned by the network itself. Previous approaches are either only two-dimensional, complicated in design, or require manual work to define the emotional states. Thus, we implement a network of simple design that takes advantage of the sequence-processing ability of LSTMs and combines it with the idea of emotional states. We trained several variations of the network on data captured using a head-mounted camera, and the results of the best-performing model were used in a subjective evaluation. During the evaluation, the participants were presented with several videos and asked to rate the naturalness of the face uttering the sentence. The results showed that the naturalness of the face depends greatly on which emotion vector was used, as some vectors limited the mobility of the face. However, our best-performing emotion vector was rated at the same level of naturalness as the ground truth, which indicates that our method was successful. The purpose of the thesis was fulfilled, as our implementation demonstrates one possibility of incorporating facial animation into a text-to-speech pipeline.
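
To make the architecture described in the abstract more concrete, the following is a minimal sketch of a four-layer LSTM that maps a sequence of speech feature frames plus an emotion vector to per-frame blend shape weights. It is not the thesis implementation. The class names (`LSTMLayer`, `SpeechToBlendShapes`), all dimensions, the random untrained weights, the choice to concatenate the emotion vector to every input frame, and the sigmoid output are assumptions made only for illustration; the abstract does not specify these details.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMLayer:
    """One LSTM layer with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, g = cell candidate, o = output.
        self.w = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                         for _ in range(hidden_size)] for gate in "ifgo"}
        self.b = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def _gate(self, name, z, activation):
        return [activation(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(self.w[name], self.b[name])]

    def forward(self, sequence):
        """Run the layer over a list of frames; return one hidden state per frame."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in sequence:
            z = list(x) + h  # current input joined with previous hidden state
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            g = self._gate("g", z, math.tanh)
            o = self._gate("o", z, sigmoid)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        return outputs


class SpeechToBlendShapes:
    """Hypothetical model: four stacked LSTM layers and a linear output layer."""

    def __init__(self, n_audio=8, n_emotion=4, hidden=16, n_blend=5, seed=0):
        rng = random.Random(seed)
        sizes = [n_audio + n_emotion] + [hidden] * 4
        self.layers = [LSTMLayer(sizes[k], sizes[k + 1], rng) for k in range(4)]
        self.w_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)]
                      for _ in range(n_blend)]

    def forward(self, audio_frames, emotion):
        # Assumption: the emotion vector is appended to every speech frame.
        seq = [list(frame) + list(emotion) for frame in audio_frames]
        for layer in self.layers:
            seq = layer.forward(seq)
        # Assumption: a sigmoid keeps each blend shape weight between 0 and 1.
        return [[sigmoid(sum(w * v for w, v in zip(row, h))) for row in self.w_out]
                for h in seq]
```

Under these assumptions, calling `SpeechToBlendShapes().forward(frames, emotion)` with ten 8-dimensional frames and a 4-dimensional emotion vector returns ten lists of five weights. Passing a different emotion vector with the same speech frames changes the output, which corresponds to selecting a desired emotion at inference.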
