Generative Adversarial Networks in Lip-Synchronized Deepfakes for Personalized Video Messages

University essay from Lunds universitet/Matematik LTH

Abstract: Recent progress in deep learning has enabled more powerful frameworks for creating good-quality deepfakes. Deepfakes, which are mostly known for malicious purposes, have great potential to be useful in areas such as the movie industry, education, and personalized messaging. This thesis focuses on lip-synchronization, one component of a broader pipeline for producing personalized video messages using deepfakes. For this application, the deep learning framework Generative Adversarial Networks (GANs), conditioned on given audio and video inputs, was used. The objectives were to implement a structure that performs lip-synchronization, investigate which GAN variants excel at this task, and examine how different datasets impact the results. Three models were investigated: first, the GAN architecture LipGAN was reimplemented in PyTorch; second, a GAN variation, WGAN-GP, was adapted to the LipGAN architecture; and third, a novel approach drawing inspiration from both models, L1WGAN-GP, was developed and implemented. All models were trained on the GRID dataset and benchmarked with the metrics PSNR, SSIM, and FID. Lastly, the influence of the training dataset was tested by comparing our implementation of LipGAN with another implementation trained on a different dataset, LRS2. WGAN-GP failed to converge, with results suggesting mode collapse. Of the two remaining models, the LipGAN implementation performed best in terms of PSNR and SSIM, whereas L1WGAN-GP outperformed LipGAN according to the FID score; however, L1WGAN-GP produced samples polluted by artifacts. Our models trained on the GRID dataset generalized poorly compared to the same model trained on LRS2. Additionally, models trained on less data were outperformed by models trained on the full dataset.
Finally, our results suggest that LipGAN was the best-performing network, and with it we successfully produced satisfactory lip-synchronization.
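As a brief illustration of the per-frame image metrics named in the abstract, the following sketch computes PSNR between a reference frame and a generated frame (this is a standard definition, not code from the thesis; SSIM and FID are more involved and are typically taken from library implementations such as scikit-image and pytorch-fid):

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames of equal shape.

    PSNR = 10 * log10(max_val^2 / MSE); higher is better,
    and identical frames give infinite PSNR.
    """
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a frame offset by a constant error of 10 gray levels.
ref = np.zeros((64, 64), dtype=np.float64)
gen = ref + 10.0
score = psnr(ref, gen)  # MSE = 100, so PSNR = 10 * log10(65025 / 100) ≈ 28.1 dB
```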
