Video Retargeting using Vision Transformers : Utilizing deep learning for video aspect ratio change

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: A video is shot and produced in a single aspect ratio, yet it is played on devices whose screens come in many different aspect ratios, which makes video retargeting a relevant topic. Video retargeting is the process of fitting a video filmed in one aspect ratio to a screen with another. The retargeted video should ideally preserve the important content and structure of the original video and be free of visual artifacts. What counts as important content and important structure is vague and subjective, which makes the problem harder to solve. Video retargeting has challenged researchers in computer vision, computer graphics and human-computer interaction, and successful retargeting can improve the viewing experience and the content's aesthetic value. Retargeting is performed with four operators: cropping, scaling, seam carving and seam adding. Previous research has shown that one of the keys to successful retargeting is a suitable combination of operators. This study uses a vision transformer, a deep learning model, which is trained to discriminate between original and retargeted videos. By solving an optimization problem with beam search, the transformer assists in choosing the combination of operators that results in the best possible retargeted video. The retargeted videos were evaluated in an A/B user test, in which users chose their preferred variant of a video shot: the transformer's output found with beam search, or a version in which the video underwent a single retargeting operation. The model's preferences and the users' preferences were compared to check whether the model can indeed make retargeting decisions that are appealing for humans to watch. A significance test was inconclusive, probably because of insufficient test data. However, the study revealed patterns in the preferences of the users and the model that could be further fine-tuned or combined with other computer vision mechanisms in order to output better retargeted videos.
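
The operator selection described in the abstract can be illustrated with a minimal beam search sketch in Python. Everything beyond the four operator names is an assumption made for illustration: `toy_score` is a hypothetical stand-in for the trained vision transformer (which in the study judges how much a retargeted video resembles an original one), its weights are invented, and the abstract does not state the beam width, the sequence length or the exact objective the thesis uses.

```python
from typing import Callable, List, Sequence, Tuple

# The four retargeting operators named in the abstract.
OPERATORS: Tuple[str, ...] = ("crop", "scale", "seam_carve", "seam_add")

OpSequence = Tuple[str, ...]


def beam_search(
    score: Callable[[OpSequence], float],
    n_steps: int,
    beam_width: int,
    operators: Sequence[str] = OPERATORS,
) -> OpSequence:
    """Return the highest-scoring operator sequence of length n_steps.

    At every step each kept sequence is extended by every operator,
    and only the beam_width best extensions survive to the next step.
    """
    beams: List[OpSequence] = [()]
    for _ in range(n_steps):
        candidates = [seq + (op,) for seq in beams for op in operators]
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Hypothetical per-operator quality weights (invented for this sketch).
TOY_WEIGHTS = {"crop": 1.0, "seam_carve": 0.9, "scale": 0.8, "seam_add": 0.1}


def toy_score(seq: OpSequence) -> float:
    """Toy stand-in for the discriminator: higher means 'looks more original'.

    Each operator contributes its weight, and every repeated use of an
    operator costs 0.5, so mixed combinations score better than repeats.
    """
    repeats = len(seq) - len(set(seq))
    return sum(TOY_WEIGHTS[op] for op in seq) - 0.5 * repeats


best = beam_search(toy_score, n_steps=3, beam_width=2)
print(best, toy_score(best))
```

In a real pipeline the scoring function would apply the operator sequence to the video shot and pass the result to the trained model; beam search then trades exhaustive enumeration of all operator sequences for a bounded number of model evaluations per step.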
