Supervisors: Dr. Shuhang Gu, Dr. Radu Timofte
Unlike image-to-image translation, video-to-video translation must preserve spatio-temporal consistency between frames in addition to making each frame photorealistic, which makes the task more complex and challenging. In the video domain, paired data is hard or sometimes impossible to obtain, so video-to-video translation must be conducted in an unsupervised way. However, existing unsupervised video translation methods fail to produce translations that are frame-wise realistic, temporally consistent at the video level, and dynamically vivid; they also fail to balance multimodality across translations against style consistency within a single translation. In this work, we propose a recurrent multimodal unsupervised video translation model that produces frame-wise realistic, spatio-temporally consistent translations in a multimodal way. The recurrent frame-generation approach also enables us to translate lengthy video sequences. Experiments validate the superiority of our model over the baselines.
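The recurrent frame-generation idea mentioned above can be illustrated with a toy sketch: each translated frame is conditioned on the current source frame, the previously generated frame, and a sampled style code, so a video of arbitrary length can be translated frame by frame while one fixed style code keeps the translation stylistically consistent. The linear "generator" below is a hypothetical stand-in for the actual network; all names, shapes, and weights are illustrative assumptions, not the model from this work.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8          # toy frame size (hypothetical)
Z = 4              # style-code dimension (hypothetical)

# Random weights of a toy linear generator, standing in for a CNN.
W_src  = rng.standard_normal((H * W, H * W)) * 0.01
W_prev = rng.standard_normal((H * W, H * W)) * 0.01
W_sty  = rng.standard_normal((H * W, Z)) * 0.01

def generate_frame(src_frame, prev_out, style):
    """One recurrent step: fuse the current source frame, the previously
    generated frame, and the style code into the next output frame."""
    x = (W_src @ src_frame.ravel()
         + W_prev @ prev_out.ravel()
         + W_sty @ style)
    return np.tanh(x).reshape(H, W)

def translate_video(src_video, style):
    """Translate a source video frame by frame; because each step only
    needs the previous output, the sequence length is unbounded."""
    prev = np.zeros((H, W))
    out = []
    for frame in src_video:
        prev = generate_frame(frame, prev, style)
        out.append(prev)
    return np.stack(out)

video = rng.standard_normal((20, H, W))   # 20 source frames
style = rng.standard_normal(Z)            # one sampled style = one mode
translated = translate_video(video, style)
print(translated.shape)                   # (20, 8, 8)
```

Sampling a different `style` vector would yield a different translation of the same source video, which is the multimodal aspect; reusing one `style` across all frames is what keeps a single translation stylistically consistent.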