Bernhard Kratzwald

Master's Thesis
Supervisors: Dr. Zhiwu Huang, Dr. Danda Pani Paudel, and Prof. Luc Van Gool

Towards an Understanding of Our World by GANing Videos in the Wild

Videos contain rich information about our world, in particular how objects behave, move, occlude, deform, and interact with each other. Fully understanding the temporal and spatial dependencies in videos is one of the core problems in computer vision. Unfortunately, conventional supervised techniques for learning from videos are undermined by the difficulty of collecting sufficient labeled training data, mainly due to the high-dimensional nature of videos. Unsupervised learning from videos, for example with generative models, is therefore a research topic of great importance. However, existing generative models work well only for videos with static backgrounds. For dynamic scenes, applying these models strictly requires an extra pre-processing step of background stabilization. In fact, background stabilization may very often be impossible for videos in the wild. In this work, we present the first video generation framework that works in the wild, without making any assumption of static backgrounds whatsoever, which allows us to avoid the background stabilization step altogether. The proposed method not only works well on videos in the wild, but also outperforms state-of-the-art methods when the static-background assumption holds. This is achieved by designing a robust one-stream video generator architecture that exploits the Wasserstein GAN framework for better convergence. Since the proposed architecture has a single stream that does not formally distinguish between foreground and background, it can generate -- and learn from -- videos with dynamic backgrounds. The superiority of our model is demonstrated by successfully applying it to three challenging problems: video colorization, video inpainting, and future prediction.
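
To make the architectural idea concrete, the following is a minimal PyTorch-style sketch of a one-stream spatiotemporal generator trained against a WGAN critic with gradient penalty. It is an illustration under stated assumptions, not the thesis's exact configuration: the latent dimension, layer widths, the 32-frame 64x64 clip shape, and the penalty weight are all illustrative choices. What it shows is that a single ConvTranspose3d stack synthesizes foreground and background jointly, with no separate background stream or mask, and that the Wasserstein objective provides the improved convergence referred to above.

```python
import torch
import torch.nn as nn

LATENT_DIM = 128          # assumed latent size, not the thesis's exact value
CLIP = (3, 32, 64, 64)    # assumed clip shape: RGB, 32 frames, 64x64 pixels


def up_block(c_in, c_out, k=4, s=2, p=1):
    # Spatio-temporal upsampling block; one stream models the whole frame.
    return nn.Sequential(
        nn.ConvTranspose3d(c_in, c_out, k, s, p, bias=False),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )


class Generator(nn.Module):
    """One-stream generator: latent vector -> full video clip, with no
    separate foreground/background branches or static-background mask."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            up_block(LATENT_DIM, 512, k=(2, 4, 4), s=1, p=0),  # -> 2 x 4 x 4
            up_block(512, 256),                                # -> 4 x 8 x 8
            up_block(256, 128),                                # -> 8 x 16 x 16
            up_block(128, 64),                                 # -> 16 x 32 x 32
            nn.ConvTranspose3d(64, 3, 4, 2, 1),                # -> 32 x 64 x 64
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), LATENT_DIM, 1, 1, 1))


class Critic(nn.Module):
    """WGAN critic scoring whole clips; no batch norm, as the gradient
    penalty requires per-sample gradients."""
    def __init__(self):
        super().__init__()
        def blk(c_in, c_out):
            return nn.Sequential(nn.Conv3d(c_in, c_out, 4, 2, 1),
                                 nn.LeakyReLU(0.2, inplace=True))
        self.net = nn.Sequential(
            blk(3, 64), blk(64, 128), blk(128, 256), blk(256, 512),
            nn.Conv3d(512, 1, (2, 4, 4)),  # -> one scalar score per clip
        )

    def forward(self, x):
        return self.net(x).view(-1)


def gradient_penalty(critic, real, fake):
    # WGAN-GP term: penalize the critic's gradient norm away from 1 on
    # random interpolates between real and generated clips.
    eps = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp,
                                create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()


# One illustrative training step: the Wasserstein objective replaces the
# saturating GAN loss, which is what stabilizes training here.
G, D = Generator(), Critic()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

real = torch.randn(4, *CLIP)            # stand-in for a batch of real clips
z = torch.randn(4, LATENT_DIM)

fake = G(z)
d_loss = (D(fake.detach()).mean() - D(real).mean()
          + 10.0 * gradient_penalty(D, real, fake.detach()))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

g_loss = -D(G(z)).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Because there is no mask or dedicated background branch to train, a generator of this form can absorb camera motion and moving backgrounds directly from the data, which is what lets the approach dispense with background stabilization.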