Rui Gong

Master Thesis
Supervisors: Dr. Wen Li, Yuhua Chen, and Prof. Luc Van Gool

3D Construction for Indoor Scene Understanding

Human-like visual understanding of a scene is a fundamental yet challenging problem in machine perception. The process involves reasoning over knowledge of different types, e.g. semantic information, geometric structure, and physical constraints. Existing computer vision tasks (classification, detection, 3D reconstruction, etc.) capture only part of the human ability for visual understanding. In this thesis, we target holistic scene understanding through "3D construction". We define the "3D construction" task as constructing a scene from a database of 3D CAD models, given an input 2D RGB image. The aim is to maximize the similarity between the scene in the input image and the constructed scene by drawing 3D CAD models from the database to represent the objects in the scene and arranging them in 3D space according to their 3D attributes (translation, rotation, scale). Toward this goal, we present an end-to-end framework that constructs the scene automatically. The framework consists of three main modules: an object detection module, a 3D attributes estimation module, and a 3D CAD model selection module. Given the 2D RGB scene image, the object detection module predicts the semantic category and bounding box of each object in the image. The RGB image and the detection results are then used as input to the other two modules. The 3D attributes estimation module predicts the translation, rotation, and scale of each object in the camera coordinate frame. The 3D CAD model selection module chooses the most appropriate 3D CAD models from the database using a novel sample-efficient REINFORCE algorithm. We conduct our experiments on the PBR dataset, where our 3D construction framework is able to predict the 3D attributes (translation, rotation, scale) and select appropriate 3D CAD models from the database. The results indicate that our framework may provide a promising way to construct a 3D scene automatically from a single image.
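
To illustrate what arranging a CAD model according to its 3D attributes can mean in practice, the following minimal sketch maps model-space vertices into the camera coordinate frame. The function name `place_model`, the single yaw angle about the vertical axis, and the scale-rotate-translate order are assumptions made for this illustration and are not taken from the thesis.

```python
import math

def place_model(vertices, scale, yaw, translation):
    """Map CAD-model vertices from model space into the camera frame.

    Assumed convention (hypothetical, not from the thesis): apply a per-axis
    scale, then rotate about the vertical y-axis by `yaw` radians, then
    translate.
    """
    sx, sy, sz = scale
    tx, ty, tz = translation
    c, s = math.cos(yaw), math.sin(yaw)
    placed = []
    for x, y, z in vertices:
        x, y, z = x * sx, y * sy, z * sz         # per-axis scale
        x, z = c * x + s * z, -s * x + c * z     # rotation about the y-axis
        placed.append((x + tx, y + ty, z + tz))  # translation
    return placed

# A unit point on the x-axis, doubled in size, turned a quarter turn about
# the y-axis, and moved 5 units along the z-axis.
corner = place_model([(1.0, 0.0, 0.0)], (2.0, 2.0, 2.0), math.pi / 2, (0.0, 0.0, 5.0))[0]
```

A general rotation has three degrees of freedom; restricting it to one angle here only keeps the example short.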