Self-Supervised Shape and Appearance Modeling via Neural Differentiable Graphics


Inferring 3D shape and appearance from natural images is a fundamental challenge in computer vision. Despite recent progress using deep learning methods, a key limitation is the availability of annotated training data, as acquisition is often very challenging and expensive, especially at a large scale. This thesis proposes to incorporate physical priors into neural networks that allow for self-supervised learning. As a result, easy-to-access unlabeled data can be used for model training. In particular, novel algorithms in the context of 3D reconstruction and texture/material synthesis are introduced, where only image data is available as supervisory signal. First, a method that learns to reason about 3D shape and appearance solely from unstructured 2D images, achieved via differentiable rendering in an adversarial fashion, is proposed. As shown next, learning from videos significantly improves 3D reconstruction quality. To this end, a novel ray-conditioned warp embedding is proposed that aggregates pixel-wise features from multiple source images. Addressing the challenging task of disentangling shape and appearance, first a method that enables 3D texture synthesis independent of shape or resolution is presented. For this purpose, 3D noise fields of different scales are transformed into stationary textures. The method is able to produce 3D textures, despite only requiring 2D textures for training. Lastly, the surface characteristics of textures under different illumination conditions are modeled in the form of material parameters. Therefore, a self-supervised approach is proposed that has no access to material parameters but only flash images. Similar to the previous method, random noise fields are reshaped to material parameters, which are conditioned to replicate the visual appearance of the input under matching light.