Unsupervised Learning of 3D Object Categories from Videos in the Wild

1University College London, 2Meta AI
CVPR 2021

Overview

Teaser.

We present a novel deep architecture built around a Warp-conditioned Ray Embedding (WCR) that reconstructs and renders new views (right) of object categories from one or a few input images (middle). Our model is trained automatically from videos of the objects (left) and works on difficult real data where competing architectures fail to produce good results.
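The core of a warp-conditioned embedding is retrieving, for a 3D point sampled along a target-view ray, the image feature at the pixel where that point projects in a source view. Below is a minimal NumPy sketch of that retrieval step under a standard pinhole camera model; the function names and the bilinear-sampling helper are illustrative, not the paper's implementation.

```python
import numpy as np

def project(point_world, K, R, t):
    """Project a 3D world point into a source camera (pinhole model).

    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation
    (world-to-camera). Returns continuous pixel coordinates (u, v).
    """
    p_cam = R @ point_world + t           # world -> camera coordinates
    u, v = (K @ p_cam)[:2] / p_cam[2]     # perspective division
    return u, v

def sample_feature(feat_map, u, v):
    """Bilinearly sample a C x H x W feature map at pixel (u, v)."""
    C, H, W = feat_map.shape
    u = np.clip(u, 0.0, W - 1)
    v = np.clip(v, 0.0, H - 1)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat_map[:, v0, u0]
            + du * (1 - dv) * feat_map[:, v0, u1]
            + (1 - du) * dv * feat_map[:, v1, u0]
            + du * dv * feat_map[:, v1, u1])
```

With several source views, the sampled feature vectors would then be aggregated (e.g. averaged) into the single conditioning vector described in the method section.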


Method

Our method takes an image as input and produces per-pixel features using a U-Net. We then cast rays from a target view and retrieve per-pixel features from one or multiple source images. Once all spatial feature vectors are aggregated into a single feature vector, we combine it with its harmonic embedding and pass the result to an MLP, yielding per-location colors and opacities. Finally, we use differentiable raymarching to produce the rendered image.
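Two of the steps above are standard and easy to make concrete: the harmonic (positional) embedding of coordinates, and emission-absorption raymarching that composites the MLP's per-sample colors and opacities into a pixel. The NumPy sketch below shows both under common NeRF-style conventions (octave frequencies, alpha = 1 - exp(-density * step)); the exact frequencies, parameterization, and aggregation in the paper may differ.

```python
import numpy as np

def harmonic_embedding(x, n_freqs=4):
    """Map coordinates (..., D) to sines/cosines at octave frequencies.

    Returns an array of shape (..., D * 2 * n_freqs).
    """
    freqs = 2.0 ** np.arange(n_freqs)                # 1, 2, 4, 8
    ang = x[..., None] * freqs                       # (..., D, n_freqs)
    emb = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return emb.reshape(*x.shape[:-1], -1)

def raymarch(colors, opacities, deltas):
    """Emission-absorption compositing along one ray.

    colors: (N, 3) per-sample RGB, opacities: (N,) densities,
    deltas: (N,) distances between consecutive samples.
    """
    alpha = 1.0 - np.exp(-opacities * deltas)        # per-sample opacity
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    weights = alpha * trans
    return (weights[:, None] * colors).sum(axis=0)   # rendered RGB
```

In the full pipeline, the MLP consumes the aggregated source-view feature together with this embedding of the ray point, and `raymarch` is applied per pixel; because every step is differentiable, the model can be trained from video frames with a photometric loss.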

Overview.

Dataset

To study the learning of 3D object categories in the wild, we crowd-sourced a large collection of object-centric videos via Amazon Mechanical Turk (AMT).

Camera tracks.

Results

Single image reconstruction

Comparison figure (columns: source image, Mesh, Voxel, Voxel+MLP, MLP, ours).

Multi-view reconstruction

Donut

Reconstructions using 1, 3, 5, and 7 source images (interactive viewer).

Hydrant

Reconstructions using 1, 3, 5, and 7 source images (interactive viewer).

BibTeX

@inproceedings{henzler2021unsupervised,
    author    = {Henzler, Philipp and Reizenstein, Jeremy and Labatut, Patrick and Shapovalov, Roman and Ritschel, Tobias and Vedaldi, Andrea and Novotny, David},
    title     = {Unsupervised Learning of 3D Object Categories from Videos in the Wild},
    booktitle = {CVPR},
    year      = {2021},
}