Vickie Ye

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-162

August 8, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-162.pdf

As we begin to interact with AI systems, we need them to be able to interpret the visual world in 4D — that is, to perceive the geometry and motion in the world. However, pixel differences in image space can arise from either the scene's geometry (via camera motion) or from motion in the scene itself. Disentangling these two sources from a single video is extremely under-constrained.

In this thesis, I build several systems that recover scene representations from limited image observations. Specifically, I study a series of problems that build toward the 4D monocular recovery problem, each one addressing a different aspect of the under-constrained nature of the problem.

First, I study the problem of recovering shape from under-constrained inputs, without scene motion. Specifically, I present pixelNeRF, a method to synthesize novel views of a static scene from a single view or a few views. We learn a scene prior by training a 3D neural representation conditioned on image features across multiple scenes. This learned scene prior enables 3D scene completion from the under-constrained input of one or a few images. Next, I study the problem of recovering motion without 3D shape. In particular, I present Deformable Sprites, a method to extract persistent elements of a dynamic scene from an input video. We represent each element as a 2D image layer that deforms across the video.

Finally, I present two studies of the joint recovery of both the shape and motion of the 4D world from any single video. I first study the special case of dynamic humans and present SLAHMR, in which we recover from a single video the global poses of all the humans and the camera in the world coordinate frame. I then move on to the general case of recovering arbitrary dynamic objects from a single video in Shape of Motion, in which we recover the entire scene as 4D Gaussians, which we can use for dynamic novel view synthesis and 3D tracking.

Advisor: Angjoo Kanazawa


BibTeX citation:

@phdthesis{Ye:EECS-2024-162,
    Author= {Ye, Vickie},
    Title= {Discovering the 4D World Behind Any Video},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-162.html},
    Number= {UCB/EECS-2024-162},
    Abstract= {As we begin to interact with AI systems, we need them to be able to interpret the visual world in 4D — that is, to perceive the geometry and motion in the world. However, pixel differences in image space can arise from either the scene's geometry (via camera motion) or from motion in the scene itself. Disentangling these two sources from a single video is extremely under-constrained.

In this thesis, I build several systems that recover scene representations from limited image observations. Specifically, I study a series of problems that build toward the 4D monocular recovery problem, each one addressing a different aspect of the under-constrained nature of the problem.

First, I study the problem of recovering shape from under-constrained inputs, without scene motion. Specifically, I present pixelNeRF, a method to synthesize novel views of a static scene from a single view or a few views. We learn a scene prior by training a 3D neural representation conditioned on image features across multiple scenes. This learned scene prior enables 3D scene completion from the under-constrained input of one or a few images. Next, I study the problem of recovering motion without 3D shape. In particular, I present Deformable Sprites, a method to extract persistent elements of a dynamic scene from an input video. We represent each element as a 2D image layer that deforms across the video.

Finally, I present two studies of the joint recovery of both the shape and motion of the 4D world from any single video. I first study the special case of dynamic humans and present SLAHMR, in which we recover from a single video the global poses of all the humans and the camera in the world coordinate frame. I then move on to the general case of recovering arbitrary dynamic objects from a single video in Shape of Motion, in which we recover the entire scene as 4D Gaussians, which we can use for dynamic novel view synthesis and 3D tracking.},
}

EndNote citation:

%0 Thesis
%A Ye, Vickie 
%T Discovering the 4D World Behind Any Video
%I EECS Department, University of California, Berkeley
%D 2024
%8 August 8
%@ UCB/EECS-2024-162
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-162.html
%F Ye:EECS-2024-162