Deep Generative Priors for View Synthesis at Scale
Hang Gao and Angjoo Kanazawa and Jitendra Malik and Alexei (Alyosha) Efros and Shubham Tulsiani
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-170
August 19, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-170.pdf
View synthesis—the task of generating photorealistic images of a scene from novel camera viewpoints—is a cornerstone of computer vision, underpinning graphics, immersive reality, and embodied AI. Yet despite its importance, view synthesis has not demonstrated scaling properties comparable to those in language or 2D generation, even when provided with more data and compute: reconstruction-based methods collapse under sparse views or scene motion, while generative models struggle with 3D consistency and precise camera control.
This thesis shows that deep generative priors—instantiated as diffusion models conditioned on camera poses—bridge this gap. We proceed in three steps. First, we reveal that state-of-the-art dynamic view-synthesis benchmarks quietly rely on multi-view cues; removing those cues triggers steep performance drops and exposes the brittleness of reconstruction-based models. Then, we present a working solution that injects learned monocular depth and long-range tracking priors into a dynamic 3D Gaussian scene representation, recovering globally consistent geometry and motion from a single video. Finally, we abandon explicit reconstruction altogether, coupling camera-conditioned diffusion with a two-pass sampling strategy to synthesize minute-long, camera-controlled videos from as little as one input image.
From diagnosing the limits of reconstruction, to augmenting it with data-driven regularizers, to replacing it with a fully generative pipeline, our results trace a clear progression that delivers state-of-the-art fidelity, temporal coherence, and camera control precision while requiring orders-of-magnitude less input signal. We conclude by outlining open challenges and future directions for scaling view synthesis to truly world-scale 3D environments.
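To make the final step concrete, the toy sketch below illustrates one plausible reading of a two-pass, camera-conditioned sampling scheme: a first pass generates sparse anchor frames conditioned on the single input image for long-range consistency, and a second pass fills in the intermediate frames conditioned on neighboring anchors for temporal coherence. This is an illustrative sketch only, not the thesis's actual model or algorithm; all names (denoise_step, sample_frame, two_pass_sampling, anchor_stride) and the toy denoiser are hypothetical stand-ins.

    # Illustrative sketch only: a toy two-pass, camera-conditioned sampler.
    # All names and the denoiser below are hypothetical stand-ins for the
    # thesis's actual model and sampling procedure.
    import numpy as np

    H, W, C = 32, 32, 3   # tiny frames for illustration
    NUM_STEPS = 8         # toy number of reverse-diffusion steps

    def denoise_step(x, t, cond_frames, cam_pose, rng):
        """Stand-in for one camera-conditioned denoising step. A real model
        would predict noise from (x, t, conditioning frames, camera pose);
        here we just nudge x toward the mean of its conditioning frames."""
        target = np.mean(cond_frames, axis=0) if cond_frames else np.zeros_like(x)
        blend = 1.0 / (t + 1)
        return (1 - blend) * x + blend * target + 0.01 * rng.standard_normal(x.shape)

    def sample_frame(cond_frames, cam_pose, rng):
        """Run the (toy) reverse diffusion chain for a single frame."""
        x = rng.standard_normal((H, W, C))
        for t in reversed(range(NUM_STEPS)):
            x = denoise_step(x, t, cond_frames, cam_pose, rng)
        return x

    def two_pass_sampling(input_image, camera_trajectory, anchor_stride=8, rng=None):
        """Pass 1: sparse anchor frames conditioned on the input image.
        Pass 2: dense frames conditioned on their neighboring anchors."""
        rng = rng or np.random.default_rng(0)
        num_frames = len(camera_trajectory)

        # Pass 1: anchors keep long-range consistency with the input view.
        anchors = {0: input_image}
        for i in range(anchor_stride, num_frames, anchor_stride):
            anchors[i] = sample_frame([input_image], camera_trajectory[i], rng)

        # Pass 2: fill in between anchors for temporal coherence.
        video = [None] * num_frames
        anchor_ids = sorted(anchors)
        for i in range(num_frames):
            if i in anchors:
                video[i] = anchors[i]
                continue
            left = max(a for a in anchor_ids if a < i)
            right = [a for a in anchor_ids if a > i]
            cond = [anchors[left]] + ([anchors[right[0]]] if right else [])
            video[i] = sample_frame(cond, camera_trajectory[i], rng)
        return np.stack(video)

    # Example: a 1-frame input image and a 33-pose "orbit" trajectory.
    frames = two_pass_sampling(np.zeros((H, W, C)),
                               camera_trajectory=[{"azimuth": a} for a in range(33)])
    print(frames.shape)  # (33, 32, 32, 3)

The point of the two passes in this sketch is that anchors are tied directly to the input view, so drift cannot accumulate over a minute-long trajectory, while the dense pass only has to interpolate between nearby anchors.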
Advisor: Angjoo Kanazawa
BibTeX citation:
@phdthesis{Gao:EECS-2025-170,
    Author = {Gao, Hang and Kanazawa, Angjoo and Malik, Jitendra and Efros, Alexei (Alyosha) and Tulsiani, Shubham},
    Title = {Deep Generative Priors for View Synthesis at Scale},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {Aug},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-170.html},
    Number = {UCB/EECS-2025-170},
    Abstract = {View synthesis—the task of generating photorealistic images of a scene from novel camera viewpoints—is a cornerstone of computer vision, underpinning graphics, immersive reality, and embodied AI. Yet despite its importance, view synthesis has not demonstrated scaling properties comparable to those in language or 2D generation, even when provided with more data and compute: reconstruction-based methods collapse under sparse views or scene motion, while generative models struggle with 3D consistency and precise camera control. This thesis shows that deep generative priors—instantiated as diffusion models conditioned on camera poses—bridge this gap. We proceed in three steps. First, we reveal that state-of-the-art dynamic view-synthesis benchmarks quietly rely on multi-view cues; removing those cues triggers steep performance drops and exposes the brittleness of reconstruction-based models. Then, we present a working solution that injects learned monocular depth and long-range tracking priors into a dynamic 3D Gaussian scene representation, recovering globally consistent geometry and motion from a single video. Finally, we abandon explicit reconstruction altogether, coupling camera-conditioned diffusion with a two-pass sampling strategy to synthesize minute-long, camera-controlled videos from as little as one input image. From diagnosing the limits of reconstruction, to augmenting it with data-driven regularizers, to replacing it with a fully generative pipeline, our results trace a clear progression that delivers state-of-the-art fidelity, temporal coherence, and camera control precision while requiring orders-of-magnitude less input signal. We conclude by outlining open challenges and future directions for scaling view synthesis to truly world-scale 3D environments.}
}
EndNote citation:
%0 Thesis
%A Gao, Hang
%A Kanazawa, Angjoo
%A Malik, Jitendra
%A Efros, Alexei (Alyosha)
%A Tulsiani, Shubham
%T Deep Generative Priors for View Synthesis at Scale
%I EECS Department, University of California, Berkeley
%D 2025
%8 August 19
%@ UCB/EECS-2025-170
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-170.html
%F Gao:EECS-2025-170