Visual Intelligence Beyond Human Supervision
Xudong Wang
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-148
August 10, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-148.pdf
Achieving artificial general intelligence requires developing models capable of perceiving, understanding, and interacting with the world across diverse sensory modalities—beyond the confines of language alone. While self-supervised learning has enabled remarkable advances in large language models (LLMs), replicating this success in the visual domain remains a significant challenge, largely due to the continued reliance on human-annotated data. This dissertation explores how self-supervised learning can unlock visual intelligence beyond human supervision, enabling models to learn directly from the inherent structure and regularities of the visual world.
The thesis presents a series of efforts aimed at advancing this vision. First, it investigates self-supervised visual world understanding, demonstrating that models can achieve strong segmentation performance without the billion-scale mask annotations used by supervised approaches such as the Segment Anything Model (SAM). Instead, our work shows that models can "segment anything" by leveraging the rich semantics present in unlabeled data. Second, it introduces methods that unify generative and discriminative visual models through self-supervision and synthetic data, allowing these systems to complement one another and improve both visual understanding and generation. Third, within a data-centric representation learning framework, the dissertation examines how to build robust visual models through self-supervised debiased learning, proposing techniques that mitigate bias and enhance generalization under imperfect data conditions.
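To make the first contribution concrete: a common recipe in this line of work derives pseudo-masks by spectrally partitioning the affinity graph of self-supervised patch features, then uses those masks as free training targets. The sketch below is a minimal, illustrative version of that general idea, not the dissertation's actual pipeline; in practice the `features` array would come from a self-supervised ViT such as DINO, and the function name, threshold, and foreground heuristic here are assumptions.

```python
# Illustrative sketch only: normalized-cut-style spectral bipartition of a
# patch-affinity graph. In real pipelines `features` would be patch embeddings
# from a self-supervised ViT (e.g., DINO); random features keep this runnable.
import numpy as np

def pseudo_mask(features: np.ndarray, grid_hw: tuple[int, int]) -> np.ndarray:
    """Split patches into two groups via the Fiedler vector of the
    patch-affinity graph; return a binary (H, W) foreground mask."""
    # Cosine affinities between L2-normalized patch features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    A = np.clip(f @ f.T, 0.0, None)            # keep positive affinities only
    d = A.sum(axis=1)
    # Normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    # The eigenvector of the second-smallest eigenvalue (the Fiedler vector)
    # gives the relaxed two-way graph cut.
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]
    mask = (fiedler > np.median(fiedler)).reshape(grid_hw)
    # Heuristic: treat the smaller region as foreground.
    return mask if mask.sum() <= mask.size / 2 else ~mask

# Usage with stand-in features for a 14x14 patch grid:
rng = np.random.default_rng(0)
feats = rng.normal(size=(14 * 14, 384)).astype(np.float32)
print(pseudo_mask(feats, (14, 14)).shape)  # (14, 14)
```

Pseudo-masks produced by this kind of procedure can then supervise a segmentation model in a fully label-free loop, which is the broader pattern the first contribution builds on.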
Together, these contributions serve a common goal: building scalable, multimodal visual intelligence that learns not by mimicking human annotations, but by discovering the latent structure of the world itself.
Advisor: Trevor Darrell
BibTeX citation:
@phdthesis{Wang:EECS-2025-148,
    Author = {Wang, Xudong},
    Title = {Visual Intelligence Beyond Human Supervision},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {Aug},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-148.html},
    Number = {UCB/EECS-2025-148},
    Abstract = {Achieving artificial general intelligence requires developing models capable of perceiving, understanding, and interacting with the world across diverse sensory modalities—beyond the confines of language alone. While self-supervised learning has enabled remarkable advances in large language models (LLMs), replicating this success in the visual domain remains a significant challenge, largely due to the continued reliance on human-annotated data. This dissertation explores how self-supervised learning can unlock visual intelligence beyond human supervision, enabling models to learn directly from the inherent structure and regularities of the visual world. The thesis presents a series of efforts aimed at advancing this vision. First, it investigates self-supervised visual world understanding, demonstrating that models can achieve strong segmentation performance without the billion-scale mask annotations used by supervised approaches such as the Segment Anything Model (SAM). Instead, our work shows that models can ``segment anything'' by leveraging the rich semantics present in unlabeled data. Second, it introduces methods that unify generative and discriminative visual models through self-supervision and synthetic data, allowing these systems to complement one another and improve both visual understanding and generation. Third, within a data-centric representation learning framework, the dissertation examines how to build robust visual models through self-supervised debiased learning, proposing techniques that mitigate bias and enhance generalization under imperfect data conditions. Together, these contributions serve a common goal: building scalable, multimodal visual intelligence that learns not by mimicking human annotations, but by discovering the latent structure of the world itself.}
}
EndNote citation:
%0 Thesis
%A Wang, Xudong
%T Visual Intelligence Beyond Human Supervision
%I EECS Department, University of California, Berkeley
%D 2025
%8 August 10
%@ UCB/EECS-2025-148
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-148.html
%F Wang:EECS-2025-148