Yossi Gandelsman
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-44
May 7, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-44.pdf
The field of computer vision has recently transitioned from hand-engineering systems to learning them from large-scale datasets via deep learning. This shift motivates a new kind of observational science — closer in spirit to experimental biology than traditional engineering — which aims to discover what is being learned by deep learning models and why these models work. This science analyzes the emergent internal computation in deep vision models, hoping to discover the basic computational blocks that enable visual intelligence.
This thesis presents my initial steps in this observational AI science, focusing on interpreting the internal mechanisms of deep vision models. It showcases how this understanding is used to improve model generalization and unlock new tasks without any additional learning.
I begin with an in-depth analysis of a single vision-language model, CLIP-ViT, and attempt to explain the functionality of two main components in its vision encoder — the attention heads and the neurons. I show that automatic characterization of these components is attainable and reveals surprisingly structured and interpretable behavior, such as head specialization and polysemantic neuron roles. These interpretations enable the removal of spurious features from CLIP, zero-shot image segmentation, and the automatic generation of adversarial images. Next, I show that similar computational components, termed “Rosetta Neurons”, emerge across a diverse set of models with different architectures, training objectives, and forms of supervision. These findings suggest that certain visual concepts and structures are inherently embedded in the natural world and can be learned by different models regardless of the specific task or architecture. This provides a path to a scalable understanding of vision models that can be used to repair and improve future models.
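To make the cross-model matching idea concrete, below is a minimal, simplified sketch of how one might search for "Rosetta"-style units: pair channels from two independently trained networks by how strongly their spatial activation maps correlate over the same inputs. This is an illustrative sketch, not the thesis' exact procedure; it assumes two supervised torchvision CNNs (ResNet-50 and VGG-16) standing in for the much more diverse model set studied in the thesis, plain Pearson correlation as the matching score, and random tensors in place of a real image batch.

import torch
import torch.nn.functional as F
from torchvision.models import resnet50, vgg16

def capture(model, layer, images):
    # Run `images` through `model` and grab the (B, C, H, W) output of `layer`.
    grabbed = []
    handle = layer.register_forward_hook(lambda mod, inp, out: grabbed.append(out.detach()))
    with torch.no_grad():
        model(images)
    handle.remove()
    return grabbed[0]

def unit_signatures(acts, grid=7):
    # Resize every channel's activation map to a common grid, flatten across
    # images and positions, and normalize so a dot product is a Pearson correlation.
    acts = F.adaptive_avg_pool2d(acts, grid)                   # (B, C, grid, grid)
    sig = acts.permute(1, 0, 2, 3).reshape(acts.shape[1], -1)  # (C, B*grid*grid)
    sig = sig - sig.mean(dim=1, keepdim=True)
    return sig / (sig.norm(dim=1, keepdim=True) + 1e-8)

# Two ImageNet-pretrained classifiers (weights are downloaded on first use).
model_a = resnet50(weights="DEFAULT").eval()
model_b = vgg16(weights="DEFAULT").eval()

# Stand-in batch; in practice this would be a batch of real, preprocessed images.
images = torch.randn(8, 3, 224, 224)

acts_a = capture(model_a, model_a.layer3, images)        # (8, 1024, 14, 14)
acts_b = capture(model_b, model_b.features[21], images)  # (8, 512, 28, 28)

# Correlation between every ResNet unit and every VGG unit, then the best matches.
corr = unit_signatures(acts_a) @ unit_signatures(acts_b).T   # (1024, 512)
best_corr, best_idx = corr.max(dim=1)
for i in best_corr.topk(5).indices.tolist():
    print(f"resnet50.layer3 unit {i} <-> vgg16.features[21] unit {best_idx[i].item()} "
          f"(corr = {best_corr[i].item():.2f})")

The thesis applies this kind of correlation-based matching to a far broader set of models and to real images; the sketch above is only meant to make the matching procedure tangible.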
Advisor: Alexei (Alyosha) Efros
";
?>
BibTeX citation:
@phdthesis{Gandelsman:EECS-2025-44,
    Author = {Gandelsman, Yossi},
    Title = {Interpreting the Inner-Workings of Vision Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-44.html},
    Number = {UCB/EECS-2025-44},
    Abstract = {The field of computer vision has recently transitioned from hand-engineering systems to learning them from large-scale datasets via deep learning. This shift motivates a new kind of observational science — closer in spirit to experimental biology than traditional engineering — which aims to discover what is being learned by deep learning models and why these models work. This science analyzes the emergent internal computation in deep vision models, hoping to discover the basic computational blocks that enable visual intelligence. This thesis presents my initial steps in this observational AI science, focusing on interpreting the internal mechanisms of deep vision models. It showcases how this understanding is used to improve model generalization and unlock new tasks without any additional learning. I begin with an in-depth analysis of a single vision-language model, CLIP-ViT, and attempt to explain the functionality of two main components in its vision encoder — the attention heads and the neurons. I show that automatic characterization of these components is attainable and reveals surprisingly structured and interpretable behavior, such as head specialization and polysemantic neuron roles. These interpretations enable the removal of spurious features from CLIP, zero-shot image segmentation, and the automatic generation of adversarial images. Next, I show that similar computational components, termed “Rosetta Neurons”, emerge across a diverse set of models with different architectures, training objectives, and forms of supervision. These findings suggest that certain visual concepts and structures are inherently embedded in the natural world and can be learned by different models regardless of the specific task or architecture. This provides a path to a scalable understanding of vision models that can be used to repair and improve future models.}
}
EndNote citation:
%0 Thesis
%A Gandelsman, Yossi
%T Interpreting the Inner-Workings of Vision Models
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 7
%@ UCB/EECS-2025-44
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-44.html
%F Gandelsman:EECS-2025-44