Do Vision and Language Encoders Represent the World Similarly?

Raiymbek Akshulakov

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-105

May 15, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-105.pdf

Aligned text-image encoders such as CLIP have become the de facto models for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. Even in the absence of the statistical similarity found in aligned encoders like CLIP, we show that a matching of unaligned encoders is possible without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods: a Fast Quadratic Assignment Problem optimization, and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this approach on several downstream tasks, including cross-lingual and cross-domain caption matching and image classification. Code is available at github.com/mayug/0-shot-llm-vision.
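The two ingredients named above, CKA as a similarity measure between latent spaces and seeded graph matching between the resulting similarity graphs, can be illustrated with a minimal Python sketch. The toy data, function names, and the use of SciPy's Fast Approximate QAP solver are assumptions made here for illustration; they are not the report's released implementation (see github.com/mayug/0-shot-llm-vision for that).

    # Illustrative sketch, not the authors' code: linear CKA between two
    # embedding matrices, then seeded graph matching with SciPy's FAQ solver.
    import numpy as np
    from scipy.optimize import quadratic_assignment

    def linear_cka(X, Y):
        """Linear CKA between row-wise feature matrices X (n, d1) and Y (n, d2)."""
        X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
        Y = Y - Y.mean(axis=0, keepdims=True)
        hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2   # cross-covariance term
        norm_x = np.linalg.norm(X.T @ X, ord="fro")
        norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
        return hsic / (norm_x * norm_y)

    # Toy stand-ins for unimodal image and caption embeddings of the same n items.
    rng = np.random.default_rng(0)
    img_emb = rng.standard_normal((100, 768))    # e.g. a vision encoder's features
    txt_emb = rng.standard_normal((100, 1024))   # e.g. a language encoder's features
    print("CKA:", linear_cka(img_emb, txt_emb))

    # Seeded matching: treat each modality's similarity matrix as a graph and
    # align the two graphs with the Fast Approximate QAP solver, fixing a few
    # known image-caption pairs as seeds.
    A = img_emb @ img_emb.T                       # image-image similarity graph
    B = txt_emb @ txt_emb.T                       # text-text similarity graph
    seeds = np.array([[0, 0], [1, 1], [2, 2]])    # assumed known correspondences
    res = quadratic_assignment(A, B, method="faq",
                               options={"maximize": True, "partial_match": seeds})
    print("Recovered permutation (first 10):", res.col_ind[:10])

On real encoder features, a high CKA value would indicate that the two latent spaces share semantic structure, and the permutation returned by the QAP solver would serve as the training-free image-to-caption matching described in the abstract.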

Advisor: Jitendra Malik


BibTeX citation:

@mastersthesis{Akshulakov:EECS-2024-105,
    Author= {Akshulakov, Raiymbek},
    Title= {Do Vision and Language Encoders Represent the World Similarly?},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-105.html},
    Number= {UCB/EECS-2024-105},
    Abstract= {Aligned text-image encoders such as CLIP have become the de-facto model for vision-language
tasks. Furthermore, modality-specific encoders achieve impressive performances
in their respective domains. This raises a central question: does an alignment exist between
uni-modal vision and language encoders since they fundamentally represent the same
physical world? Analyzing the latent spaces structure of vision and language models on
image-caption benchmarks using the Centered Kernel Alignment (CKA), we find that the
representation spaces of unaligned and aligned encoders are semantically similar. In the absence
of statistical similarity in aligned encoders like CLIP, we show that a possible matching
of unaligned encoders exists without any training. We frame this as a seeded graph-matching
problem exploiting the semantic similarity between graphs and propose two methods - a
Fast Quadratic Assignment Problem optimization, and a novel localized CKA metric-based
matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks
including cross-lingual, cross-domain caption matching and image classification. Code available
at github.com/mayug/0-shot-llm-vision.},
}

EndNote citation:

%0 Thesis
%A Akshulakov, Raiymbek 
%T Do Vision and Language Encoders Represent the World Similarly?
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 15
%@ UCB/EECS-2024-105
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-105.html
%F Akshulakov:EECS-2024-105