Towards Vision-Language Foundation Models: Limitations, Improvements, and Generalization

Simon Zhai

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-9
March 6, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-9.pdf

This dissertation investigates critical challenges in the development and training of multimodal foundation models, focusing on limitations in current supervised fine-tuning (SFT) approaches and exploring the potential of reinforcement learning (RL) to achieve robust generalization. The work is presented in two main parts:

Part 1: Understanding Limitations of Multimodal Foundation Models under Supervised Fine-Tuning

Despite their impressive capabilities on benchmark tasks, multimodal large language models (MLLMs) often exhibit surprising weaknesses when faced with seemingly simple tasks requiring deeper understanding or adaptation to novel situations. This dissertation first investigates the phenomenon of catastrophic forgetting in MLLMs, where fine-tuning on new tasks can lead to a significant decline in performance on previously learned tasks. We introduce the Evaluating MulTimodality (EMT) framework, a novel evaluation methodology designed to systematically assess this forgetting. Our findings reveal that even MLLMs leveraging powerful pre-trained vision encoders suffer from substantial performance degradation on basic image classification tasks after SFT. Furthermore, we delve into the specific visual shortcomings of MLLMs. We introduce the MultiModal Visual Patterns (MMVP) benchmark, a carefully curated set of visual question-answering tasks designed to probe the visual grounding capabilities of these models. The results demonstrate systematic failures in state-of-the-art MLLMs, highlighting a strong correlation between weaknesses in the underlying visual encoder (CLIP) and overall model performance. These findings suggest that current SFT approaches, while effective for task-specific adaptation, may not be sufficient for imbuing MLLMs with robust visual understanding and the ability to retain previously acquired knowledge.
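To make the forgetting measurement concrete, the sketch below shows an EMT-style check in Python: prompt an MLLM to classify images from a standard benchmark before and after SFT and compare accuracies. The `mllm_answer` helper, checkpoint names, and lenient string matching are illustrative assumptions, not the dissertation's actual implementation.

```python
# Minimal sketch of an EMT-style catastrophic-forgetting check (assumptions noted above).
from typing import Callable, Iterable, Tuple


def classification_accuracy(
    mllm_answer: Callable[[bytes, str], str],   # (image, prompt) -> free-form text answer
    dataset: Iterable[Tuple[bytes, str]],       # (image bytes, ground-truth label) pairs
    labels: list[str],
) -> float:
    """Ask the MLLM to name the class of each image and score lenient matches."""
    prompt = "What is the object in this image? Answer with one of: " + ", ".join(labels)
    correct = total = 0
    for image, label in dataset:
        prediction = mllm_answer(image, prompt).strip().lower()
        correct += int(label.lower() in prediction)  # lenient substring match
        total += 1
    return correct / max(total, 1)


# Usage: evaluate the same model before and after supervised fine-tuning.
# acc_before = classification_accuracy(base_model_answer, cifar10_test, CIFAR10_LABELS)
# acc_after  = classification_accuracy(sft_model_answer,  cifar10_test, CIFAR10_LABELS)
# forgetting = acc_before - acc_after   # a large positive gap indicates catastrophic forgetting
```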

Part 2: Enhancing Generalization through Reinforcement Learning

Recognizing the limitations of SFT, this dissertation then explores the potential of reinforcement learning (RL) to achieve more robust and generalizable multimodal intelligence. We propose a novel framework for fine-tuning large vision-language models (VLMs) with RL, enabling end-to-end training on tasks that require both visual understanding and language reasoning. A key component of this framework is the incorporation of chain-of-thought (CoT) prompting, which leverages the inherent reasoning capabilities of VLMs to facilitate more efficient exploration and learning. We conduct a comparative analysis of RL and SFT, focusing on generalization to unseen rule variations and novel visual contexts. The results demonstrate that RL fine-tuning consistently leads to superior generalization compared to SFT. Models trained with RL exhibit improved performance on tasks with modified rules, adapt more effectively to variations in visual input, and even show enhanced underlying visual recognition abilities. Furthermore, we investigate the role of inference-time computation, demonstrating that increasing the number of verification iterations during RL training further improves generalization. This highlights that while SFT provides a necessary foundation for instruction following, RL is crucial for achieving robust, adaptable performance in complex, dynamic environments.
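As a rough illustration of the kind of RL fine-tuning loop described above, the sketch below uses a simple REINFORCE-style update with a chain-of-thought prompt. The `vlm`, `environment`, and prompt template are hypothetical placeholders under stated assumptions; the dissertation's actual framework may differ in its objective and training details.

```python
# Hedged sketch of RL fine-tuning a VLM with chain-of-thought prompting (REINFORCE, no baseline).
import torch

COT_TEMPLATE = (
    "You see: {observation}\n"
    "First think step by step, then output your action as 'Action: <choice>'."
)


def parse_action(text: str) -> str:
    """Pull the final action out of the model's chain-of-thought output."""
    return text.split("Action:")[-1].strip()


def rl_finetune_step(vlm, optimizer, environment, batch_size: int = 8) -> float:
    """One policy-gradient step: sample CoT rollouts, collect rewards, update the VLM."""
    losses = []
    for _ in range(batch_size):
        obs = environment.reset()
        prompt = COT_TEMPLATE.format(observation=obs.caption)
        # The VLM generates a chain of thought ending in an action, and returns
        # the summed log-probability of the sampled tokens.
        text, log_prob = vlm.sample(image=obs.image, prompt=prompt)
        reward = environment.step(parse_action(text))  # scalar task reward
        losses.append(-reward * log_prob)              # REINFORCE objective
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```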

In summary, this dissertation provides compelling evidence for the limitations of current SFT-based training of multimodal foundation models and showcases the significant potential of RL to overcome these limitations, paving the way for more generalizable and intelligent AI systems.

Advisors: Sergey Levine and Yi Ma

\"Edit"; ?>


BibTeX citation:

@phdthesis{Zhai:EECS-2025-9,
    Author = {Zhai, Simon},
    Title = {Towards Vision-Language Foundation Models: Limitations, Improvements, and Generalization},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {Mar},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-9.html},
    Number = {UCB/EECS-2025-9},
    Abstract = {This dissertation investigates critical challenges in the development and training of multimodal foundation models, focusing on limitations in current supervised fine-tuning (SFT) approaches and exploring the potential of reinforcement learning (RL) to achieve robust generalization. The work is presented in two main parts:

Part 1: Understanding Limitations of Multimodal Foundation Models under Supervised Fine-Tuning

Despite their impressive capabilities on benchmark tasks, multimodal large language models (MLLMs) often exhibit surprising weaknesses when faced with seemingly simple tasks requiring deeper understanding or adaptation to novel situations. This dissertation first investigates the phenomenon of catastrophic forgetting in MLLMs, where fine-tuning on new tasks can lead to a significant decline in performance on previously learned tasks. We introduce the Evaluating MulTimodality (EMT) framework, a novel evaluation methodology designed to systematically assess this forgetting. Our findings reveal that even MLLMs leveraging powerful pre-trained vision encoders suffer from substantial performance degradation on basic image classification tasks after SFT. Furthermore, we delve into the specific visual shortcomings of MLLMs. We introduce the MultiModal Visual Patterns (MMVP) benchmark, a carefully curated set of visual question-answering tasks designed to probe the visual grounding capabilities of these models. The results demonstrate systematic failures in state-of-the-art MLLMs, highlighting a strong correlation between weaknesses in the underlying visual encoder (CLIP) and overall model performance. These findings suggest that current SFT approaches, while effective for task-specific adaptation, may not be sufficient for imbuing MLLMs with robust visual understanding and the ability to retain previously acquired knowledge.

Part 2: Enhancing Generalization through Reinforcement Learning

Recognizing the limitations of SFT, this dissertation then explores the potential of reinforcement learning (RL) to achieve more robust and generalizable multimodal intelligence. We propose a novel framework for fine-tuning large vision-language models (VLMs) with RL, enabling end-to-end training on tasks that require both visual understanding and language reasoning. A key component of this framework is the incorporation of chain-of-thought (CoT) prompting, which leverages the inherent reasoning capabilities of VLMs to facilitate more efficient exploration and learning. We conduct a comparative analysis of RL and SFT, focusing on generalization to unseen rule variations and novel visual contexts. The results demonstrate that RL fine-tuning consistently leads to superior generalization compared to SFT. Models trained with RL exhibit improved performance on tasks with modified rules, adapt more effectively to variations in visual input, and even show enhanced underlying visual recognition abilities. Furthermore, we investigate the role of inference-time computation, demonstrating that increasing the number of verification iterations during RL training further improves generalization. This highlights that while SFT provides a necessary foundation for instruction following, RL is crucial for achieving robust, adaptable performance in complex, dynamic environments.

In summary, this dissertation provides compelling evidence for the limitations of current SFT-based training of multimodal foundation models and showcases the significant potential of RL to overcome these limitations, paving the way for more generalizable and intelligent AI systems.}
}

EndNote citation:

%0 Thesis
%A Zhai, Simon
%T Towards Vision-Language Foundation Models: Limitations, Improvements, and Generalization
%I EECS Department, University of California, Berkeley
%D 2025
%8 March 6
%@ UCB/EECS-2025-9
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-9.html
%F Zhai:EECS-2025-9