Efficient and Scalable Large Multimodal Models

Sheng Shen

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2024-186
August 19, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-186.pdf

The rapid advancement of large multimodal models (LMMs) has revolutionized the field of deep learning, enabling sophisticated understanding and generation across various modalities like text, images, and audio. The increasing demand for deploying these powerful models in diverse real-world scenarios, ranging from resource-constrained edge devices to large-scale cloud environments, necessitates research into scaling LMMs both down and up. Small models cater to the need for efficiency, which is particularly important for deploying multimodal systems on edge devices, while large models emphasize the pursuit of scalability: the capability to harness ever-increasing computational power and data to achieve higher accuracy. In this thesis, we explore techniques to scale LMMs both up and down, focusing on three key areas: inference efficiency, training scalability, and enhanced multimodality. We begin by investigating methods like quantization and pruning to reduce the computational and memory footprint of LMMs, making them suitable for deployment on resource-constrained devices. Subsequently, we delve into techniques such as Mixture-of-Experts (MoE) and staged scaling to enable the efficient training of LMMs with massive parameter counts, allowing us to leverage the power of large-scale data and computational resources. Adopting a holistic approach, we reevaluate the paradigm of efficient training and inference of LMMs by strategically scaling up for efficient training and then scaling down for optimized inference. Lastly, we examine domain-specific optimizations for LMMs, including improved vision encoders that enable more transferable applications of LMMs, efficient LMM pre-training with knowledge-augmented data, and data-centric LMM alignment with factuality-grounded RLHF.
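For readers unfamiliar with the compression techniques named in the abstract, the following is a minimal, illustrative sketch of symmetric per-channel int8 weight quantization in NumPy. It shows only the generic idea of mapping float weights onto an integer grid with one scale per output row; it is not the specific quantization method developed in the thesis, and the function names (quantize_int8, dequantize) are hypothetical.

import numpy as np

def quantize_int8(w: np.ndarray):
    """Quantize a 2-D float weight matrix to int8, one scale per output row."""
    # The per-row maximum magnitude sets the scale so values map into [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from int8 values and scales."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize(q, s)
    print("max abs reconstruction error:", np.abs(w - w_hat).max())

In practice, quantizing activations, choosing calibration data, and handling outliers add further steps; this sketch captures only the core rounding-and-scaling idea behind low-bit inference.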

Advisors: Kurt Keutzer and Trevor Darrell

\"Edit"; ?>


BibTeX citation:

@phdthesis{Shen:EECS-2024-186,
    Author = {Shen, Sheng},
    Editor = {Darrell, Trevor and Keutzer, Kurt},
    Title = {Efficient and Scalable Large Multimodal Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-186.html},
    Number = {UCB/EECS-2024-186},
    Abstract = {The rapid advancement of large multimodal models (LMMs) has revolutionized the field of
deep learning, enabling sophisticated understanding and generation across various modalities
like text, images, and audio. The increasing demand for deploying these powerful models in
diverse real-world scenarios, ranging from resource-constrained edge devices to large-scale
cloud environments, necessitates research into scaling LMMs both down and up.
Small models cater to the need for efficiency, which is particularly important for deploying
multimodal systems on edge devices, while large models emphasize the pursuit of scalability:
the capability to harness ever-increasing computational power and data to achieve higher accuracy.
In this thesis, we explore techniques to scale LMMs both up and down, focusing on three
key areas: inference efficiency, training scalability, and enhanced multimodality. We begin by
investigating methods like quantization and pruning to reduce the computational and memory
footprint of LMMs, making them suitable for deployment on resource-constrained devices.
Subsequently, we delve into techniques such as Mixture-of-Experts (MoE) and staged scaling
to enable the efficient training of LMMs with massive parameter counts, allowing us to leverage
the power of large-scale data and computational resources. Adopting a holistic approach, we
reevaluate the paradigm of efficient training and inference of LMMs by strategically scaling
up for efficient training and then scaling down for optimized inference. Lastly, we examine
domain-specific optimizations for LMMs, including improved vision encoders that enable more
transferable applications of LMMs, efficient LMM pre-training with knowledge-augmented
data, and data-centric LMM alignment with factuality-grounded RLHF.}
}

EndNote citation:

%0 Thesis
%A Shen, Sheng
%E Darrell, Trevor
%E Keutzer, Kurt
%T Efficient and Scalable Large Multimodal Models
%I EECS Department, University of California, Berkeley
%D 2024
%8 August 19
%@ UCB/EECS-2024-186
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-186.html
%F Shen:EECS-2024-186