Sheng Shen
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2024-186
August 19, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-186.pdf
The rapid advancement of large multimodal models (LMMs) has revolutionized the field of deep learning, enabling sophisticated understanding and generation across modalities such as text, images, and audio. The growing demand for deploying these powerful models in diverse real-world scenarios, ranging from resource-constrained edge devices to large-scale cloud environments, necessitates research into scaling LMMs both down and up. Small models address the need for efficiency, which is particularly important for deploying multimodal systems on edge devices, while large models emphasize the pursuit of scalability: the capability to harness ever-increasing computational power and data to achieve higher accuracy. In this thesis, we explore techniques to scale LMMs both up and down, focusing on three key areas: inference efficiency, training scalability, and enhanced multimodality. We begin by investigating methods such as quantization and pruning to reduce the computational and memory footprint of LMMs, making them suitable for deployment on resource-constrained devices. We then turn to techniques such as Mixture-of-Experts (MoE) and staged scaling that enable the efficient training of LMMs with massive parameter counts, allowing us to leverage large-scale data and computational resources. Adopting a holistic approach, we reevaluate the paradigm of efficient training and inference of LMMs by strategically scaling up for efficient training and then scaling down for optimized inference. Lastly, we examine domain-specific optimizations for LMMs, including improved vision encoders that enable more transferable applications of LMMs, efficient LMM pre-training with knowledge-augmented data, and data-centric LMM alignment with factuality-grounded RLHF.
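The report itself develops these methods in detail. As a rough illustration of the first theme (inference efficiency), the sketch below shows symmetric per-tensor int8 weight quantization, one common way to shrink a model's memory footprint for edge deployment. This is a generic example written for this page, not code from the thesis.

    # Illustrative sketch (not from the report): symmetric per-tensor int8
    # weight quantization with a single float scale per tensor.
    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Map float weights to int8 plus one float scale."""
        scale = np.abs(w).max() / 127.0                      # largest magnitude maps to 127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover an approximation of the original float weights."""
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize_int8(q, s)
    print("max abs error:", np.abs(w - w_hat).max())         # bounded by about scale / 2

For the training-scalability theme, a minimal top-1 Mixture-of-Experts feed-forward layer illustrates the basic idea: parameter count grows with the number of experts while each token is processed by only one expert, so per-token compute stays roughly constant. Again, this is a hedged, generic sketch, not the thesis's implementation.

    # Illustrative sketch (not from the report): a minimal top-1 MoE layer.
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, d_model: int, d_hidden: int, n_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                              nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (tokens, d_model); route each token to its highest-scoring expert
            gate = torch.softmax(self.router(x), dim=-1)     # (tokens, n_experts)
            weight, idx = gate.max(dim=-1)                   # top-1 choice per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
            return out

    moe = Top1MoE(d_model=64, d_hidden=256, n_experts=4)
    tokens = torch.randn(10, 64)
    print(moe(tokens).shape)                                 # torch.Size([10, 64])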
Advisors: Kurt Keutzer and Trevor Darrell
"; ?>
BibTeX citation:
@phdthesis{Shen:EECS-2024-186,
    Author = {Shen, Sheng},
    Editor = {Darrell, Trevor and Keutzer, Kurt},
    Title = {Efficient and Scalable Large Multimodal Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-186.html},
    Number = {UCB/EECS-2024-186},
    Abstract = {The rapid advancement of large multimodal models (LMMs) has revolutionized the field of deep learning, enabling sophisticated understanding and generation across various modalities like text, images, and audio. The increasing demand for deploying these powerful models in diverse real-world scenarios, ranging from resource-constrained edge devices to large-scale cloud environments, necessitates research into scaling LMMs both smaller and larger. Small models cater to the need for efficiency, particularly important for deploying multimodal systems on edge devices, while large models emphasize the pursuit of scalability - the capability to harness ever-increasing computational power and data to achieve higher accuracy. In this thesis, we will explore techniques to scale LMMs both up and down, focusing on three key areas: inference efficiency, training scalability, and enhanced multimodality. We begin by investigating methods like quantization and pruning to reduce the computational and memory footprint of LMMs, making them suitable for deployment on resource-constrained devices. Subsequently, we delve into techniques such as Mixture-of-Experts (MoE) and staged scaling to enable the efficient training of LMMs with massive parameter counts, allowing us to leverage the power of large-scale data and computational resources. Adopting a holistic approach, we reevaluate the paradigm of efficient training and inference of LMMs by strategically scaling up for efficient training and then scaling down for optimized inference. Lastly, we examine domain-specific optimizations for LMMs, including improved vision encoders that enable more transferable applications of LMMs, efficient LMM pre-training with knowledge-augmented data, and data-centric LMM alignment with factuality-grounded RLHF.}
}
EndNote citation:
%0 Thesis
%A Shen, Sheng
%E Darrell, Trevor
%E Keutzer, Kurt
%T Efficient and Scalable Large Multimodal Models
%I EECS Department, University of California, Berkeley
%D 2024
%8 August 19
%@ UCB/EECS-2024-186
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-186.html
%F Shen:EECS-2024-186