Medhini Gulganjalli Narasimhan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-206

August 10, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-206.pdf

The internet hosts an immense reservoir of videos, with thousands of new uploads reaching platforms like YouTube every second. These videos form a rich repository of multimodal information and an invaluable resource for studying audio-visual-text relationships. Moreover, understanding the content of long videos (on the order of two hours) remains an open problem. This thesis investigates the interplay among the audio, visual, and textual modalities in video and harnesses it to capture semantic nuances in long videos. My research explores strategies for combining information from these modalities, leading to significant advances in video summarization and instructional video analysis.

The first part introduces an approach for synthesizing long video textures from short clips by coherently rearranging their segments, conditioned on audio. The second part presents a technique for generating concise visual summaries of long videos guided by natural language cues. We then focus on summarizing instructional videos, exploiting audio-visual alignment and task structure to produce informative summaries. To further deepen the understanding of instructional videos, the thesis introduces an approach for learning and verifying procedural steps in instructional content, enabling the model to handle long, complex video sequences and to check procedural correctness. Lastly, we explore the use of large language models to answer questions about images through code generation. Comprehensive experiments demonstrate the efficacy of the proposed methods and point to promising future directions for understanding semantics in long videos by integrating audio, visual, and textual relationships.

Advisor: Trevor Darrell


BibTeX citation:

@phdthesis{Gulganjalli Narasimhan:EECS-2023-206,
    Author= {Gulganjalli Narasimhan, Medhini},
    Title= {Multimodal Long-Term Video Understanding},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-206.html},
    Number= {UCB/EECS-2023-206},
}

EndNote citation:

%0 Thesis
%A Gulganjalli Narasimhan, Medhini 
%T Multimodal Long-Term Video Understanding
%I EECS Department, University of California, Berkeley
%D 2023
%8 August 10
%@ UCB/EECS-2023-206
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-206.html
%F Gulganjalli Narasimhan:EECS-2023-206