Jerome Quenum

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-153

August 12, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-153.pdf

Active perception driven by representation learning lies at the intersection of computer vision and robotics, powered by advanced Artificial Intelligence (AI) algorithms. Its goal is to develop systems that actively gather information from their environment, learning useful representations that enhance both perception and decision-making. This dissertation explores several facets of this problem, focusing on how representation-learning techniques improve the performance and utility of such systems.

We begin by examining the role of segmentation precision in AI-driven systems using Ultra-High-Resolution (UHR) images in the context of battery design. A novel transformer-based network, TransforCNN, is introduced to segment dendrites in Lithium Metal Battery (LMB) 3D X-ray computed tomography (XCT) volumes. This approach improves segmentation accuracy of critical structures, which is essential for designing more efficient and reliable batteries. The precision gained through our proposed model not only enhances visual understanding of battery components but also supports more effective design strategies.
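
The abstract does not detail the architecture, so purely as an illustration of how a transformer encoder can be paired with a convolutional decoder for dense segmentation of XCT slices, here is a minimal PyTorch sketch; it is a generic stand-in with assumed shapes and hyperparameters, not the actual TransforCNN design.

    import torch
    import torch.nn as nn

    class TransformerSegSketch(nn.Module):
        """Generic transformer-encoder / convolutional-decoder segmenter for 2D
        slices of a 3D XCT volume. An illustrative stand-in, not TransforCNN."""

        def __init__(self, patch: int = 16, dim: int = 256, depth: int = 6, classes: int = 2):
            super().__init__()
            self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)   # patchify the slice
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)                # global context over patches
            self.decode = nn.Sequential(                                      # project back to a mask
                nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
                nn.Conv2d(dim, classes, 1),
            )

        def forward(self, slice_2d: torch.Tensor) -> torch.Tensor:
            # slice_2d: (batch, 1, H, W) grayscale XCT slice, H and W divisible by the patch size
            tok = self.embed(slice_2d)                           # (B, dim, H/p, W/p)
            b, d, h, w = tok.shape
            tok = self.encoder(tok.flatten(2).transpose(1, 2))   # (B, h*w, dim)
            tok = tok.transpose(1, 2).reshape(b, d, h, w)
            return self.decode(tok)                              # (B, classes, H, W) logits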

Building on this, we address the challenge of processing UHR images for applications requiring both speed and accuracy. In barcode detection, we propose a new pipeline integrating a modified Region Proposal Network (RPN) with a segmentation network named Y-Net. This demonstrates how representation learning can reduce latency while maintaining high accuracy, showcasing AI’s ability to handle complex visual tasks in real time.
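
To make the speed/accuracy trade-off concrete, the sketch below shows one plausible two-stage arrangement: cheap region proposals on a downsampled copy of the UHR image, followed by segmentation of only the proposed regions at full resolution. The propose and segment callables are hypothetical stand-ins for the modified RPN and Y-Net, not their actual interfaces.

    import numpy as np
    from typing import Callable, List, Tuple

    Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in full-resolution pixel coordinates

    def detect_barcodes(
        uhr_image: np.ndarray,                        # (H, W, 3) ultra-high-resolution image
        propose: Callable[[np.ndarray], List[Box]],   # stand-in for the modified RPN
        segment: Callable[[np.ndarray], np.ndarray],  # stand-in for Y-Net: crop -> binary mask
        downscale: int = 8,
    ) -> np.ndarray:
        """Two-stage sketch: propose on a downsampled view, segment full-resolution crops."""
        h, w = uhr_image.shape[:2]
        # Stage 1: proposals on a cheap, naively subsampled copy of the image.
        small = uhr_image[::downscale, ::downscale]
        boxes = [(x0 * downscale, y0 * downscale, x1 * downscale, y1 * downscale)
                 for (x0, y0, x1, y1) in propose(small)]
        # Stage 2: run the segmentation network only inside the proposed full-resolution crops.
        full_mask = np.zeros((h, w), dtype=bool)
        for (x0, y0, x1, y1) in boxes:
            x1, y1 = min(x1, w), min(y1, h)           # clip boxes to the image bounds
            crop = uhr_image[y0:y1, x0:x1]
            full_mask[y0:y1, x0:x1] |= segment(crop).astype(bool)
        return full_mask

Because, in this arrangement, the segmentation network never sees the full UHR frame, latency scales with the number and size of proposals rather than with the raw image resolution.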

We then examine scalability in using representation learning for active perception in robotics, where speed, accuracy, adaptability, and robustness are key. Using the Open X-Embodiment (OXE) dataset, we explore cross-robot self-supervised sensorimotor pre-training through RPTx, a multimodal model that uses neither language instructions nor autoregressive methods. This approach provides insights into robustness and adaptability challenges: we conduct reliability assessments, examine cases of negative transfer, and outline future directions. We also find that multimodal models that pair large language models with instruction tuning and autoregressive prediction, such as LLARVA, a vision-action instruction-tuning paradigm, yield more stable and promising results. These findings show how effective representation learning enhances robotic systems' ability to interpret and respond to complex environmental cues.
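
As a rough illustration of what self-supervised, non-autoregressive sensorimotor pre-training looks like, the sketch below masks a random subset of interleaved sensorimotor tokens and trains a bidirectional transformer to reconstruct them. It is a generic RPT-style stand-in with assumed token shapes, not the RPTx model itself.

    import torch
    import torch.nn as nn

    class MaskedSensorimotorSketch(nn.Module):
        """Masked-prediction objective over interleaved sensorimotor tokens
        (vision, proprioception, actions). A simplified stand-in, not RPTx."""

        def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.mask_token = nn.Parameter(torch.zeros(dim))
            self.head = nn.Linear(dim, dim)            # reconstruct the original token embedding

        def forward(self, tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
            # tokens: (batch, seq, dim) embeddings of a sensorimotor trajectory
            mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
            corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
            pred = self.head(self.encoder(corrupted))
            # Loss only on masked positions; no language tokens, no autoregressive factorization.
            return ((pred - tokens)[mask] ** 2).mean()

A call such as MaskedSensorimotorSketch()(torch.randn(2, 64, 256)) returns the scalar reconstruction loss for a batch of two 64-token trajectories; cross-robot pre-training then amounts to drawing those trajectories from many embodiments, as in OXE.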

Finally, we explore the challenge of extending segmentation beyond fixed object categories to support reasoning over visual scenes. Current models perform well on clearly defined objects but struggle with complex queries referencing multiple or implicit objects. Recent work in reasoning segmentation, which generates masks from natural language input, shows that vision-language models (VLMs) can help. However, our experiments reveal that existing models perform poorly on remote-sensing images, which are dense and varied. To address this, we introduce LISAt, a VLM designed to describe remote-sensing images, answer questions, and segment objects from user queries. LISAt is trained on GRES, a dataset with 27,615 annotations across 9,205 images, and PreGRES, a multimodal dataset with over one million QA pairs. LISAt outperforms geospatial models like RS-GPT4V by 10.04% in BLEU-4 for image description and improves on open-domain models in reasoning segmentation by 143.36% in generalized Intersection over Union (gIoU). These results demonstrate how strong multimodal representations improve scene understanding, particularly in the geospatial domain.
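
For reference, in the reasoning-segmentation literature gIoU is typically computed as the mean of per-example mask IoUs, as opposed to cIoU, which pools intersections and unions over the whole evaluation set. A minimal NumPy sketch under that per-example-average definition:

    import numpy as np

    def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
        """IoU between two binary segmentation masks of the same shape."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:        # both masks empty: treat as a perfect match
            return 1.0
        return float(np.logical_and(pred, gt).sum() / union)

    def g_iou(preds, gts) -> float:
        """gIoU: mean of per-example IoUs over the evaluation set."""
        return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))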

Advisors: Jitendra Malik and Trevor Darrell


BibTeX citation:

@phdthesis{Quenum:EECS-2025-153,
    Author= {Quenum, Jerome},
    Title= {Representation Learning for Active Perception},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-153.html},
    Number= {UCB/EECS-2025-153},
}

EndNote citation:

%0 Thesis
%A Quenum, Jerome 
%T Representation Learning for Active Perception
%I EECS Department, University of California, Berkeley
%D 2025
%8 August 12
%@ UCB/EECS-2025-153
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-153.html
%F Quenum:EECS-2025-153