Improving Output in Generative Models
Tarun Amarnath
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-171
August 9, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-171.pdf
Generative models have proven widely successful in creating language and image responses to prompts. These techniques have been extended to video creation, robotics, and personal assistants, among other use cases. However, as their pervasiveness grows, a major issue arises: they have no guarantees of correctness. The networks themselves are black boxes, making them difficult to interpret. This work instead aims to improve their output in the pre- and post-processing stages, increasing overall accuracy. The problem is approached through three projects: Drone Diffuser (a diffusion-based path planner for drones), RAG Alignment (checking for factual generation of text data in Retrieval-Augmented Generation systems), and IMAGINE (improving output of RAG-based image retrieval).
<b>Drone Diffuser</b> Diffusion models have been successful in image and video generation, leveraging massive amounts of data to achieve remarkable results. Recently, these models have been adapted for the robotics domain, demonstrating advantages such as better performance in long-horizon contexts and more stable training processes. This research extends the application of diffusion models to aerial vehicles. The task is to generate high-level path plans to a goal position denoted by a gate, akin to racing scenarios, in a receding-horizon manner. Two policies are trained: the first utilizes state information as conditioning to characterize the goal, while the second directly uses FPV images from a drone. These policies mimic a privileged expert that employs RRT* to generate near-optimal paths. The reverse diffusion process offers no guarantees on its output. As a result, both the training data given for imitation and the output from the policy are fit to a polynomial trajectory using minimum snap optimization to ensure dynamic feasibility for quadrotors. The state-based policy performs exceptionally well, achieving 100% success on every plan in the testing set, whereas the image-based policy requires further refinement. Future work can focus on translating these findings to real-world systems.
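The post-processing step above — fitting a policy's raw waypoints to a smooth polynomial trajectory — can be sketched as follows. This is an illustrative simplification, not the thesis code: a true minimum snap optimizer minimizes the integrated fourth derivative subject to waypoint and derivative constraints, whereas this sketch uses a plain per-axis least-squares polynomial fit (`np.polyfit`) as a stand-in to show the overall shape of the pipeline.

```python
# Illustrative sketch: smoothing a diffusion policy's raw waypoint output by
# fitting an independent polynomial x(t), y(t), z(t) through the waypoints.
# np.polyfit is a simplified stand-in for a minimum snap optimizer.
import numpy as np

def fit_polynomial_trajectory(waypoints, degree=7):
    """Fit a polynomial trajectory through noisy waypoints.

    waypoints: (N, 3) array of positions sampled from the policy.
    Returns traj(t) -> (3,) position for t in [0, 1].
    """
    waypoints = np.asarray(waypoints, dtype=float)
    n = len(waypoints)
    t = np.linspace(0.0, 1.0, n)              # uniform time allocation
    deg = min(degree, n - 1)                  # avoid over-parameterization
    coeffs = [np.polyfit(t, waypoints[:, k], deg) for k in range(3)]
    return lambda tau: np.array([np.polyval(c, tau) for c in coeffs])

# Example: a jagged 10-waypoint plan toward a gate at (1, 1, 1).
rng = np.random.default_rng(0)
raw = np.linspace([0, 0, 0], [1, 1, 1], 10) + 0.01 * rng.normal(size=(10, 3))
traj = fit_polynomial_trajectory(raw)
start, end = traj(0.0), traj(1.0)             # smoothed endpoints
```

In the actual minimum snap formulation, smoothness of the fourth derivative is what guarantees dynamic feasibility for a quadrotor, since snap maps to motor commands through differential flatness.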
<b>RAG Alignment</b> To better align Retrieval-Augmented Generation (RAG) systems with intended behaviors and factual consistency, we propose the notion of Fact-Bearing Terms (FBTs): the terms in a sentence upon which its factuality rests, such as proper nouns or direct objects. Applying this notion of FBTs, we demonstrate significant performance improvements over both encoding- and part-of-speech-based approaches to text retrieval in RAG systems, using sources from a custom dataset created for this task outside of the training distribution of most large-scale models. In addition, we use this metric to demonstrate a visible boost in performance on HyDE (hypothetical document embeddings)-based retrieval after fine-tuning the HyDE model. We show several practical applications of Fact-Bearing Terms, such as warning users of higher risks of hallucination or citing sources.
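The FBT idea above can be sketched with a toy implementation. Everything here is a hypothetical illustration, not the thesis method: the crude capitalization/digit heuristic stands in for a real part-of-speech tagger, and `fbt_support` shows how one might score whether a retrieved passage preserves the terms a claim's factuality rests on — the kind of signal that could drive a hallucination warning.

```python
# Hypothetical sketch of Fact-Bearing Terms (FBTs): extract candidate
# fact-bearing terms from a claim, then score a retrieved passage by how
# many of those terms it preserves. The heuristics are illustrative
# stand-ins for a real POS tagger.
import re

def extract_fbts(sentence):
    """Return the set of candidate fact-bearing terms in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    fbts = set()
    for i, tok in enumerate(tokens):
        if i > 0 and tok[0].isupper():          # mid-sentence capital ~ proper noun
            fbts.add(tok.lower())
        elif any(ch.isdigit() for ch in tok):   # numbers are fact-bearing
            fbts.add(tok.lower())
    return fbts

def fbt_support(claim, passage):
    """Fraction of the claim's FBTs that appear in the passage."""
    claim_fbts = extract_fbts(claim)
    if not claim_fbts:
        return 1.0                              # nothing factual to verify
    passage_words = {w.lower() for w in re.findall(r"[A-Za-z0-9']+", passage)}
    return len(claim_fbts & passage_words) / len(claim_fbts)

claim = "The report was published by Berkeley in 2024."
supported = fbt_support(claim, "Berkeley released the 2024 technical report.")
unsupported = fbt_support(claim, "The report discusses diffusion models.")
```

A low `fbt_support` score on a generated sentence would suggest its fact-bearing terms are not grounded in any retrieved source — a natural trigger for flagging the output or demanding a citation.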
<b>IMAGINE</b> Within the language domain, HyDE and other techniques that prompt the model to retrieve with a guess or estimate of a desired output demonstrate substantial improvements over standard RAG. Selecting the most relevant images for an abstract user input can be a far more challenging task because there can be a large semantic gap between the user's query and desired output: image similarity may not strongly correlate with semantics for many user contexts. To better align image RAG systems with intended behaviors, we propose the IMAGINE system, which creates hypothetical outputs and performs retrieval based on those responses. Depending on the user context and resources available, we present IMAGINE-T (text-based) and IMAGINE-I (diffusion-based) multimodal retrieval mechanisms. We present the CameraRollQA benchmark based on real-world images to evaluate both IMAGINE solutions and demonstrate improved performance over CLIP RAG baselines.
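The hypothetical-output retrieval pattern described above can be sketched as follows. This is an assumed minimal illustration, not the IMAGINE implementation: the bag-of-words `embed` function stands in for a learned encoder such as CLIP's text tower, and `generate_hypothetical` stands in for an LLM that drafts a plausible caption for the abstract query before retrieval.

```python
# Illustrative sketch of HyDE-style retrieval: embed a hypothetical answer
# to the query rather than the query itself, then retrieve by similarity.
# embed() and generate_hypothetical() are toy stand-ins for a real encoder
# and a real LLM.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (stand-in for a learned text encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_hypothetical(query):
    """Stand-in for an LLM drafting a plausible caption for the query."""
    return "a birthday cake with candles on a table at a party"

def retrieve(query, captions, use_hypothetical=True):
    probe = generate_hypothetical(query) if use_hypothetical else query
    probe_vec = embed(probe)
    return max(captions, key=lambda c: cosine(probe_vec, embed(c)))

captions = [
    "a dog running on the beach",
    "a birthday cake with lit candles",
    "a mountain at sunset",
]
query = "photos from my celebration last year"
best = retrieve(query, captions)
```

The abstract query shares almost no vocabulary with the correct caption, but the hypothetical caption does — which is the semantic gap the IMAGINE-T mechanism is designed to bridge; IMAGINE-I would instead synthesize a hypothetical image with a diffusion model and match in image-embedding space.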
Advisor: S. Shankar Sastry
BibTeX citation:
@mastersthesis{Amarnath:EECS-2024-171,
  Author = {Amarnath, Tarun},
  Editor = {Sastry, S. Shankar},
  Title = {Improving Output in Generative Models},
  School = {EECS Department, University of California, Berkeley},
  Year = {2024},
  Month = {Aug},
  Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-171.html},
  Number = {UCB/EECS-2024-171}
}
EndNote citation:
%0 Thesis
%A Amarnath, Tarun
%E Sastry, S. Shankar
%T Improving Output in Generative Models
%I EECS Department, University of California, Berkeley
%D 2024
%8 August 9
%@ UCB/EECS-2024-171
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-171.html
%F Amarnath:EECS-2024-171