Counting Counts: Overcoming Counting Challenges in Image Generation using Reinforcement Learning

Shaan Gill

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2024-19
April 24, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-19.pdf

Diffusion models have quickly become the state of the art for high-resolution image synthesis. Through an iterative forward and subsequent reverse diffusion process, they offer a flexible tool for sequentially generating outputs toward downstream objectives. Prior work such as denoising diffusion policy optimization (DDPO) by Black et al. has employed deep reinforcement learning (RL) to fine-tune diffusion models directly for downstream objectives. DDPO succeeds in optimizing text-to-image diffusion models for a variety of objectives but struggles to ensure reliable semantic alignment on certain subsets of tasks. In particular, text-to-image Stable Diffusion models given prompts of the form “N objects” fail to produce images consistent with the expected count, even after DDPO prompt-alignment fine-tuning. We term this the N-objects counting problem: a mismatch between the requested count and the number of objects generated in the image. This research makes progress toward the full class of N-objects problems by focusing on a subset of the form “N [color] balls on white background”. We implemented several vision-informed reward functions and trained models using curriculum learning techniques. Our results demonstrate that fine-tuning Stable Diffusion models under these reward functions improves fidelity to the requested counts. Fine-tuning resolved many issues we observed empirically in the baseline model, including the wide spread of object counts among samples for prompts with N ≤ 5. After fine-tuning, sampled images yielded counts tightly clustered in an approximately normal distribution whose mean closely matched the anticipated count. Our curriculum learning approach improved these results further, with a marked difference on the harder 7-balls case. Furthermore, our solution generalized to N-objects problems beyond the subset of counting tasks we focused on. These results suggest this technique is promising both as a solution to the overarching N-objects problem and for prompt-image alignment in text-to-image diffusion models.
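
The report's reward functions are not reproduced on this page, but the idea lends itself to a short illustration. The following is a minimal sketch, assuming an OpenCV-based blob counter for the “N [color] balls on white background” setting; the names (count_balls, counting_reward, curriculum_n), the threshold values, and the exponential reward shape are illustrative assumptions, not the authors' implementation.

    # Sketch of a vision-informed counting reward of the kind the
    # abstract describes, plus a toy curriculum schedule. Illustrative
    # only; not the report's actual code.
    import numpy as np
    import cv2  # OpenCV, assumed here as the off-the-shelf counter

    def count_balls(image: np.ndarray) -> int:
        """Count non-white blobs in a uint8 RGB image of balls on white."""
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        # Pixels darker than near-white are treated as object pixels.
        _, mask = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)
        num_labels, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        # Label 0 is the background; drop tiny components as noise.
        return sum(1 for i in range(1, num_labels)
                   if stats[i, cv2.CC_STAT_AREA] > 50)

    def counting_reward(image: np.ndarray, target_n: int) -> float:
        """Reward in (0, 1] that decays with the absolute counting error."""
        error = abs(count_balls(image) - target_n)
        return float(np.exp(-error))

    def curriculum_n(step: int, steps_per_stage: int = 1000, max_n: int = 7) -> int:
        """Toy curriculum: start with small counts, unlock larger N over training."""
        return min(1 + step // steps_per_stage, max_n)

In a DDPO-style loop, counting_reward would score each sampled image against the prompted N, and a schedule like curriculum_n would control which counts appear in training prompts, starting small and unlocking larger N as training progresses.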

Advisor: Trevor Darrell


BibTeX citation:

@mastersthesis{Gill:EECS-2024-19,
    Author = {Gill, Shaan},
    Title = {Counting Counts: Overcoming Counting Challenges in Image Generation using Reinforcement Learning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Apr},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-19.html},
    Number = {UCB/EECS-2024-19}
}

EndNote citation:

%0 Thesis
%A Gill, Shaan
%T Counting Counts: Overcoming Counting Challenges in Image Generation using Reinforcement Learning
%I EECS Department, University of California, Berkeley
%D 2024
%8 April 24
%@ UCB/EECS-2024-19
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-19.html
%F Gill:EECS-2024-19