Vedaad Shakib and Spencer Whitehead and Suzanne Petryk

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-137

May 18, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-137.pdf

Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say “I don’t know” when they are uncertain (i.e., abstain from answering a question), this ability has been largely neglected in multimodal research, despite its importance for deploying VQA in real-world settings. In this work, we promote a problem formulation for reliable VQA, in which we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models and analyze both their coverage, the portion of questions answered, and their risk, the error on that portion, exploring several abstention approaches. We find that although the best-performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a model’s softmax scores limits them to answering less than 8% of the questions in order to achieve a low risk of error (i.e., 1%). This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage, for example, from 5.0% to 16.7% at 1% risk. We also explore probabilistic calibration of VQA models, which improves coverage, but less so than a multimodal selection function. While it is important to analyze both coverage and risk, these metrics trade off against each other, which makes comparing VQA models challenging. To address this, we also propose an Effective Reliability metric for VQA that places a larger cost on incorrect answers than on abstentions. This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don’t know the answer.
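The coverage, risk, and cost-weighted reliability quantities described above can be sketched in a few lines. This is an illustrative implementation only, not the report's actual code: the function names and the scoring scheme (+1 for a correct answer, −cost for an incorrect one, 0 for an abstention, averaged over all questions) are assumptions drawn from the abstract's description.

```python
# Illustrative sketch of selective-prediction metrics for reliable VQA.
# `answered[i]` is True if the model answered question i (False = abstained);
# `correct[i]` is True if the model's answer to question i was correct.

def coverage(answered):
    """Fraction of questions the model chooses to answer."""
    return sum(answered) / len(answered)

def risk(answered, correct):
    """Error rate on the answered portion (0.0 if nothing is answered)."""
    n_answered = sum(answered)
    if n_answered == 0:
        return 0.0
    errors = sum(1 for a, c in zip(answered, correct) if a and not c)
    return errors / n_answered

def effective_reliability(answered, correct, cost=1.0):
    """Assumed per-question scoring: +1 for a correct answer, -cost for an
    incorrect answer, 0 for an abstention; averaged over all questions."""
    total = 0.0
    for a, c in zip(answered, correct):
        if a:
            total += 1.0 if c else -cost
    return total / len(answered)
```

For example, a model that answers 3 of 4 questions, getting 2 right, has coverage 0.75, risk 1/3, and (with cost 1) an effective-reliability score of (1 − 1 + 0 + 1) / 4 = 0.25; raising the cost of wrong answers pushes this score down, rewarding models that abstain when unsure.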

Advisor: Joseph Gonzalez


BibTeX citation:

@mastersthesis{Shakib:EECS-2022-137,
    Author= {Shakib, Vedaad and Whitehead, Spencer and Petryk, Suzanne},
    Editor= {Gonzalez, Joseph and Darrell, Trevor and Rohrbach, Anna and Rohrbach, Marcus},
    Title= {Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-137.html},
    Number= {UCB/EECS-2022-137},
    Abstract= {Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say “I don’t know” when they are uncertain (i.e., abstain from answering a question), this ability has been largely neglected in multimodal research, despite its importance for deploying VQA in real-world settings. In this work, we promote a problem formulation for reliable VQA, in which we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models and analyze both their coverage, the portion of questions answered, and their risk, the error on that portion, exploring several abstention approaches. We find that although the best-performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a model’s softmax scores limits them to answering less than 8% of the questions in order to achieve a low risk of error (i.e., 1%). This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage, for example, from 5.0% to 16.7% at 1% risk. We also explore probabilistic calibration of VQA models, which improves coverage, but less so than a multimodal selection function. While it is important to analyze both coverage and risk, these metrics trade off against each other, which makes comparing VQA models challenging. To address this, we also propose an Effective Reliability metric for VQA that places a larger cost on incorrect answers than on abstentions. This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don’t know the answer.},
}

EndNote citation:

%0 Thesis
%A Shakib, Vedaad 
%A Whitehead, Spencer 
%A Petryk, Suzanne 
%E Gonzalez, Joseph 
%E Darrell, Trevor 
%E Rohrbach, Anna 
%E Rohrbach, Marcus 
%T Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 18
%@ UCB/EECS-2022-137
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-137.html
%F Shakib:EECS-2022-137