Re-examining Metrics for Success in Machine Learning, from Fairness and Interpretability to Protein Design
Frances Ding
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-156
August 4, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-156.pdf
Quantitative metrics, along with the datasets used to assess them, are key ingredients that have fueled rapid progress in machine learning (ML) in recent years. These metrics, datasets, and benchmarks define priorities and facilitate efficient discovery of model designs that make progress on those priorities. Ideally, metrics track real-world goals, so that improvement on them translates to improvement on related, real tasks. Creating metrics that achieve this external validity is an ever-present challenge in ML. The science of metrics is thus an iterative one: identifying and resolving one issue allows other, more subtle ones to become apparent.
In this thesis, we describe a series of works that highlight limitations in metrics across different subfields of ML and design new metrics to fill these gaps. We first examine the representation similarity metrics used in the interpretability subfield to compare neural network representations. We show that currently popular metrics often disagree on fundamental observations, making it unclear which one to believe. We develop practical, statistically grounded tests to evaluate these metrics and find different weaknesses in each. We next examine metrics and benchmarks for fair classification. We highlight idiosyncrasies in the popular UCI Adult dataset that limit its external validity, and we contribute a suite of new datasets derived from US Census surveys that extend the existing data ecosystem for research on fair machine learning. Finally, we examine the subfield of protein modeling with ML. We develop metrics to quantify a novel type of bias present in popular protein language models: bias towards sequences from certain evolutionary taxa. We additionally introduce a method to mitigate this bias. Across these works in diverse subfields, we demonstrate the challenges and opportunities in developing metrics that advance technical capabilities in alignment with real-world needs.
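To make the first line of work concrete: one widely used family of representation similarity metrics examined in this research is centered kernel alignment (CKA). The following is a minimal illustrative sketch of linear CKA in NumPy, not code from the thesis itself; X and Y stand for activation matrices (examples by neurons) from two networks.

import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representations.

    X, Y: arrays of shape (n_examples, n_features); feature counts may differ.
    Returns a similarity score between 0 and 1.
    """
    # Center each feature dimension across examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return numerator / denominator

# Example: linear CKA is invariant to orthogonal transformations, so comparing
# a representation against a random rotation of itself prints 1.0 (up to
# floating-point error).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
print(linear_cka(X, X @ Q))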
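The Census-derived dataset suite described in the second line of work was released as the open-source folktables package. A typical usage sketch, following the package's public README (exact field names and defaults may vary across versions):

from folktables import ACSDataSource, ACSIncome

# Download one year of American Community Survey person records for California.
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

# ACSIncome defines a prediction task (does income exceed $50k?) that mirrors
# UCI Adult, along with a group attribute for fairness evaluation.
features, label, group = ACSIncome.df_to_numpy(acs_data)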
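For the third line of work, the general evaluation pattern can be illustrated with a purely hypothetical sketch: score sequences under a protein language model, group the scores by evolutionary taxon, and look for systematic gaps. The scoring function below is a stub standing in for a real model, not the thesis's method.

import statistics

def score_sequence(seq: str) -> float:
    # Stub: a real implementation would return the model's average
    # per-residue log-likelihood for `seq` (e.g., from a masked LM).
    return -0.1 * len(set(seq))  # placeholder only

# Hypothetical example sequences, grouped by taxon.
sequences_by_taxon = {
    "Bacteria": ["MKTAYIAKQR", "MKKLLPTAAA"],
    "Eukaryota": ["MSDNGPQNQR", "MEEPQSDPSV"],
}

# A simple bias summary: mean model score per taxon. Large gaps suggest the
# model systematically prefers sequences from some taxa over others.
for taxon, seqs in sequences_by_taxon.items():
    print(taxon, statistics.mean(score_sequence(s) for s in seqs))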
Advisors: Moritz Hardt and Jacob Steinhardt
BibTeX citation:
@phdthesis{Ding:EECS-2024-156,
    Author = {Ding, Frances},
    Title = {Re-examining Metrics for Success in Machine Learning, from Fairness and Interpretability to Protein Design},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Aug},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-156.html},
    Number = {UCB/EECS-2024-156}
}
EndNote citation:
%0 Thesis
%A Ding, Frances
%T Re-examining Metrics for Success in Machine Learning, from Fairness and Interpretability to Protein Design
%I EECS Department, University of California, Berkeley
%D 2024
%8 August 4
%@ UCB/EECS-2024-156
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-156.html
%F Ding:EECS-2024-156