Rising Stars 2020:

Amandalynne Paullada

PhD Candidate

University of Washington

Areas of Interest

  • Artificial Intelligence
  • Natural Language Processing


Dataset (Dis)contents: A survey of data collection and usage practices in machine learning


Datasets have played a foundational role in the advancement of machine learning research. Increasingly large datasets are used as our primary medium for benchmarking and evaluation in the field. Furthermore, the ways in which we collect, construct and share these datasets inform the kinds of problems we pursue and the methods we explore in algorithm development. However, much recent work has revealed the limitations of predominant practices in dataset collection and use. We survey some of the shortcomings of widely used datasets and data practices, spanning from statistical and representational issues embedded in dataset contents to legal and moral issues with dataset collection and distribution, and advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of machine learning applications.


I am a PhD candidate in the Department of Linguistics at the University of Washington, advised by Prof. Fei Xia. My research is also supervised by Prof. Trevor Cohen in the Department of Biomedical Informatics and Medical Education at UW. My work has involved methods for processing large document collections to help people draw insights in biomedicine and sociology. In ongoing work, I am investigating participatory methods for natural language data collection. I hold bachelor's degrees in linguistics and economics from the University of California, Santa Cruz, and a master's degree in computational linguistics from Brandeis University.
