Amandalynne Paullada

PhD Candidate, University of Washington

Artificial Intelligence

Natural Language Processing

Dataset (Dis)contents: A survey of data collection and usage practices in machine learning

Datasets have played a foundational role in the advancement of machine learning research. Increasingly large datasets are used as our primary medium for benchmarking and evaluation in the field. Furthermore, the ways in which we collect, construct and share these datasets inform the kinds of problems we pursue and the methods we explore in algorithm development. However, much recent work has revealed the limitations of predominant practices in dataset collection and use. We survey some of the shortcomings of widely used datasets and data practices, spanning from statistical and representational issues embedded in dataset contents to legal and moral issues with dataset collection and distribution, and advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of machine learning applications.

I am a PhD candidate in the Department of Linguistics at the University of Washington, advised by Prof. Fei Xia. My research is also supervised by Prof. Trevor Cohen in the Department of Biomedical Informatics and Medical Education at UW. My work has involved methods for processing large document collections to augment human capabilities toward insights in biomedicine and sociology. In ongoing work, I am investigating participatory methods for natural language data collection. I also hold bachelor's degrees in linguistics and economics from the University of California, Santa Cruz, as well as a master's degree in computational linguistics from Brandeis University.