Building Language Technologies for Low-Resourced Languages
Hellina Hailu Nigatu
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-218
December 19, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-218.pdf
The majority of the world’s languages are categorized as understudied and underserved in Natural Language Processing research. Collectively, these languages are termed “low-resource languages.” As the languages themselves are diverse, so are the cultural, political, and social contexts in which they are spoken. Most modern language technologies—e.g., machine translation, speech recognition—either exclude low-resourced languages entirely or perform poorly with low-resourced language data.
In this thesis, I argue that language technologies are parts of complex socio-technical systems, not isolated tools. I lay out my thesis in two parts: Part I presents work that evaluates existing language technologies in the social contexts in which they are deployed. Findings from Part I illustrate how the diverse social contexts in which languages are spoken affect how well our tools perform. Part II presents work that integrates the social contexts of low-resourced languages in how we design language technologies. The findings from Part II demonstrate that by designing language technologies as parts of a larger socio-technical system, we can 1) improve performance, 2) map novel design spaces, and 3) build tools that match users’ existing practices instead of requiring them to adapt to new forms of interaction. Based on the findings and insights from both Part I and Part II, I present design recommendations for NLP and HCI researchers interested in building language technologies for low-resourced languages.
Advisors: John F. Canny and Sarah Chasins
BibTeX citation:
@phdthesis{Nigatu:EECS-2025-218,
Author= {Nigatu, Hellina Hailu},
Title= {Building Language Technologies for Low-Resourced Languages},
School= {EECS Department, University of California, Berkeley},
Year= {2025},
Month= {Dec},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-218.html},
Number= {UCB/EECS-2025-218},
Abstract= {The majority of the world’s languages are categorized as understudied and underserved in Natural Language Processing research. Collectively, these languages are termed “low-resource languages.” As the languages themselves are diverse, so are the cultural, political, and social contexts in which they are spoken. Most modern language technologies—e.g., machine translation, speech recognition—either exclude low-resourced languages entirely or perform poorly with low-resourced language data.
In this thesis, I argue that language technologies are parts of complex socio-technical systems, not isolated tools. I lay out my thesis in two parts: Part I presents work that evaluates existing language technologies in the social contexts in which they are deployed. Findings from Part I illustrate how the diverse social contexts in which languages are spoken affect how well our tools perform. Part II presents work that integrates the social contexts of low-resourced languages in how we design language technologies. The findings from Part II demonstrate that by designing language technologies as parts of a larger socio-technical system, we can 1) improve performance, 2) map novel design spaces, and 3) build tools that match users’ existing practices instead of requiring them to adapt to new forms of interaction. Based on the findings and insights from both Part I and Part II, I present design recommendations for NLP and HCI researchers interested in building language technologies for low-resourced languages.},
}
EndNote citation:
%0 Thesis
%A Nigatu, Hellina Hailu
%T Building Language Technologies for Low-Resourced Languages
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 19
%@ UCB/EECS-2025-218
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-218.html
%F Nigatu:EECS-2025-218