Rising Stars 2020: Huda Khayrallah

PhD Candidate

Johns Hopkins University


Areas of Interest

  • Artificial Intelligence

Poster

Simulated Multiple Reference Training: Leveraging Paraphrases for Machine Translation

Abstract

Machine translation (MT) automatically translates text from one language to another and has the potential to reduce language barriers by improving communication and information access. However, for this to become a reality, MT must be effective for all languages and styles of text. Like most deep learning algorithms, neural machine translation (NMT) is sensitive to the quantity and quality of the training data. NMT training data typically comes in the form of parallel text---sentences translated between the two languages of interest. Limited quantities of parallel text are available for most language pairs, leading to a low-resource problem. Even when training data is available in the desired language pair, it is frequently formal speech or news---leading to a domain shift when models are used to translate a different type of data, such as social media or medical text. NMT currently performs poorly under domain shift and in low-resource settings; my work aims to overcome these limitations. I will discuss in detail one line of work for low-resource settings.

Many valid translations exist for a given sentence, yet deep learning models for machine translation (MT) are trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a pre-trained paraphraser and training the MT model to predict the paraphraser's distribution over possible words. A high-quality paraphraser---which takes a sentence as input and outputs a paraphrase in the same language---can be trained as long as the target language we would like to translate into is sufficiently high-resource. We demonstrate that this data-augmentation method is effective for low-resource machine translation, and we also apply it to training chatbots, where we find it produces better, more diverse responses than standard single-reference training.
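To make the objective concrete, here is a minimal PyTorch-style sketch of the idea, not the released implementation: the `paraphraser.sample_with_dist` method and the `mt_model(src, tgt)` call are hypothetical interfaces standing in for a pre-trained paraphraser and a sequence-to-sequence MT model.

```python
import torch
import torch.nn.functional as F

def smrt_step(mt_model, paraphraser, src, ref, sample_prob=0.5):
    """One training step sketching SMRT (assumed interfaces, see above).

    With probability `sample_prob`, replace the single reference with a
    paraphrase sampled from a pre-trained paraphraser and train the MT
    model to match the paraphraser's full distribution over next words;
    otherwise fall back to standard single-reference training.
    """
    if torch.rand(1).item() < sample_prob:
        with torch.no_grad():
            # Hypothetical interface: sample a paraphrase of the
            # reference and return the sampled tokens along with the
            # paraphraser's per-step distribution over the vocabulary.
            para_tokens, para_dist = paraphraser.sample_with_dist(ref)
        # Teacher-force the MT model on the sampled paraphrase.
        logits = mt_model(src, para_tokens)      # (batch, seq, vocab)
        log_probs = F.log_softmax(logits, dim=-1)
        # Cross-entropy against the paraphraser's soft distribution
        # (equivalent to KL divergence up to a constant).
        return -(para_dist * log_probs).sum(dim=-1).mean()
    # Standard training: one-hot cross-entropy against the reference.
    logits = mt_model(src, ref)                  # (batch, seq, vocab)
    return F.cross_entropy(logits.transpose(1, 2), ref)
```

In this sketch, the soft targets from the paraphraser let a single reference stand in for many valid translations, which is the data-augmentation effect described in the abstract.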

Bio

Huda Khayrallah is a PhD candidate in Computer Science at Johns Hopkins University, where she is advised by Philipp Koehn. She is part of the Center for Language and Speech Processing (CLSP) and the machine translation group. She works on applied machine learning (ML) for natural language processing (NLP), primarily machine translation. Her work focuses on overcoming deep learning’s sensitivity to the quantity and quality of the training data, including low-resource and domain adaptation settings. In Summer 2019, she was a research intern at Lilt, working on translator-in-the-loop machine translation. She holds an M.S.E. in Computer Science from Johns Hopkins (2017) and a B.A. in Computer Science from UC Berkeley (2015).

Personal home page