John Sturdy DeNero

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2010-161

December 16, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-161.pdf

The goal of a machine translation (MT) system is to automatically translate a document written in some human input language (e.g., Mandarin Chinese) into an equivalent document written in an output language (e.g., English). This task, so simple in its specification and yet so rich in its complexities, has challenged computer science researchers for 60 years. While MT systems are in wide use today, the problem of producing human-quality translations remains unsolved.

Statistical approaches have substantially improved the quality of MT systems by effectively exploiting parallel corpora: large collections of documents that have been translated by people, and therefore naturally occur in both the input and output languages. Broadly characterized, statistical MT systems translate an input document by matching fragments of its contents to examples in a parallel corpus, and then stitching together the translations of those fragments into a coherent document in an output language.

The central challenge of this approach is to distill example translations into reusable parts: fragments of sentences that we know how to translate robustly and are likely to recur. Individual words are certainly common enough to recur, but they often cannot be translated correctly in isolation. At the other extreme, whole sentences can be translated without much context, but rarely repeat, and so cannot be recycled to build new translations.

This thesis focuses on acquiring translations of phrases: contiguous sequences of a few words that encapsulate enough context to be translatable, but recur frequently in large corpora. We automatically identify phrase-level translations that are contained within human-translated sentences by partitioning each sentence into phrases and aligning phrases across languages. This alignment-based approach to acquiring phrasal translations gives rise to statistical models of phrase alignment.

The cumulative result of this thesis is to establish model-based phrase alignment as the most effective approach to acquiring phrasal translations. Only phrase alignment models are able to incorporate statistical signals about multi-word constructions into alignment decisions and score coherent phrasal analyses of full sentence pairs. As a result, phrase alignment models outperform classical word-level models in both generative and discriminative settings. This result is fundamental to the field: the models proposed in this thesis address a general, language-independent alignment problem that arises in all state-of-the-art statistical machine translation systems in use today.

Advisor: Daniel Klein


BibTeX citation:

@phdthesis{DeNero:EECS-2010-161,
    Author= {DeNero, John Sturdy},
    Title= {Phrase Alignment Models for Statistical Machine Translation},
    School= {EECS Department, University of California, Berkeley},
    Year= {2010},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-161.html},
    Number= {UCB/EECS-2010-161},
    Abstract= {The goal of a machine translation (MT) system is to automatically translate a document written in some human input language (e.g., Mandarin Chinese) into an equivalent document written in an output language (e.g., English). This task---so simple in its specification, and yet so rich in its complexities---has challenged computer science researchers for 60 years. While MT systems are in wide use today, the problem of producing human-quality translations remains unsolved.

Statistical approaches have substantially improved the quality of MT systems by effectively exploiting parallel corpora: large collections of documents that have been translated by people, and therefore naturally occur in both the input and output languages. Broadly characterized, statistical MT systems translate an input document by matching fragments of its contents to examples in a parallel corpus, and then stitching together the translations of those fragments into a coherent document in an output language.

The central challenge of this approach is to distill example translations into reusable parts: fragments of sentences that we know how to translate robustly and are likely to recur. Individual words are certainly common enough to recur, but they often cannot be translated correctly in isolation. At the other extreme, whole sentences can be translated without much context, but rarely repeat, and so cannot be recycled to build new translations.

This thesis focuses on acquiring translations of phrases: contiguous sequences of a few words that encapsulate enough context to be translatable, but recur frequently in large corpora. We automatically identify phrase-level translations that are contained within human-translated sentences by partitioning each sentence into phrases and aligning phrases across languages. This alignment-based approach to acquiring phrasal translations gives rise to statistical models of phrase alignment.

The cumulative result of this thesis is to establish model-based phrase alignment as the most effective approach to acquiring phrasal translations. Only phrase alignment models are able to incorporate statistical signals about multi-word constructions into alignment decisions and score coherent phrasal analyses of full sentence pairs. As a result, phrase alignment models outperform classical word-level models in both generative and discriminative settings. This result is fundamental to the field: the models proposed in this thesis address a general, language-independent alignment problem that arises in all state-of-the-art statistical machine translation systems in use today.},
}

EndNote citation:

%0 Thesis
%A DeNero, John Sturdy
%T Phrase Alignment Models for Statistical Machine Translation
%I EECS Department, University of California, Berkeley
%D 2010
%8 December 16
%@ UCB/EECS-2010-161
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-161.html
%F DeNero:EECS-2010-161