We develop state-of-the-art phrase-based statistical machine translation (SMT) systems that translate text or speech from one language into another. An SMT system consists of two core components: the translation model and the language model.
The translation model is trained on sentence-aligned human translations, such as those available for the speeches of the European Parliament.
In the training process the system analyzes existing translations to learn translation patterns. Statistics about different translations for words and word sequences are collected in order to estimate probabilities. The pairs of source and target language phrases together with their probabilities are stored in a phrase table that constitutes the translation model.
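The estimation step described above can be illustrated with a minimal sketch. It assumes the phrase pairs have already been extracted from word-aligned sentences (the extraction itself is not shown) and estimates p(target | source) by simple relative frequency. The function name `build_phrase_table` and the German-English toy pairs are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_phrase_table(phrase_pairs):
    """Estimate p(target | source) by relative frequency of phrase pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        # Share of all occurrences of `src` that were translated as `tgt`.
        table[src][tgt] = count / source_counts[src]
    return dict(table)


# Hypothetical phrase pairs, assumed to come from an earlier extraction step.
extracted = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the building"),
    ("haus", "house"),
]
phrase_table = build_phrase_table(extracted)
```

Here `phrase_table["das haus"]` maps each observed translation to its estimated probability. Real phrase tables usually store several scores per phrase pair, which this sketch omits.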
The second core component of a machine translation system is the language model. This model is trained on a large amount of monolingual text. The system analyzes well-formed text in order to learn statistics on words, as well as the contexts and the order in which they appear.
In order to translate new sentences, the decoder performs a search to find the most probable translation for each source sentence according to the translation model and the language model.
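As an illustration of the kind of statistics a language model collects, the following sketch trains a bigram model with add-one smoothing on a tiny corpus. The bigram order, the smoothing method, and the names `train_bigram_lm` and `log_prob` are assumptions made for this example. Production systems typically use higher-order n-grams and more refined smoothing.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams and their histories in whitespace-tokenized text."""
    histories = Counter()
    bigrams = Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
    return total


# Toy monolingual corpus, for illustration only.
corpus = ["the house is small", "the house is big", "a house is small"]
model = train_bigram_lm(corpus)
fluent = log_prob("the house is small", model)
scrambled = log_prob("house the small is", model)
```

The well-ordered sentence receives a higher score than the scrambled one because its bigrams were observed in the training text.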
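The search can be sketched with a toy decoder. This is a simplified illustration under stated assumptions, not a description of a production decoder: it enumerates every monotone segmentation exhaustively (no reordering, no beam search), adds the two log-probabilities without feature weights, and uses hand-written toy models. The names `PHRASE_TABLE`, `BIGRAM_LM`, `UNSEEN`, and all probability values are hypothetical.

```python
import math
from itertools import product

# Hypothetical toy translation model: p(target | source).
PHRASE_TABLE = {
    "das": {"the": 0.7, "that": 0.3},
    "haus": {"house": 0.8, "home": 0.2},
    "das haus": {"the house": 0.9, "the home": 0.1},
}
# Hypothetical toy bigram language model; unseen bigrams get a small constant.
BIGRAM_LM = {
    ("<s>", "the"): 0.5, ("the", "house"): 0.4, ("house", "</s>"): 0.5,
    ("<s>", "that"): 0.2, ("the", "home"): 0.1, ("home", "</s>"): 0.3,
}
UNSEEN = 0.01


def segmentations(words):
    """Yield every split of `words` into phrases known to the phrase table."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        if head in PHRASE_TABLE:
            for rest in segmentations(words[i:]):
                yield [head] + rest


def lm_log_prob(target):
    tokens = ["<s>"] + target.split() + ["</s>"]
    return sum(math.log(BIGRAM_LM.get(bg, UNSEEN)) for bg in zip(tokens, tokens[1:]))


def decode(source):
    """Return the target string with the highest combined model score."""
    best, best_score = None, -math.inf
    for seg in segmentations(source.split()):
        for choice in product(*(PHRASE_TABLE[p].items() for p in seg)):
            target = " ".join(t for t, _ in choice)
            score = sum(math.log(p) for _, p in choice) + lm_log_prob(target)
            if score > best_score:
                best, best_score = target, score
    return best


translation = decode("das haus")
```

Exhaustive enumeration is only feasible for very short inputs. Practical decoders prune the search space, for example with beam search.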
Research Topics and Challenges
- Word Order: When translating from one language to another, the word order of source and target language sentences often differs. Our SMT system includes an additional reordering model, based on the part of speech of the source language words, to handle these differences.
- Domain Adaptation: The closer the training data is to the target task, the better the translation quality. However, there might not always be enough data to build a complete SMT system on in-domain data alone. Our system is able to exploit a small amount of in-domain data to adapt a general-purpose SMT system to a given domain.
- Data Acquisition / Comparable Corpora: In order to support domain adaptation techniques, we develop methods to acquire in-domain data for new domains, e.g. from multilingual web sites. For domains where no real parallel data is available, comparable data, such as news reports by newspapers in different countries about the same event, can be used as a substitute, but this requires specialized algorithms.
- Grammaticality: In spite of the language model, SMT often produces ungrammatical output. External language tools such as parsers and taggers can provide additional linguistic information that can be exploited to enforce, for example, agreement between subject and verb or a correct syntactic structure of the sentence.
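To make the domain adaptation item above more concrete, the sketch below shows linear interpolation of a general-purpose and an in-domain translation distribution. This is one widely used adaptation technique, shown here as an assumption for illustration and not necessarily the method used in our system. The function name `interpolate`, the weight of 0.7, and the toy probabilities are hypothetical.

```python
def interpolate(general, in_domain, weight=0.7):
    """Mix two translation distributions for one source phrase.

    `weight` is the share given to the in-domain model (assumed value).
    """
    targets = set(general) | set(in_domain)
    return {
        t: weight * in_domain.get(t, 0.0) + (1 - weight) * general.get(t, 0.0)
        for t in targets
    }


# Hypothetical distributions for the German word "Bank".
general = {"bank": 0.8, "bench": 0.2}    # general-purpose model
in_domain = {"bank": 0.1, "bench": 0.9}  # e.g. a furniture domain
adapted = interpolate(general, in_domain)
```

With these toy values the adapted model prefers "bench", while the general model alone would prefer "bank". In practice the interpolation weight would be tuned on held-out in-domain data.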