Analysing the Effect of Out-of-Domain Data on SMT Systems

Authors

  • Barry Haddow
  • Philipp Koehn
Abstract

In statistical machine translation (SMT), performance is known to decline when the training data comes from a different domain than the test data. Nevertheless, it is frequently necessary to supplement scarce in-domain training data with out-of-domain data. In this paper, we first try to relate the effect of the out-of-domain data on translation performance to measures of corpus similarity, then we separately analyse the effect of adding the out-of-domain data at different parts of the training pipeline (alignment, phrase extraction, and phrase scoring). Through experiments in 2 domains and 8 language pairs, we show that the out-of-domain data improves coverage and translation of rare words, but may degrade the translation quality for more common words.
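
The coverage effect mentioned in the abstract can be illustrated with a small sketch. It is not the authors' method or one of the corpus similarity measures used in the paper (the abstract does not name them). It only assumes a simple out-of-vocabulary (OOV) rate as a stand-in for coverage, with whitespace tokenisation and toy sentences invented for illustration; the function names `vocabulary` and `oov_rate` are hypothetical.

```python
def vocabulary(sentences):
    """Collect the set of lowercased whitespace tokens in a corpus."""
    return {tok for s in sentences for tok in s.lower().split()}


def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training sentences."""
    vocab = vocabulary(train_sentences)
    test_tokens = [tok for s in test_sentences for tok in s.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy corpora (assumed for illustration only, not data from the paper).
in_domain = ["the patient received the drug", "the drug reduced the fever"]
out_of_domain = ["the parliament approved the budget",
                 "the committee received the report"]
test = ["the committee approved the drug"]

# Coverage with in-domain data only, then with out-of-domain data added.
oov_in = oov_rate(in_domain, test)
oov_both = oov_rate(in_domain + out_of_domain, test)
print(f"OOV rate, in-domain only: {oov_in:.2f}")
print(f"OOV rate, in-domain + out-of-domain: {oov_both:.2f}")
```

In this toy setting, adding the out-of-domain sentences lowers the OOV rate on the test sentence, which mirrors the coverage gain the abstract describes. The sketch says nothing about the possible loss in translation quality for common words, which depends on phrase scoring rather than vocabulary overlap.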

Similar resources

Analysis of Temperature Effect on a Crystalline Silicon Photovoltaic Module Performance

In this paper, the effect of the cell-temperature on the performance of photovoltaic (PV) module is evaluated. The evaluation is based on a mathematical module (single diode equivalent circuit) and practically based on solar module tester (SMT). Solara®130W PV crystalline silicon module was used in this simulation. The SMT is able to supply a constant irradiance level (1000W/m2) or any other de...

Analysing the Stewardship Function in Botswana’s Health System: Reflecting on the Past, Looking to the Future

Background In many parts of the world, ongoing deficiencies in health systems compromise the delivery of health interventions. The World Health Organization (WHO) identified four functions that health systems need to perform to achieve their goals: Efforts to strengthen health systems focus on the way these functions are carried out. While a number of studies on health systems functions have be...

A Hybrid Machine Translation System Based on a Monotone Decoder

In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...

Experiments on Domain Adaptation for English-Hindi SMT

Statistical Machine Translation (SMT) systems are usually trained on large amounts of bilingual text and monolingual target language text. If a significant amount of out-of-domain data is added to the training data, the quality of translation can drop. On the other hand, training an SMT system on a small amount of training material for given in-domain data leads to narrow lexical coverage which ...

A Simplification-Translation-Restoration Framework for Cross-Domain SMT Applications

Integration of domain specific knowledge into a general purpose statistical machine translation (SMT) system poses challenges due to insufficient bilingual corpora. In this paper we propose a simplification-translation-restoration (STR) framework for domain adaptation in SMT by simplifying domain specific segments of a text. For an in-domain text, we identify the critical segments and modify th...

Publication date: 2012