Ways to Improve the Quality of English-Czech Machine Translation

Author

  • Martin Popel
Abstract

This thesis describes English-Czech Machine Translation as it is implemented in the TectoMT system. The transfer uses deep-syntactic dependency (tectogrammatical) trees and exploits the annotation scheme of the Prague Dependency Treebank. The primary goal of the thesis is to improve the translation quality using both rule-based and statistical methods. First, we present a manual annotation of translation errors in 250 sentences and the subsequent identification of frequent errors, their types and their sources. The main part of the thesis describes the design and implementation of modifications in the three translation phases: analysis, transfer and synthesis. The most prominent modification is a novel approach to the transfer phase based on Hidden Markov Tree Models (a tree modification of Hidden Markov Models). The improvements are evaluated in terms of BLEU and NIST scores.
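To illustrate the general idea behind a Hidden Markov Tree Model, the following is a minimal sketch of tree-based Viterbi decoding. It is not the TectoMT implementation: the `Node` class, the `tree_viterbi` function, the `default` smoothing value and all probabilities are hypothetical and chosen only for illustration. The sketch assumes that each source node carries a set of candidate target labels with emission scores, and that a transition table scores a child's label given its parent's label.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    source: str          # observed source-side label (hypothetical field)
    candidates: dict     # hidden target label -> emission score (illustrative)
    children: list = field(default_factory=list)


def tree_viterbi(root, transition, default=1e-6):
    """Pick the jointly best target label for every node of a dependency tree.

    transition maps (parent_label, child_label) to a score; pairs that are
    missing get the assumed fallback value `default`.
    """
    best = {}      # (id(node), label) -> best score of the subtree under node
    backptr = {}   # (id(node), label, child index) -> best label for that child

    def upward(node):
        # Bottom-up pass: children are scored before their parent.
        for child in node.children:
            upward(child)
        for state, emission in node.candidates.items():
            score = emission
            for i, child in enumerate(node.children):
                child_score, child_state = max(
                    (transition.get((state, cs), default) * best[(id(child), cs)], cs)
                    for cs in child.candidates
                )
                score *= child_score
                backptr[(id(node), state, i)] = child_state
            best[(id(node), state)] = score

    def downward(node, state, labelling):
        # Top-down pass: follow the back-pointers from the chosen root label.
        labelling.append((node.source, state))
        for i, child in enumerate(node.children):
            downward(child, backptr[(id(node), state, i)], labelling)

    upward(root)
    root_score, root_state = max((best[(id(root), s)], s) for s in root.candidates)
    labelling = []
    downward(root, root_state, labelling)
    return root_score, labelling


# Toy example with made-up numbers: "bank" governing "river".
tree = Node("bank", {"banka": 0.7, "břeh": 0.3}, [Node("river", {"řeka": 1.0})])
transition = {("břeh", "řeka"): 0.5, ("banka", "řeka"): 0.01}
score, labelling = tree_viterbi(tree, transition)
print(labelling)
```

In this toy tree the label "banka" has the higher emission score for "bank" in isolation, but the transition scores make "břeh" the better choice once the child "river" is taken into account, which is the kind of context sensitivity a tree model adds over choosing each node's label independently.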

Similar resources

Improving SMT by Using Parallel Data of a Closely Related Language

The amount of training data in statistical machine translation critically affects translation quality. In this paper, we demonstrate how to increase translation quality for one language pair by introducing parallel data from a closely related language. Specifically, we improve English→Slovak translation using a large Czech-English parallel corpus and a shallow MT system for Czech→Slovak transl...


Utilization of Anaphora in Machine Translation

The majority of present machine translation systems do not address the retention of text coherence; they translate only isolated sentences. On the other hand, the authors of anaphora resolvers rarely integrate these tools into more complex scenarios, e.g. the task of machine translation. We propose ways in which machine translation systems can utilize the knowledge of anaphoric relations both in the...


Selecting Data for English-to-Czech Machine Translation

We provide a few insights on data selection for machine translation. We evaluate the quality of the new CzEng 1.0, a parallel data source used in WMT12. We describe a simple technique for reducing the out-of-vocabulary rate after phrase extraction. We discuss the benefits of tuning towards multiple reference translations for the English-Czech language pair. We introduce a novel approach to data selecti...


Statistical Machine Translation Between Related and Unrelated Languages

In this paper we describe an attempt to compare how the relatedness of languages can influence the performance of statistical machine translation (SMT). We apply the Moses toolkit to the Czech-English-Russian corpus UMC 0.1 in order to train two translation systems: Russian-Czech and English-Czech. The quality of the translation is evaluated on an independent test set of 1000 sentences parallel in ...


English-to-Czech Factored Machine Translation

This paper describes experiments with English-to-Czech phrase-based machine translation. Additional annotation of input and output tokens (multiple factors) is used to explicitly model morphology. We vary the translation scenario (the setup of multiple factors) and the amount of information in the morphological tags. Experimental results demonstrate significant improvement of translation qualit...


Publication date: 2009