Hierarchical Phrase-Based Statistical Machine Translation System

Author

  • Pushpak Bhattacharyya
Abstract

This thesis presents the fundamentals and concepts behind one of the emerging techniques in statistical machine translation (SMT), hierarchical phrase-based MT, by implementing translation from Hindi to English. The hierarchical model extends phrase-based models by considering subphrases with the aid of a context-free grammar (CFG). Among other models, syntax-based models resemble hierarchical models, but they require a corpus annotated with linguistic phrases such as noun phrases and verb phrases. The hierarchical model overcomes this weakness because it does not require annotated corpora at all. Most Indian languages lack annotated corpora, so hierarchical models can prove handy for translation from Indian languages to English. In terms of real-time implementation and translation quality, the hierarchical model can coexist and even compete with state-of-the-art MT systems. A BLEU score of 0.16 establishes the effectiveness of this approach for Hindi to English translation.

Secondly, we discuss post-editing techniques by implementing them in the translation pipeline. Post-editing has recently emerged as a tool for improving the quality of machine translation. In this thesis, we discuss translation of out-of-vocabulary (OOV) words, transliteration of named entities, and grammar correction. OOV words are words that were not present in the training data but appear in the test data. We deal with them in two ways. First, we check whether the word is a named entity and can therefore be transliterated. Second, if the word is not a named entity, it is sent to the OOV module, which applies a statistical technique, canonical correlation analysis (CCA), to translate the unknown Hindi word.

The third post-editing technique that we discuss is grammar correction, which can be treated as a translation problem from incorrect text to correct text. Grammar correction typically follows one of two approaches: rule-based or statistical. Rule-based approaches handle each error type differently, and no uniform framework seems to be in place. We introduce a novel technique that uses hierarchical phrase-based SMT for grammar correction. SMT systems provide a uniform platform for any sequence transformation task. Over the years, grammar correction data in electronic form has increased dramatically in quality and quantity, making SMT systems feasible for grammar correction. Moreover, better translation models such as hierarchical phrase-based SMT can handle errors as complicated as reordering or insertion, which were previously difficult to deal with. In addition, this SMT-based correction technique is similar in spirit to human correction, because the system extracts grammar rules from the corpus and later uses these rules to translate incorrect sentences into correct ones. We describe how to use Joshua, a hierarchical phrase-based SMT system, for grammar correction. A BLEU score of 0.77 establishes the efficacy of our approach.
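
To illustrate how a hierarchical rule with subphrase gaps can reorder words, here is a minimal Python sketch. It is not the thesis's system and does not use Joshua. The rule table, the romanized Hindi tokens, and the function names are assumptions made for illustration, and the sketch omits rule extraction, scoring, and chart decoding entirely.

```python
import re

# Toy synchronous rules (hypothetical, hand-written): source pattern -> target pattern.
# X1 and X2 are gaps that stand for subphrases translated recursively.
RULES = [
    ("X1 ka X2", "X2 of X1"),  # genitive construction: the two subphrases swap order
    ("ram", "Ram"),
    ("ghar", "house"),
]


def pattern_to_regex(source):
    """Turn a source pattern into a regex; each gap becomes a named group."""
    parts = []
    for tok in source.split():
        if re.fullmatch(r"X\d", tok):
            parts.append(f"(?P<{tok}>.+)")
        else:
            parts.append(re.escape(tok))
    return re.compile(" ".join(parts))


def translate(sentence):
    """Apply the first rule that matches the whole input, recursing into gaps."""
    for source, target in RULES:
        match = pattern_to_regex(source).fullmatch(sentence)
        if match is None:
            continue
        bindings = match.groupdict()
        out = []
        for tok in target.split():
            if tok in bindings:
                out.append(translate(bindings[tok]))  # translate the subphrase
            else:
                out.append(tok)
        return " ".join(out)
    return sentence  # no rule applies: pass the unknown (OOV) text through unchanged


print(translate("ram ka ghar"))
```

In this toy, an input that no rule covers is returned unchanged, which is the point where a real pipeline would hand the word to transliteration or an OOV module. A real hierarchical decoder would instead learn weighted rules from a parallel corpus and search over many competing derivations.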

Similar resources

The RWTH Aachen German-English Machine Translation System for WMT 2014

This paper describes the statistical machine translation (SMT) systems developed at RWTH Aachen University for the German→English translation task of the ACL 2014 Eighth Workshop on Statistical Machine Translation (WMT 2014). Both hierarchical and phrase-based SMT systems are applied employing hierarchical phrase reordering and word class language models. For the phrase-based system, we run dis...

An Open-Source Hierarchical Phrase-Based Translation System

We present an open source translation system that provides a clean-room implementation of the hierarchical phrase-based statistical translation model introduced in (Chiang, 2005) and refined in (Chiang, 2007). To our knowledge this is the first freely available hierarchical phrase-based translation system which implements cube pruning. We introduce extensions to (Chiang, 2007) to take advantage...

A Lexicalized Reordering Model for Hierarchical Phrase-based Translation

Lexicalized reordering model plays a central role in phrase-based statistical machine translation systems. The reordering model specifies the orientation for each phrase and calculates its probability conditioned on the phrase. In this paper, we describe the necessity and the challenge of introducing such a reordering model for hierarchical phrase-based translation. To deal with the challenge, ...

NTT System Description for the WMT2006 Shared Task

We present two translation systems experimented for the shared-task of “Workshop on Statistical Machine Translation,” a phrase-based model and a hierarchical phrase-based model. The former uses a phrasal unit for translation, whereas the latter is conceptualized as a synchronous CFG in which phrases are hierarchically combined using non-terminals. Experiments showed that the hierarchical phraseb...

Statistical Machine Translation Based on Hierarchical Phrase Alignment

This paper describes statistical machine translation improved by applying hierarchical phrase alignment. The hierarchical phrase alignment is a method to align bilingual sentences phrase-by-phrase employing the partial parse results. Based on the hierarchical phrase alignment, a translation model is trained on a chunked corpus by converting hierarchically aligned phrases into a sequence of chun...

NTT statistical machine translation for IWSLT 2006

We present the NTT translation system that is experimented for the evaluation campaign of “International Workshop on Spoken Language Translation (IWSLT).” The system consists of two primary components: a hierarchical phrase-based statistical machine translation system and a reranking system. The former is conceptualized as a synchronous-CFG in which phrases are hierarchically combined using non...

Publication date: 2013