Extracting Parallel Fragments from Comparable Corpora for Data-to-text Generation

نویسندگان

  • Anja Belz
  • Eric Kow
چکیده

Building NLG systems, in particular statistical ones, requires parallel data (paired inputs and outputs) which do not generally occur naturally. In this paper, we investigate the idea of automatically extracting parallel resources for data-to-text generation from comparable corpora obtained from the Web. We describe our comparable corpus of data and texts relating to British hills and the techniques for extracting paired input/output fragments we have developed so far.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction

Discovering parallel data in comparable corpora is a promising approach for overcoming the lack of parallel texts in statistical machine translation and other NLP applications. In this paper we propose an alternative to comparable corpora of texts as resources for extracting parallel data: a multimodal comparable corpus of audio and texts. We present a novel method to detect parallel phrases fr...

متن کامل

Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora

We present a novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora. By analyzing potentially similar sentence pairs using a signal processinginspired approach, we detect which segments of the source sentence are translated into segments in the target sentence, and which are not. This method enables us to extract useful machine translation ...

متن کامل

A Generative Model for Extracting Parallel Fragments from Comparable Documents

Although parallel corpora are essential language resources for many NLP tasks, they are rare or even not available for many language pairs. Instead, comparable corpora are widely available and contain parallel fragments of information that can be used applications like statistical machine translations. In this research, we propose a generative LDA based model for extracting parallel fragments f...

متن کامل

Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora

In this article, we present an automated approach of extracting English-Bengali parallel fragments of text from comparable corpora created using Wikipedia documents. Our approach exploits the multilingualism of Wikipedia. The most important fact is that this approach does not need any domain specific corpus. We have been able to improve the BLEU score of an existing domain specific EnglishBenga...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010