EVBCorpus - A Multi-Layer English-Vietnamese Bilingual Corpus for Studying Tasks in Comparative Linguistics

نویسندگان

  • Quoc Hung Ngo
  • Werner Winiwarter
  • Bartholomäus Wloka
چکیده

Bilingual corpora play an important role as resources not only for machine translation research and development but also for studying tasks in comparative linguistics. Manual annotation of word alignments is of significance to provide a gold-standard for developing and evaluating machine translation models and comparative linguistics tasks. This paper presents research on building an English-Vietnamese parallel corpus, which is constructed for building a Vietnamese-English machine translation system. We describe the specification of collecting data for the corpus, linguistic tagging, bilingual annotation, and the tools specially developed for the manual annotation. An English-Vietnamese bilingual corpus of over 800,000 sentence pairs and 10,000,000 English words as well as Vietnamese words has been collected and aligned at the sentence level, and over 45,000 sentence pairs of this corpus have been aligned at the word level. Moreover, the 45,000 sentence pairs have been tagged using other linguistics tags, including word segmentation for Vietnamese text, chunker and named entity tags.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Sixth International Joint Conference on Natural Language Processing Proceedings of the 11th Workshop on Asian Language Resources

Bilingual corpora play an important role as resources not only for machine translation research and development but also for studying tasks in comparative linguistics. Manual annotation of word alignments is of significance to provide a gold-standard for developing and evaluating machine translation models and comparative linguistics tasks. This paper presents research on building an English-Vi...

متن کامل

POS-Tagger for English-Vietnamese Bilingual Corpus

Corpus-based Natural Language Processing (NLP) tasks for such popular languages as English, French, etc. have been well studied with satisfactory achievements. In contrast, corpus-based NLP tasks for unpopular languages (e.g. Vietnamese) are at a deadlock due to absence of annotated training data for these languages. Furthermore, hand-annotation of even reasonably well-determined features such ...

متن کامل

Building English-Vietnamese Named Entity Corpus with Aligned Bilingual News Articles

Named entity recognition aims to classify words in a document into pre-defined target entity classes. It is now considered to be fundamental for many natural language processing tasks such as information retrieval, machine translation, information extraction and question answering. This paper presents a workflow to build an English-Vietnamese named entity corpus from an aligned bilingual corpus...

متن کامل

Building A Training Corpus For Word Sense Disambiguation In English-To-Vietnamese Machine Translation

The most difficult task in machine translation is the elimination of ambiguity in human languages. A certain word in English as well as Vietnamese often has different meanings which depend on their syntactical position in the sentence and the actual context. In order to solve this ambiguation, formerly, people used to resort to many hand-coded rules. Nevertheless, manually building these rules ...

متن کامل

Automatic Searching for English-Vietnamese Documents on the Internet

Bilingual corpora together with machine learning technology can be used to solve problems in natural language processing. In addition, bilingual corpora are useful for mapping linguistic tags of less popular languages, such as Vietnamese, and for studying comparative linguistics. However, Vietnamese corpora still have some shortcomings, especially English–Vietnamese bilingual corpora. This pape...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013