Enriching Word Alignment with Linguistic Tags

نویسندگان

  • Xuansong Li
  • Niyu Ge
  • Stephen Grimes
  • Stephanie Strassel
  • Kazuaki Maeda
چکیده

Incorporating linguistic knowledge into word alignment is becoming increasingly important for current approaches in statistical machine translation research. To improve automatic word alignment and ultimately machine translation quality, an annotation framework is jointly proposed by LDC (Linguistic Data Consortium) and IBM. The framework enriches word alignment corpora to capture contextual, syntactic and language-specific features by introducing linguistic tags to the alignment annotation. Two annotation schemes constitute the framework: alignment and tagging. The alignment scheme aims to identify minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags are designed in the tagging scheme to describe these translation units and relations. The framework produces a solid ground-level alignment base upon which larger translation unit alignment can be automatically induced. To test the soundness of this work, evaluation is performed on a pilot annotation, resulting in interand intraannotator agreement of above 90%. To date LDC has produced manual word alignment and tagging on 32,823 Chinese-English sentences following this framework.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Enriching the Swedish Sign Language Corpus with Part of Speech Tags Using Joint Bayesian Word Alignment and Annotation Transfer

We have used a novel Bayesian model of joint word alignment and part of speech (PoS) annotation transfer to enrich the Swedish Sign Language Corpus with PoS tags. The annotations were then handcorrected in order to both improve annotation quality for the corpus, and allow the empirical evaluation presented herein.

متن کامل

O. Scrivner, T. Gilmanov SWIFT ALIGNER: A TOOL FOR THE VISUALIZATION AND CORRECTION OF WORD ALIGNMENT AND FOR CROSS LANGUAGE TRANSFER

It is well known that parallel corpora are valuable linguistic resources. One of the benefits of such corpora is that they allow for the building an annotated corpus for resource-poor languages via crosslanguage transfer. That is, given accurate alignment between a word from a source language and its equivalent in a target language, some linguistic information, such as part-of-speech tags or sy...

متن کامل

From keystrokes to annotated process data: Enriching the output of Inputlog with linguistic information

Keystroke logging tools are a valuable aid to monitor written language production. These tools record all keystrokes, including backspaces and deletions together with timing information. In this paper we report on an extension to the keystroke logging program Inputlog in which we aggregate the logged process data from the keystroke (character) level to the word level. The logged process data ar...

متن کامل

A Hybrid Word Alignment Approach to Improve Translation Lexicons with Compound Words and Idiomatic Expressions

In this paper, we present a hybrid approach to align single words, compound words and idiomatic expressions from bilingual parallel corpora. The objective is to develop, improve and maintain automatically translation lexicons. This approach combines linguistic and statistical information in order to improve word alignment results. The linguistic improvements taken into account refer to the use ...

متن کامل

Deriving Consensus for Multi-Parallel Corpora: an English Bible Study

What can you do with multiple noisy versions of the same text? We present a method which generates a single consensus between multi-parallel corpora. By maximizing a function of linguistic features between word pairs, we jointly learn a single corpus-wide multiway alignment: a consensus between 27 versions of the English Bible. We additionally produce English paraphrases, word-level distributio...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010