Parallel Corpora based Translation Resources Extraction

Authors

  • Alberto Simões
  • José João Almeida
Abstract

This paper describes NATools, a toolkit to process, analyze and extract translation resources from parallel corpora. It includes a sentence aligner, a probabilistic translation dictionary extractor, a word aligner, a corpus server, a set of tools to query corpora and dictionaries, and a set of tools to extract bilingual resources.

Similar resources

Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three kinds of resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...

Bilingual Terminology Extraction based on Translation Patterns

Parallel corpora are rich sources of translation resources. This document presents a methodology for the extraction of bilingual nominals (terminology candidates) from parallel corpora, using translation patterns. The patterns proposed in this work specify the order changes that occur during translation and that are intrinsic to the syntaxes of the languages involved. These patterns are described in a...

Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi-)Parallel Translation Equivalents

In this paper we present and evaluate three approaches to measure comparability of documents in non-parallel corpora. We develop a task-oriented definition of comparability, based on the performance of automatic extraction of translation equivalents from the documents aligned by the proposed metrics, which formalises intuitive definitions of comparability for machine translation research. We de...

Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction

Discovering parallel data in comparable corpora is a promising approach for overcoming the lack of parallel texts in statistical machine translation and other NLP applications. In this paper we propose an alternative to comparable corpora of texts as resources for extracting parallel data: a multimodal comparable corpus of audio and texts. We present a novel method to detect parallel phrases fr...

ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extract...

Journal title:
  • Procesamiento del Lenguaje Natural

Volume 39  Issue 

Pages  -

Publication date 2007