A Fast And Accurate Method For Detecting English-Japanese Parallel Texts

Authors

Kenichi Fukushima

Kenjiro Taura

Takashi Chikayama

Abstract

A parallel corpus is a valuable resource used in various fields of multilingual natural language processing. One of the most significant problems in using parallel corpora is their limited availability. Researchers have investigated approaches to collecting parallel texts from the Web. A basic component of these approaches is an algorithm that judges whether a pair of texts is parallel or not. In this paper, we propose an algorithm that accelerates this task without losing accuracy by preprocessing a bilingual dictionary as well as the collection of texts. This method achieved a throughput of 250,000 pairs/sec on a single CPU, with a best F1 score of 0.960 on the task of detecting 200 Japanese-English translation pairs out of 40,000. The method is applicable to texts of any format and is not specific to HTML documents labeled with URLs. We report details of these preprocessing methods and the fast comparison algorithm. To the best of our knowledge, this is the first reported experiment on extracting Japanese-English parallel texts from a large corpus based solely on linguistic content.
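The abstract does not spell out how the dictionary and texts are preprocessed, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each dictionary entry is given an integer ID, that every text is reduced once to the set of entry IDs its words touch, and that a pair is scored by the Dice coefficient of the two sets. The function names, the pre-tokenized input, and the choice of Dice are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Index = Dict[str, Set[int]]


def build_entry_index(dictionary: Iterable[Tuple[str, str]]) -> Tuple[Index, Index]:
    """Preprocess a bilingual dictionary (hypothetical helper).

    Each (english, japanese) entry receives an integer ID. Two indexes map
    a word on either side to the IDs of the entries it appears in.
    """
    en_index: Index = {}
    ja_index: Index = {}
    for entry_id, (en_word, ja_word) in enumerate(dictionary):
        en_index.setdefault(en_word.lower(), set()).add(entry_id)
        ja_index.setdefault(ja_word, set()).add(entry_id)
    return en_index, ja_index


def to_id_set(tokens: Iterable[str], index: Index) -> FrozenSet[int]:
    """Preprocess one text: reduce its tokens to a set of dictionary entry IDs.

    Tokens missing from the dictionary are ignored. Tokenization itself
    (in particular Japanese word segmentation) is assumed to be done upstream.
    """
    ids: Set[int] = set()
    for token in tokens:
        ids |= index.get(token.lower(), set())
    return frozenset(ids)


def dice(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Dice coefficient of two ID sets; an assumed score, not the paper's."""
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

With this layout, the dictionary lookups happen once per text rather than once per pair, so comparing a pair costs only a set intersection. A pair would then be judged parallel when its score exceeds a threshold tuned on held-out data; the threshold is likewise an assumption and is not taken from the paper.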


Similar resources

Building An MT Dictionary From Parallel Texts Based On Linguistic And Statistical Information

A method for generating a machine translation (MT) dictionary from parallel texts is described. This method utilizes both statistical information and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by a linguistic-based method can be extracted. Over 70% accurate translation...


BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION


Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. This raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...


Mining Parallel Texts from Mixed-Language Web Pages

We propose to mine parallel texts from mixed-language web pages. We define a mixed-language web page as a web page consisting of (at least) two languages. We mined Japanese-English parallel texts from mixed-language web pages. We presented the statistics for extracted parallel texts and conducted machine translation experiments. These statistics and experiments showed that mixed-language web pages ...


Disambiguation of Single Noun Translations Extracted from Bilingual Comparable Corpora

s of papers of four academic societies, namely Japan Architecture Society (JAS), Institute of Electric Engineering (IEE), Institute of Electronics and Communication Engineering (IECE), and Information Processing Society of Japan (IPSJ), published in Japan. Numbers of abstracts of each of these corpora are shown in Table 1. Parts of these bilingual corpora are parallel. The percentages of parall...




Publication date: 2006
