A Fast And Accurate Method For Detecting English-Japanese Parallel Texts
Abstract
Parallel corpora are a valuable resource used in various fields of multilingual natural language processing. One of the most significant problems in using parallel corpora is their limited availability. Researchers have investigated approaches to collecting parallel texts from the Web. A basic component of these approaches is an algorithm that judges whether a pair of texts is parallel or not. In this paper, we propose an algorithm that accelerates this task without losing accuracy by preprocessing a bilingual dictionary as well as the collection of texts. This method achieved a throughput of 250,000 pairs/sec on a single CPU, with a best F1 score of 0.960 for the task of detecting 200 Japanese-English translation pairs out of 40,000. The method is applicable to texts of any format and is not specific to HTML documents labeled with URLs. We report details of these preprocessing methods and the fast comparison algorithm. To the best of our knowledge, this is the first reported experiment on extracting Japanese-English parallel texts from a large corpus based solely on linguistic content.
Related resources
Building An MT Dictionary From Parallel Texts Based On Linguistic And Statistical Information
A method for generating a machine translation (MT) dictionary from parallel texts is described. This method utilizes both statistical information and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by a linguistic-based method can be extracted. Over 70% accurate translation...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. This raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...
Mining Parallel Texts from Mixed-Language Web Pages
We propose to mine parallel texts from mixed-language web pages. We define a mixed-language web page as a web page consisting of (at least) two languages. We mined Japanese-English parallel texts from mixed-language web pages. We presented the statistics for extracted parallel texts and conducted machine translation experiments. These statistics and experiments showed that mixed-language web pages ...
Disambiguation of Single Noun Translations Extracted from Bilingual Comparable Corpora
s of papers of four academic societies, namely Japan Architecture Society (JAS), Institute of Electric Engineering (IEE), Institute of Electronics and Communication Engineering (IECE), and Information Processing Society of Japan (IPSJ), published in Japan. Numbers of abstracts of each of these corpora are shown in Table 1. Parts of these bilingual corpora are parallel. The percentages of parall...
Publication date: 2006