Bilingual Sentence Alignment Based on Punctuation Marks
Abstract
We present a new approach to aligning English and Chinese sentences in parallel corpora based solely on punctuation. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English and German-English, it does not fare as well for parallel corpora that are noisy or written in two distant languages, such as Chinese-English. It is possible to use cognates on top of the length-based approach to increase alignment accuracy. However, cognates do not exist between two distant languages, which limits the applicability of the cognate-based approach. In this paper, we examine the feasibility of using punctuation for high-accuracy sentence alignment. We have experimented with an implementation of the proposed method on the Chinese-English Sinorama Magazine parallel corpus, with satisfactory results. We also demonstrate that the method is applicable to other language pairs, such as English-Japanese, with minimal additional effort.
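The abstract does not say how punctuation is turned into an alignment score, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical mapping `ZH_TO_EN` from full-width Chinese marks to ASCII counterparts and a hypothetical `punct_similarity` function that compares punctuation counts in a candidate English-Chinese sentence pair.

```python
from collections import Counter

# Hypothetical mapping from full-width Chinese punctuation to ASCII marks.
# The correspondences actually used in the paper are not given in the abstract.
ZH_TO_EN = {
    "，": ",",
    "。": ".",
    "？": "?",
    "！": "!",
    "；": ";",
    "：": ":",
    "、": ",",
}
EN_PUNCT = set(",.?!;:")


def punct_profile_en(sentence):
    """Count ASCII punctuation marks in an English sentence."""
    return Counter(ch for ch in sentence if ch in EN_PUNCT)


def punct_profile_zh(sentence):
    """Count Chinese punctuation marks, mapped to their assumed ASCII counterparts."""
    return Counter(ZH_TO_EN[ch] for ch in sentence if ch in ZH_TO_EN)


def punct_similarity(en_sentence, zh_sentence):
    """Overlap of punctuation counts in [0, 1]; 1.0 when neither side has punctuation."""
    en = punct_profile_en(en_sentence)
    zh = punct_profile_zh(zh_sentence)
    total = sum((en | zh).values())   # Counter union keeps the max count per mark
    if total == 0:
        return 1.0
    shared = sum((en & zh).values())  # Counter intersection keeps the min count per mark
    return shared / total
```

In a full aligner, a score of this kind would typically be one term in a dynamic-programming search over candidate sentence pairings; that search is omitted here.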
Similar resources
Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria
We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written ...
A Part-of-Speech-Based Alignment Algorithm
1. Introduction. Real texts provide the alive phenomena, usages, and tendency of language in a particular space and time. This recommends us to do the researches on the corpora. Recently, many researchers further claim that "two languages are more informative than one" (Dagan, 1991). They show that two languages could disambiguate each other (Gale et al., 1992); bilingual corpus cou...
Sentence Alignment of Historical Classics based on Mode Prediction and Term Translation Pairs
Parallel corpora are essential resources for the construction of bilingual term dictionary of historical classics. To obtain large-scale parallel corpora, this paper proposes a sentence alignment method based on mode prediction and term translation pairs. On one hand, the method rebuilds the sentence alignment process according to characteristics of the translation of historical classics, and a...
Chinese-Uyghur Sentence Alignment: An Approach Based on Anchor Sentences
This paper, which builds on previous studies on sentence alignment, introduces a sentence alignment method in which some sentences are used as “anchors” and a two step procedure is applied. In the first step, some lexical information such as proper names, technical terms, numbers and punctuation marks, location information and length information are used to generate anchor sentences that satisf...
An Automatic Punctuation Marks System For Arabic Texts
This work presents a system for Automatic Arabic punctuation marks. Existing approaches for automatic punctuation marks do not provide suitable performance for and do not satisfy user interests in Arabic texts. The importance and rising need to automate the correct insertion of punctuation marks in Arabic texts led to a need of specific analysis of the Arabic language to introduce approaches th...
Publication date: 2003