Approaches to enlarge bilingual corpora of example sentences to more languages in memoriam
Abstract
In a multilingual, richly structured lexical database such as Papillon, examples, citations, definitions and glosses expressed in each language have to be translated into all other languages and stored in the database. Storing can be achieved in a simple and "seamless" way by introducing "auxiliary" lexies and axies for these "free language elements". Translating all these elements into all languages is a necessary contribution to the Papillon project. In the TraCorpEx project, we pursue a more restricted goal: to extend bilingual corpora of parallel utterances to more languages, because some large corpora of this kind have become freely available, and one of them, the "Tanaka corpus" of Japanese-English sentence pairs, has been shown by J. Breen to be useful as a source of examples when consulting JEDict. Hence, they can also be called "corpora of example sentences". The TraCorpEx methodology has three parts. (1) As such corpora almost always contain English, use several MT engines from English to a new language L and an automatic comparison scheme to get a "best suggestion" (or none) for translating each example into L. (2) Produce the "reference" translations in L by using a web-oriented, mutualization-based translation workbench implementing the "Montaigne" architecture to enable cooperative work by (preferably volunteer) translators. (3) If no MT engine is available for E-L, produce UNL graphs for the examples (manually, then interactively) and send them to a UNL-L deconverter. In the future, a "coedition" technique, still at the prototyping stage, could be used to improve the UNL graphs a posteriori and transparently from any language, and to get improved translations in all target languages.
Keywords: Papillon multilingual database, N-N translation of dictionary information, Montaigne architecture, aligned bilingual corpora, example sentences, multilingual sentence memory (MSM).
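Part (1) of the methodology can be illustrated with a minimal sketch. The abstract does not say how the automatic comparison scheme works, so everything below is an assumption made for illustration: the outputs of several MT engines are compared pairwise with a character-level similarity from Python's difflib, the output that agrees most with the others is kept, and no suggestion is returned when agreement falls below a threshold. The names best_suggestion and similarity, the engine labels, and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed metric, not the paper's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_suggestion(candidates: Dict[str, str], threshold: float = 0.6) -> Optional[str]:
    """Return the MT output that agrees most with the other outputs, or None.

    candidates maps a (hypothetical) engine name to its translation of one example.
    """
    texts = [t for t in candidates.values() if t.strip()]
    if len(texts) < 2:
        return None  # nothing to compare against, so no suggestion
    best_text, best_score = None, 0.0
    for i, text in enumerate(texts):
        # Average similarity of this output to every other engine's output.
        scores = [similarity(text, other) for j, other in enumerate(texts) if j != i]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_text, best_score = text, score
    return best_text if best_score >= threshold else None


if __name__ == "__main__":
    outputs = {
        "engine_a": "Le chat dort.",
        "engine_b": "Le chat dort.",
        "engine_c": "Le chat dort ici.",
    }
    print(best_suggestion(outputs))
```

A consensus rule like this favours translations on which independent engines converge; a real comparison scheme could just as well rely on word-level metrics or back-translation, which the abstract leaves open.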
Similar resources
The Impact of Morphology on Persian Dependency Parsing
Data-driven systems can be adapted easily to different languages and domains. Applying this trend to dependency parsing has led to the introduction of data-driven approaches. The only prerequisite for data-driven approaches is the existence of appropriate corpora that contain sentences and their associated dependency trees. Despite obtaining highly accurate results for the dependency parsing task in English langu...
Fast and Accurate Sentence Alignment of Bilingual Corpora
We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence-based methods are generally more accurate but much slower, and usually depend on cognates or a bilingual l...
Disentangling from Babylonian Confusion - Unsupervised Language Identification
This work presents an unsupervised solution to language identification. The method sorts multilingual text corpora, sentence by sentence, into the different languages they contain and makes no assumptions about the number or size of the monolingual fractions. Evaluation on 7-lingual corpora and bilingual corpora shows that the quality of classification is comparable to supervised approache...
Who Is a Bilingual?
The question of who is and who is not a bilingual is more difficult to answer than it first appears. Bilingualism was long regarded as the equal mastery of two languages, a definition that still prevails in certain glossaries of linguistics. However, today's complex world requires a more exact definition and analysis of the competencies that community members require to interact with speakers o...
One model, two languages: training bilingual parsers with harmonized treebanks
We introduce an approach to train lexicalized parsers using bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that can analyze sentences in either of the learned languages, or even sentences that mix both. We test the approach on the Universal Dependency Treebanks, training with MaltParser and MaltOptimizer. The results show that these bilingua...
Publication date: 2002